WO2020182042A1 - Keyword sample determining method, voice recognition method and apparatus, device, and medium - Google Patents

Keyword sample determining method, voice recognition method and apparatus, device, and medium

Info

Publication number
WO2020182042A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
voice
sample
speech
recognition
Prior art date
Application number
PCT/CN2020/077912
Other languages
French (fr)
Chinese (zh)
Inventor
李敬
Original Assignee
广州市网星信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市网星信息技术有限公司 filed Critical 广州市网星信息技术有限公司
Publication of WO2020182042A1 publication Critical patent/WO2020182042A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • The embodiments of this application relate to the field of speech recognition technology, for example, to a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium.
  • KWS technology recognizes keywords carried in speech using neural-network-based approaches. This requires collecting a large amount of audio data containing predefined keywords and non-keywords, with which the parameters of the constructed neural network are trained, validated, and tested, so that the network can accurately recognize the keyword information in the user's voice.
  • In the related art, the keyword training set is obtained by manually recording the corresponding keyword speech to collect a large amount of audio data, which is costly and requires that the recording environment of the collected audio data be consistent with the actual environment in which the predefined keywords occur, so the generation of keyword samples of multiple types is limited.
  • The embodiments of the present application provide a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium, so as to improve the comprehensiveness of keyword sample determination and enhance the accuracy of speech recognition.
  • An embodiment of the application provides a method for determining a keyword sample, and the method includes:
  • acquiring a keyword;
  • acquiring target speech samples including the keyword from an existing speech recognition sample library;
  • determining the keyword speech segment in the target speech sample to obtain a keyword sample.
  • An embodiment of the present application provides a speech recognition method, which includes:
  • acquiring the user's voice instruction;
  • recognizing the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determination method;
  • triggering an operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a keyword sample determining apparatus, which includes:
  • a keyword acquisition module, configured to acquire a keyword;
  • a target speech acquisition module, configured to acquire target speech samples including the keyword from an existing speech recognition sample library;
  • a keyword sample determining module, configured to determine the keyword speech segment in the target speech sample to obtain a keyword sample.
  • An embodiment of the present application provides a speech recognition apparatus, which includes:
  • a voice instruction acquisition module, configured to acquire the user's voice instruction;
  • a keyword recognition module, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus;
  • an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.
  • An embodiment of the present application provides a device, which includes:
  • one or more processors; and
  • a storage apparatus, configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword sample determination method described in this application, or implement the speech recognition method described in this application.
  • An embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the keyword sample determination method described in this application is implemented, or the speech recognition method described in this application is implemented.
  • FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application;
  • FIG. 1B is a schematic diagram of the principle of determining keyword samples in the method provided in Embodiment 1 of this application;
  • FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;
  • FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application;
  • FIG. 2C is a schematic diagram of the waveform of the audio data in a voice sample in the method provided in Embodiment 2 of this application;
  • FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application;
  • FIG. 3B is a schematic diagram of the principle of the speech recognition process in the method provided in Embodiment 3 of this application;
  • FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application;
  • FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application;
  • FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
  • Voice interaction control by recognizing keywords carried in the user's voice is widely used in the field of speech recognition, and the keyword here can be any keyword the user is interested in in daily life. However, publicly available keyword data sets are generally only the keywords released by some companies or institutions for scientific research; they cannot match the keywords of interest in daily life, and it is difficult to find a speech data set for a keyword of interest.
  • Compared with keyword spotting, the training data sets available for general speech recognition are much richer. Therefore, in the embodiments of this application, an existing speech recognition sample library is used to find target speech samples containing the specified keyword, and the corresponding keyword speech segments are cut out of those target speech samples to obtain keyword samples.
  • The keyword recognition model trained on these keyword samples can then recognize the keywords contained in the corresponding user's voice, which improves the accuracy of speech recognition.
  • FIG. 1A is a flowchart of a keyword sample determination method according to Embodiment 1 of this application.
  • This embodiment can be applied to any situation where keyword samples for model training need to be determined.
  • The solution of the embodiment of the present application is applicable to solving the problem that keyword samples are costly to acquire and limited in coverage.
  • The keyword sample determination method provided in this embodiment can be executed by the keyword sample determining apparatus provided in the embodiments of this application; the apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method.
  • The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.
  • In an embodiment, referring to FIG. 1A, the method may include the following steps:
  • S110: Acquire a keyword.
  • The keyword refers to any word, set in advance by the developer according to the voice interaction requirements, that users are interested in in daily life; recognizing the keyword in the user's voice allows the corresponding trigger operation to be executed.
  • In an embodiment, when voice interaction control is performed through keyword spotting, the developer specifies a keyword according to the development requirements of the voice interaction, which indicates that the corresponding trigger operation is achieved through that keyword. The developer then inputs the specified keyword into the device that executes the keyword sample determination method of this embodiment, so that the device obtains the keyword predefined by the developer, the corresponding keyword samples can be automatically generated later, and the keyword recognition model can subsequently be trained.
  • S120: Acquire target voice samples including the keyword from an existing voice recognition sample library.
  • Because speech recognition technology has been studied by developers in many fields for longer than keyword spotting, the speech data contained in its training data sets is much richer. Here, the speech recognition sample library refers to a database, pre-built during the development of speech recognition technology, that stores a large number of user voices from multiple fields, that is, the large-vocabulary sample collections covering multiple scenarios provided by a Large Vocabulary Continuous Speech Recognition (LVCSR) system.
  • For example, the speech recognition sample library in this embodiment may be a speech recognition tool library, such as the speech toolkits provided with speech recognition frameworks such as Kaldi, Sphinx, or HTK.
  • Optionally, when the keyword pre-specified by the developer is obtained, target voice samples including the keyword can be selected, according to the keyword, from the existing speech recognition sample library, that is, from the large-vocabulary sample collection of user voices in multiple scenarios provided by the LVCSR system.
  • Because speech recognition technology is studied and used by developers in many fields and many scenarios, the existing speech recognition sample library includes a large number of user voices in multiple scenarios, so the obtained target voice samples are diverse samples from multiple scenarios, and the number of target voice samples obtained from the existing speech recognition sample library is large enough to later build a training sample set for training the keyword recognition model.
  • Optionally, as shown in FIG. 1B, in this embodiment, acquiring target voice samples including the keyword from an existing voice recognition sample library may include: searching the existing voice recognition sample library for voice samples whose annotation data includes the keyword, and using the found voice samples as the target voice samples.
  • In an embodiment, each voice sample contained in the existing voice recognition sample library can be composed of two parts: audio data and annotation data. The audio data represents sound-signal characteristics of the user's voice in the voice sample, such as frequency, amplitude change, and duration;
  • each piece of audio data can be displayed by recording the sound waveform of the corresponding user's voice, and the annotation data can be the number (identifier) of the sample and the text of the user's voice content.
  • When the specified keyword is obtained, the existing speech recognition sample library can be queried: by traversing every voice sample contained in the library, the annotation data of each speech sample is parsed to determine whether it contains the specified keyword.
  • Take the Kaldi speech recognition framework as an example to illustrate the search process.
  • A large number of public speech recognition sample libraries are provided with the Kaldi framework, such as the Chinese aishell and thchs30 sample libraries and the English wsj and librispeech sample libraries.
  • Such an existing speech recognition sample library contains a large number of speech samples, each composed of audio data and annotation data.
  • An annotation data entry looks like "BAC009S0002W0130 财政金融政策紧随其后而来" ("fiscal and financial policies follow immediately"); here, "BAC009S0002W0130" is the number of the voice sample to which the annotation data belongs and establishes the matching relationship between the annotation data and the voice sample, and the remaining text records the textual content of that voice sample.
  • In an embodiment, if the acquired keyword is "金融" ("finance"), the existing voice recognition sample library is queried, the annotation data of the voice samples contained in it is traversed, and the voice samples whose annotation data includes the keyword "金融" are extracted, such as the voice sample in the example above.
  • The voice samples found in this way are used as the target voice samples.
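  • The search can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical example assuming the annotation data is stored in a Kaldi-style `text` file in which each line is an utterance number followed by the transcript; the file path and keyword are illustrative only.

```python
def find_target_samples(text_path, keyword):
    """Return (utterance_id, transcript) pairs whose annotation data contains the keyword."""
    targets = []
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Kaldi-style annotation data: "<utterance-id> <transcript>"
            utt_id, _, transcript = line.partition(" ")
            if keyword in transcript:
                targets.append((utt_id, transcript))
    return targets

# Illustrative usage: search an aishell-style annotation file for the keyword "金融" (finance)
targets = find_target_samples("data/aishell/train/text", "金融")
```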
  • S130: Determine the keyword voice segment in the target voice sample to obtain a keyword sample.
  • The keyword voice segment is the part of the voice sample that carries only the speech of the specified keyword and no speech of any other content.
  • In an embodiment, after the target voice sample is acquired, this embodiment recognizes the target voice sample with a specific speech recognition technique to obtain a recognition result representing the speech feature information of the target voice sample, determines from the recognition result the speech range in which the keyword contained in the target voice sample lies, and thereby determines the corresponding keyword voice segment in the target voice sample and cuts that segment out of the corresponding speech range.
  • The keyword voice segment then contains only the content and sound feature information of the keyword and no information other than the keyword, so the keyword voice segment is used as the keyword sample in this embodiment.
  • In an embodiment, by traversing every voice sample in the existing voice recognition sample library, a large number of target voice samples whose annotation data includes the specified keyword can be obtained across multiple scenarios, so the number of keyword voice segments determined from the target voice samples is also large enough; keyword samples in multiple scenarios can therefore be obtained, so that the corresponding keyword recognition model can subsequently be trained with keyword samples from multiple scenarios.
  • The technical solution provided in this embodiment acquires target voice samples containing the keyword from an existing voice recognition sample library and cuts the keyword voice segments out of the target voice samples to obtain keyword samples.
  • Because the existing voice recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well.
  • Diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;
  • FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application. This embodiment is based on the technical solution provided in the foregoing embodiment and explains the process of determining the keyword voice segment in the target voice sample.
  • Optionally, as shown in FIG. 2A, this embodiment may include the following steps:
  • S210: Acquire a keyword.
  • S220: Acquire target voice samples including the keyword from an existing voice recognition sample library.
  • S230: Determine the start time point and the end time point of the phonemes of the keyword within the audio data of the target voice sample.
  • A phoneme is the smallest phonetic unit divided according to speech attributes and can be analyzed according to the pronunciation actions in the user's voice; the phonemes in this embodiment may be the initials and finals that make up the speech.
  • In this embodiment, a corresponding number is set in advance for every existing phoneme and stored in a phoneme table, so that the target speech sample can subsequently be recognized according to the number of each phoneme.
  • Since the audio data of the target voice sample is data representing sound-signal characteristics such as the frequency, amplitude change, and duration of the user's voice, that is, speech data lasting a period of time, every word uttered by the user in the audio data is matched to a corresponding start and end time range.
  • The start time point refers to the time point at which the user starts to pronounce the keyword in the audio data of the target voice sample,
  • and the end time point refers to the time point at which the user finishes pronouncing the keyword in the audio data of the target voice sample.
  • In an embodiment, when a target voice sample whose annotation data includes the keyword is obtained, speech recognition is performed on the audio data that constitutes the target voice sample. Since the audio data is sound feature data lasting a period of time and is a quasi-steady-state speech signal, the framing of the audio data is determined before recognition; the speech frame length is generally set to 20 ms to 30 ms, and in this embodiment it is 20 ms. The phonemes contained in the audio data of each speech frame are then recognized.
  • The audio data of the target speech sample is recognized according to the preset phoneme numbers and the speech frame length to obtain a phoneme recognition result, and the range over which the phonemes of the keyword occur in that result, that is, the starting point and ending point of the keyword's phonemes in the phoneme recognition result, is determined.
  • Then, according to the set speech frame length and the number of phoneme frames corresponding to the starting point and ending point in the phoneme recognition result, the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are determined.
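  • As a minimal sketch (not the patent's own implementation), the start and end time points can be read off a frame-level phoneme alignment as follows; the alignment is assumed to be a list with one phoneme number per 20 ms frame, and the phoneme numbers follow the example below (silence = 1, j = 17, in = 23, r = 18, ong = 27).

```python
FRAME_LEN_S = 0.02  # 20 ms per speech frame, as assumed in this embodiment

def keyword_time_span(frame_phonemes, keyword_phonemes, silence_id=1):
    """Locate the keyword's phoneme sequence in a frame-level alignment and return
    (start_time, end_time) in seconds; silence frames inside the keyword are tolerated."""
    n = len(frame_phonemes)
    for start in range(n):
        pos, k = start, 0
        while pos < n and k < len(keyword_phonemes):
            if frame_phonemes[pos] == keyword_phonemes[k]:
                # consume all consecutive frames of the current keyword phoneme
                while pos < n and frame_phonemes[pos] == keyword_phonemes[k]:
                    pos += 1
                k += 1
            elif k > 0 and frame_phonemes[pos] == silence_id:
                pos += 1  # short silence between characters of the keyword
            else:
                break
        if k == len(keyword_phonemes):
            return start * FRAME_LEN_S, pos * FRAME_LEN_S
    return None  # keyword phonemes not found in this alignment

# With phoneme "j" starting at the 63rd frame (index 62), the example below yields (1.24, 1.66).
```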
  • For example, suppose the keyword is "金融" ("finance") and the waveform of the corresponding audio data is as shown in FIG. 2C.
  • The phonemes corresponding to the keyword "金融" are j, in, r, and ong.
  • There may be a short period of silence between the two characters when the user is speaking, so there will be some silence between "金" and "融" in the keyword contained in the audio data.
  • Suppose the preset number of the silence phoneme is "1", the number of j is "17", the number of in is "23", the number of r is "18", the number of ong is "27", and the speech frame length is 20 ms, so that each number in the phoneme recognition result corresponds to one speech frame.
  • In the recognition result, the number 17 of the phoneme "j" corresponding to "金" occupies 4 frames, the number 23 of "in" occupies 7 frames, the number 18 of the phoneme "r" corresponding to "融" occupies 3 frames, and the number 27 of "ong" occupies 6 frames, with 1 silence frame between the two characters.
  • The first frame of the phoneme "j" corresponding to "金" is the 63rd frame of the entire phoneme recognition result, so the start time point in the audio data of the voice sample is 62 × 20 ms = 1.24 s, and the end time point, 21 frames later, is 1.66 s.
  • S240: According to the start time point and the end time point, cut out the audio data between them to obtain the keyword voice segment.
  • In an embodiment, after the start time point and end time point are determined, the audio data segment located between them can be cut out; that is, in the audio data of the target voice sample in the example above, the audio data segment between 1.24 s and 1.66 s is cut out, or equivalently an audio data segment with a duration of 0.42 s starting from 1.24 s is cut out, and this segment is used as the keyword voice segment in this embodiment. At this point the keyword voice segment contains only the speech information of the keyword "金融" ("finance").
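  • A minimal interception sketch, assuming the audio data is stored as a WAV file and using the third-party soundfile package (the patent itself does not name a library); the file name is illustrative.

```python
import soundfile as sf

def cut_keyword_segment(wav_path, start_s, end_s):
    """Read a WAV file and return the samples between start_s and end_s (in seconds)."""
    audio, sample_rate = sf.read(wav_path)
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    return audio[start:end], sample_rate

# Example from the text: cut the 1.24 s .. 1.66 s span out of the target voice sample
segment, sr = cut_keyword_segment("BAC009S0002W0130.wav", 1.24, 1.66)
```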
  • S250: Pad silence data of a preset length before the start time point and after the end time point of the keyword voice segment to obtain the keyword sample.
  • In an embodiment, silence data of a preset length can be padded at the positions before and after the obtained keyword voice segment.
  • The silence data in this embodiment may be data of value "0" with a length of the preset speech frame, so as to obtain an independent keyword sample that is easy to distinguish from other voice samples later.
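  • A corresponding padding sketch: one speech frame (20 ms) of zero-valued samples is prepended and appended to the keyword voice segment, which is one possible reading of the "data 0 of the preset speech frame length" mentioned above; the output file name is illustrative.

```python
import numpy as np
import soundfile as sf

def pad_with_silence(segment, sample_rate, pad_s=0.02):
    """Prepend and append pad_s seconds of zero samples to a keyword voice segment."""
    silence = np.zeros(int(round(pad_s * sample_rate)), dtype=segment.dtype)
    return np.concatenate([silence, segment, silence])

keyword_sample = pad_with_silence(segment, sr)            # `segment`, `sr` from the previous sketch
sf.write("keyword_finance_0001.wav", keyword_sample, sr)  # illustrative output path
```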
  • Take the aishell speech recognition sample library as an example: it contains 178 hours of speech samples from 400 speakers in multiple fields, and a total of 610 target speech samples containing the keyword "金融" ("finance") can be found in it.
  • By performing keyword interception on the 610 target voice samples found with the keyword sample determination method of this example, 610 keyword samples of the keyword "finance" can be obtained, giving a diversified keyword sample set and creating the conditions for the subsequent training of the keyword recognition model.
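  • Combining the sketches above, the found target voice samples could be turned into keyword samples with a loop like the following; the directory layout and the precomputed frame-level alignments are assumptions for illustration.

```python
def build_keyword_samples(targets, alignments, keyword_phoneme_ids, wav_dir, out_dir):
    """For every target sample, cut out the keyword span and save a silence-padded keyword sample.
    `alignments` is assumed to map utterance id -> per-frame phoneme numbers."""
    for idx, (utt_id, _transcript) in enumerate(targets):
        span = keyword_time_span(alignments[utt_id], keyword_phoneme_ids)
        if span is None:
            continue  # alignment did not contain the keyword's phoneme sequence
        start_s, end_s = span
        segment, sr = cut_keyword_segment(f"{wav_dir}/{utt_id}.wav", start_s, end_s)
        sf.write(f"{out_dir}/keyword_{idx:04d}.wav", pad_with_silence(segment, sr), sr)
```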
  • The technical solution provided in this embodiment determines the start time point and the end time point of the keyword's phonemes in the audio data of the target speech sample and cuts the keyword speech segment lying between those time points out of the audio data to obtain keyword samples, ensuring the diversity of the keyword samples. There is no need to generate keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness and accuracy of keyword sample determination.
  • FIG. 3A is a flowchart of a voice recognition method provided in Embodiment 3 of this application.
  • This embodiment can be applied to any situation of recognizing keywords included in a user's voice instruction.
  • The solution of the embodiment of the present application is applicable to simplifying the otherwise cumbersome training process of the keyword recognition model.
  • The speech recognition method provided in this embodiment can be executed by the speech recognition apparatus provided in the embodiments of this application.
  • The apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method.
  • The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.
  • Optionally, as shown in FIG. 3A, this embodiment may include the following steps:
  • S310: Acquire the user's voice instruction.
  • When the user needs an operation to be performed, the user utters a voice carrying the keyword corresponding to that operation; on receiving the voice, the device generates a corresponding voice instruction, and the voice instruction carries the corresponding keyword.
  • In an embodiment, the matching relationships between multiple keywords and different operations are preset according to the application scenario. For example, in a short-video application, different predefined keywords can be mapped to different video effects, and in a live-broadcast application, predefined keywords can be set to present corresponding gifts in the live broadcast room.
  • S320: Recognize the keyword in the voice instruction through the keyword recognition model.
  • The keyword recognition model is trained in advance with keyword samples determined by the keyword sample determination method provided in the embodiments of the present application.
  • In an embodiment, the keyword pre-specified by the user is acquired, every voice sample included in the existing voice recognition sample library is queried to determine whether its annotation data includes the specified keyword, the voice samples whose annotation data includes the specified keyword are used as target voice samples, the start time point and end time point of the keyword's phonemes in the audio data of each target voice sample are determined according to the keyword's phonemes, and the audio data segment between the start time point and the end time point is cut out as the keyword voice segment, so that a large number of keyword samples are obtained.
  • A corresponding keyword sample library is then generated.
  • The keyword sample library contains, for each of the keywords specified by the user, keyword samples from different scenarios and different users that contain only the keyword speech.
  • The large number of keyword samples contained in the keyword sample library can then be used to train the preset keyword recognition model.
  • The keyword recognition results corresponding to the keyword samples are obtained, and the classification loss of the current recognition is computed.
  • If the classification loss exceeds the preset loss threshold, the keyword recognition model is adjusted according to the classification loss, further keyword samples of the same keyword continue to be obtained and fed into the adjusted keyword recognition model for recognition, until the obtained classification loss no longer exceeds the preset loss threshold.
  • Then the keyword samples of the next keyword in the keyword sample library are obtained and training continues in the same way, until the keyword samples of every keyword contained in the keyword sample library have been used for training, and the final keyword recognition model is obtained.
  • The keyword recognition model can then accurately recognize the keywords in any speech.
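  • The training procedure described above can be sketched roughly as follows. This is a hypothetical PyTorch-style loop, not the patent's actual training code: the model, the feature extraction, and the loss threshold value are all assumptions, and the loop follows the text literally by updating on a keyword's samples until the classification loss no longer exceeds the threshold.

```python
import torch.nn.functional as F

def train_keyword_model(model, optimizer, keyword_batches, loss_threshold=0.05):
    """keyword_batches yields (features, labels) batches, grouped keyword by keyword."""
    model.train()
    for features, labels in keyword_batches:           # e.g. log-mel features and keyword ids
        loss = F.cross_entropy(model(features), labels)
        while loss.item() > loss_threshold:            # adjust the model until the loss is low enough
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss = F.cross_entropy(model(features), labels)
    return model
```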
  • When a voice instruction is acquired, it can be input into the pre-trained keyword recognition model, and the keyword recognition model parses the voice instruction to accurately recognize the keyword carried in it, so that the corresponding operation can be performed according to the keyword.
  • S330: Trigger the operation corresponding to the keyword according to the keyword.
  • In an embodiment, after the keyword carried in the voice instruction is recognized, the keyword is analyzed to determine the operation matching it, and the execution of that operation is triggered to achieve the corresponding voice interaction control.
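  • A trivial sketch of the keyword-to-operation matching; the keywords, effect names, and handler functions are hypothetical placeholders for the short-video and live-broadcast scenarios mentioned above.

```python
def apply_video_effect(name):
    print(f"applying video effect: {name}")            # placeholder for the real effect call

def present_gift(name):
    print(f"presenting gift in the live room: {name}")  # placeholder for the real gift call

# Preset matching relationships between keywords and operations (illustrative only)
OPERATIONS = {
    "snow": lambda: apply_video_effect("snow"),
    "rocket": lambda: present_gift("rocket"),
}

def trigger_operation(keyword):
    """Trigger the operation matched to a recognized keyword, if one is configured."""
    action = OPERATIONS.get(keyword)
    if action is not None:
        action()
```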
  • The technical solution provided in this embodiment trains the preset keyword recognition model with the keyword samples determined by the keyword sample determination method described above, so that the keyword recognition model can accurately recognize the keyword carried in a voice instruction and then trigger the execution of the corresponding operation according to the recognized keyword, simplifying the cumbersome collection of keyword samples during model training and reducing the cost of acquiring keyword samples.
  • The keyword recognition model obtained through this keyword sample training recognizes the keywords carried in the corresponding user's voice and improves the accuracy of speech recognition.
  • FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application. As shown in FIG. 4, the apparatus may include:
  • a keyword acquisition module 410, configured to acquire a keyword;
  • a target voice acquisition module 420, configured to acquire target voice samples including the keyword from an existing voice recognition sample library;
  • a keyword sample determining module 430, configured to determine the keyword voice segment in the target voice sample to obtain a keyword sample.
  • The technical solution provided in this embodiment acquires target speech samples containing the keyword from an existing speech recognition sample library and cuts the keyword speech segments out of the target speech samples to obtain keyword samples.
  • Because the existing speech recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well.
  • Diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
  • The keyword sample determining apparatus provided in this embodiment can execute the keyword sample determination method provided in any embodiment of the present application and has corresponding functions and effects.
  • FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application. As shown in FIG. 5, the apparatus may include:
  • a voice instruction acquisition module 510, configured to acquire the user's voice instruction;
  • a keyword recognition module 520, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus provided in the above-mentioned embodiment;
  • an operation trigger module 530, configured to trigger the operation corresponding to the keyword according to the keyword.
  • The technical solution provided in this embodiment trains the preset keyword recognition model with the keyword samples determined by the above keyword sample determining apparatus, so that the keyword recognition model can accurately recognize the keyword carried in a voice instruction and then trigger the execution of the corresponding operation according to the recognized keyword, simplifying the cumbersome collection of keyword samples during model training and reducing the cost of acquiring keyword samples.
  • The keyword recognition model obtained through this keyword sample training recognizes the keywords carried in the corresponding user's voice and improves the accuracy of speech recognition.
  • The speech recognition apparatus provided in this embodiment can execute the speech recognition method provided in any embodiment of this application and has corresponding functions and effects.
  • FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
  • The device includes a processor 60, a storage apparatus 61, and a communication apparatus 62; the number of processors 60 in the device may be one or more.
  • In FIG. 6, one processor 60 is taken as an example; the processor 60, the storage apparatus 61, and the communication apparatus 62 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 6.
  • The device provided in this embodiment can be used to execute the keyword sample determination method or the speech recognition method provided in any of the foregoing embodiments and has corresponding functions and effects.
  • Embodiment 7 of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the keyword sample determination method in any of the foregoing embodiments can be implemented;
  • the method may include: acquiring a keyword; acquiring target voice samples including the keyword from an existing voice recognition sample library; and determining the keyword voice segment in the target voice sample to obtain a keyword sample.
  • Alternatively, the speech recognition method in any of the foregoing embodiments can be implemented; the method may include: acquiring the user's voice instruction; recognizing the keyword in the voice instruction through the keyword recognition model; and triggering the operation corresponding to the keyword according to the keyword.
  • An embodiment of the application provides a storage medium containing computer-executable instructions.
  • The computer-executable instructions are not limited to the method operations described above and can also execute relevant operations in the keyword sample determination method or the speech recognition method provided by any embodiment of the application.
  • This application can be implemented with the help of software and necessary general-purpose hardware, or can be implemented with hardware.
  • This application can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes at least one instruction to enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present application.
  • The units and modules included in the above apparatus embodiments are only divided according to functional logic and are not limited to this division, as long as the corresponding functions can be realized;
  • the names of the functional units are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Disclosed are a keyword sample determining method, a voice recognition method and apparatus, a device, and a medium. The keyword sample determining method comprises: obtaining a keyword; obtaining, from an existing voice recognition sample library, a target voice sample that comprises the keyword; and determining a keyword voice segment in the target voice sample so as to obtain a keyword sample.

Description

Keyword sample determination method, voice recognition method, apparatus, device, and medium

This application claims priority to the Chinese patent application with application number 201910189413.1, filed with the Chinese Patent Office on March 13, 2019, the entire content of which is incorporated herein by reference.

Technical Field

The embodiments of this application relate to the field of speech recognition technology, for example, to a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium.

Background

With the increasing number of smart speakers on the market, related technologies in the field of speech recognition have been greatly developed and applied. Key Word Spotting (KWS) technology in speech recognition, as the basis of voice interaction control, has also been widely used.

KWS technology recognizes keywords carried in speech using neural-network-based approaches. This requires collecting a large amount of audio data containing predefined keywords and non-keywords, with which the parameters of the constructed neural network are trained, validated, and tested, so that the network can accurately recognize the keyword information in the user's voice.

In the related art, the keyword training set is obtained by manually recording the corresponding keyword speech to collect a large amount of audio data, which is costly and requires that the recording environment of the collected audio data be consistent with the actual environment in which the predefined keywords occur, so the generation of keyword samples of multiple types is limited.
Summary of the Invention

The embodiments of the present application provide a keyword sample determination method, a speech recognition method, an apparatus, a device, and a medium, so as to improve the comprehensiveness of keyword sample determination and enhance the accuracy of speech recognition.

An embodiment of the application provides a method for determining a keyword sample, and the method includes:

acquiring a keyword;

acquiring target speech samples including the keyword from an existing speech recognition sample library;

determining the keyword speech segment in the target speech sample to obtain a keyword sample.

An embodiment of the present application provides a speech recognition method, which includes:

acquiring the user's voice instruction;

recognizing the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determination method;

triggering an operation corresponding to the keyword according to the keyword.

An embodiment of the present application provides a keyword sample determining apparatus, which includes:

a keyword acquisition module, configured to acquire a keyword;

a target speech acquisition module, configured to acquire target speech samples including the keyword from an existing speech recognition sample library;

a keyword sample determining module, configured to determine the keyword speech segment in the target speech sample to obtain a keyword sample.

An embodiment of the present application provides a speech recognition apparatus, which includes:

a voice instruction acquisition module, configured to acquire the user's voice instruction;

a keyword recognition module, configured to recognize the keyword in the voice instruction through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus;

an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.

An embodiment of the present application provides a device, which includes:

one or more processors; and

a storage apparatus, configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the keyword sample determination method described in this application, or implement the speech recognition method described in this application.

An embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the keyword sample determination method described in this application is implemented, or the speech recognition method described in this application is implemented.
Brief Description of the Drawings

FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application;

FIG. 1B is a schematic diagram of the principle of determining keyword samples in the method provided in Embodiment 1 of this application;

FIG. 2A is a flowchart of a keyword sample determination method provided in Embodiment 2 of this application;

FIG. 2B is a schematic diagram of the principle of the keyword sample determination process provided in Embodiment 2 of this application;

FIG. 2C is a schematic diagram of the waveform of the audio data in a voice sample in the method provided in Embodiment 2 of this application;

FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application;

FIG. 3B is a schematic diagram of the principle of the speech recognition process in the method provided in Embodiment 3 of this application;

FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application;

FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application;

FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application.
Detailed Description

The application is described below with reference to the drawings and embodiments. The drawings show only a part, not all, of the structures related to this application. In addition, where there is no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

Voice interaction control by recognizing keywords carried in the user's voice is widely used in the field of speech recognition, and the keyword here can be any keyword the user is interested in in daily life. However, publicly available keyword data sets are generally only the keywords released by some companies or institutions for scientific research; they cannot match the keywords of interest in daily life, and it is difficult to find a speech data set for a keyword of interest. Compared with keyword spotting, the training data sets available for general speech recognition are much richer. Therefore, in the embodiments of this application, an existing speech recognition sample library is used to find target speech samples containing the specified keyword, and the corresponding keyword speech segments are cut out of those target speech samples to obtain keyword samples. There is no need to determine keyword samples by recording the keyword voices of multiple users in many different real scenarios, which reduces the acquisition cost of keyword samples, improves the comprehensiveness of keyword sample determination, and effectively reduces the workload of determining keyword samples. The keyword recognition model trained on these keyword samples can recognize the keywords contained in the corresponding user's voice, which improves the accuracy of speech recognition.
Embodiment 1

FIG. 1A is a flowchart of a keyword sample determination method provided in Embodiment 1 of this application. This embodiment can be applied to any situation where keyword samples for model training need to be determined. The solution of the embodiment of the present application is applicable to solving the problem that keyword samples are costly to acquire and limited in coverage. The keyword sample determination method provided in this embodiment can be executed by the keyword sample determining apparatus provided in the embodiments of this application; the apparatus can be implemented by software and/or hardware and is integrated in the device that executes the method. The device can be any kind of smart terminal device, such as a laptop, tablet, or desktop computer.

In an embodiment, referring to FIG. 1A, the method may include the following steps:

S110: Acquire a keyword.

The keyword refers to any word, set in advance by the developer according to the voice interaction requirements, that users are interested in in daily life; recognizing the keyword in the user's voice allows the corresponding trigger operation to be executed.

In an embodiment, when voice interaction control is performed through keyword spotting, the developer specifies a keyword according to the development requirements of the voice interaction, which indicates that the corresponding trigger operation is achieved through that keyword. The developer then inputs the specified keyword into the device that executes the keyword sample determination method of this embodiment, so that the device obtains the keyword predefined by the developer, the corresponding keyword samples can be automatically generated later, and the keyword recognition model can subsequently be trained.

S120: Acquire target voice samples including the keyword from an existing voice recognition sample library.

Because speech recognition technology has been studied by developers in many fields for longer than keyword spotting, the speech data contained in its training data sets is much richer. Here, the speech recognition sample library refers to a database, pre-built during the development of speech recognition technology, that stores a large number of user voices from multiple fields, that is, the large-vocabulary sample collections covering multiple scenarios provided by a Large Vocabulary Continuous Speech Recognition (LVCSR) system. For example, the speech recognition sample library in this embodiment may be a speech recognition tool library, such as the speech toolkits provided with speech recognition frameworks such as Kaldi, Sphinx, or HTK.

Optionally, when the keyword pre-specified by the developer is obtained, target voice samples including the keyword can be selected, according to the keyword, from the existing speech recognition sample library, that is, from the large-vocabulary sample collection of user voices in multiple scenarios provided by the LVCSR system. Because speech recognition technology is studied and used by developers in many fields and many scenarios, the existing speech recognition sample library includes a large number of user voices in multiple scenarios, so the obtained target voice samples are diverse samples from multiple scenarios, and the number of target voice samples obtained from the existing speech recognition sample library is large enough to later build a training sample set for training the keyword recognition model.

Optionally, as shown in FIG. 1B, in this embodiment, acquiring target voice samples including the keyword from an existing voice recognition sample library may include: searching the existing voice recognition sample library for voice samples whose annotation data includes the keyword, and using the found voice samples as the target voice samples.

In an embodiment, each voice sample contained in the existing voice recognition sample library can be composed of two parts: audio data and annotation data. The audio data represents sound-signal characteristics of the user's voice in the voice sample, such as frequency, amplitude change, and duration, and each piece of audio data can be displayed by recording the sound waveform of the corresponding user's voice; the annotation data can be the number (identifier) of the sample and the text of the user's voice content. When the specified keyword is obtained, the existing speech recognition sample library can be queried: by traversing every voice sample contained in the library, the annotation data of each speech sample is parsed to determine whether it contains the specified keyword, so that voice samples whose annotation data includes the specified keyword are found, voice samples whose annotation data does not include the specified keyword are ignored, and the found voice samples are used as the target voice samples for subsequent keyword analysis.

Take the Kaldi speech recognition framework as an example to illustrate the search process. A large number of public speech recognition sample libraries are provided with the Kaldi framework, such as the Chinese aishell and thchs30 sample libraries and the English wsj and librispeech sample libraries. Such an existing speech recognition sample library contains a large number of speech samples, each composed of audio data and annotation data; an annotation data entry looks like "BAC009S0002W0130 财政金融政策紧随其后而来" ("fiscal and financial policies follow immediately"), where "BAC009S0002W0130" is the number of the voice sample to which the annotation data belongs and establishes the matching relationship between the annotation data and the voice sample, and the remaining text records the textual content of that voice sample. In an embodiment, if the acquired keyword is "金融" ("finance"), the existing voice recognition sample library is queried, the annotation data of the voice samples contained in it is traversed, and the voice samples whose annotation data includes the keyword "金融" are extracted, such as the voice sample in the example above; the found voice samples are used as the target voice samples. In this way, a large number of target voice samples containing the keyword "金融" in multiple scenarios can be obtained from the public speech recognition sample libraries provided with the Kaldi framework, and the target voice samples are subsequently processed to obtain the corresponding keyword speech in multiple scenarios.
S130: Determine the keyword voice segment in the target voice sample to obtain a keyword sample.

The keyword voice segment is the part of the voice sample that carries only the speech of the specified keyword and no speech of any other content.

In an embodiment, after the target voice sample is acquired, this embodiment recognizes the target voice sample with a specific speech recognition technique to obtain a recognition result representing the speech feature information of the target voice sample, determines from the recognition result the speech range in which the keyword contained in the target voice sample lies, and thereby determines the corresponding keyword voice segment in the target voice sample and cuts that segment out of the corresponding speech range. The keyword voice segment then contains only the content and sound feature information of the keyword and no information other than the keyword, so the keyword voice segment is used as the keyword sample in this embodiment.

In an embodiment, by traversing every voice sample in the existing voice recognition sample library, a large number of target voice samples whose annotation data includes the specified keyword can be obtained across multiple scenarios, so the number of keyword voice segments determined from the target voice samples is also large enough; keyword samples in multiple scenarios can therefore be obtained, so that the corresponding keyword recognition model can subsequently be trained with keyword samples from multiple scenarios.

The technical solution provided in this embodiment acquires target voice samples containing the keyword from an existing voice recognition sample library and cuts the keyword voice segments out of the target voice samples to obtain keyword samples. Because the existing voice recognition sample library contains a large number of voice samples from multiple types of users and multiple scenarios, the acquired target voice samples containing the keyword also cover multiple voice scenarios, so the extracted keyword voice segments cover multiple scenario types as well; diversified keyword samples can thus be obtained without generating keyword samples by repeatedly recording the keyword voices of multiple users in multiple scenarios, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
Embodiment 2
FIG. 2A is a flowchart of a keyword sample determining method provided in Embodiment 2 of this application, and FIG. 2B is a schematic diagram of the principle of the keyword sample determining process provided in Embodiment 2. This embodiment is based on the technical solution provided by the foregoing embodiment and explains the process of determining the keyword speech segment in the target speech sample.
Optionally, as shown in FIG. 2A, this embodiment may include the following steps:
S210: Acquire a keyword.
S220: Acquire a target speech sample including the keyword from an existing speech recognition sample library.
S230: Determine a start time point and an end time point of the phonemes of the keyword within the phonemes of the audio data of the target speech sample.
Here, a phoneme is the smallest phonetic unit divided according to phonetic properties and can be analyzed from the articulatory actions of the user's speech; in this embodiment the phonemes may be the initials and finals that make up the speech. In this embodiment, a corresponding number is assigned in advance to every phoneme and stored in a phoneme table, so that the target speech sample can later be recognized according to the number of each phoneme. Meanwhile, because the audio data of the target speech sample represents sound signal characteristics such as the frequency, amplitude variation, and duration of the user's voice, that is, speech data lasting for a period of time, every word uttered by the user in the audio data is matched with a corresponding start-end time range. The start time point is the moment in the audio data of the target speech sample at which the user begins to utter the keyword, and the end time point is the moment at which the user finishes uttering it.
In one embodiment, when a target speech sample whose annotation data includes the keyword is acquired, speech recognition is performed on the audio data composing the target speech sample. Because the audio data is sound feature data that lasts for a period of time and is a quasi-stationary speech signal, the framing of the audio data is determined when it is recognized; the speech frame length is generally set to 20 ms-30 ms, and in this embodiment it is 20 ms. The phonemes contained in the audio data of each speech frame are then recognized: the audio data of the target speech sample is recognized according to the preset phoneme numbers and the speech frame length to obtain a phoneme recognition result, and the range in which the phonemes of the keyword appear in that result, that is, the start point and end point of the keyword's phonemes in the phoneme recognition result, is determined. The start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are then determined from the set speech frame length and the number of phoneme numbers corresponding to the start point and end point in the phoneme recognition result.
As an example, for the target speech sample "财政金融政策紧随其后而来" with the keyword "金融", the waveform of the audio data is shown in FIG. 2C, and the phonemes corresponding to "金融" are j, in, r, and ong. Because a short silence may occur between two characters when the user speaks, there may be some silence between "金" and "融" in the keyword contained in the audio data. Suppose silence is numbered "1", j is numbered "17", in is numbered "23", r is numbered "18", ong is numbered "27", and the speech frame length is 20 ms. Recognizing the audio data by phoneme number and frame length gives the phoneme recognition result "1 1 1 1 1 … 17 17 17 17 23 23 23 23 23 23 23 1 18 18 18 27 27 27 27 27 27 …", where each number corresponds to one speech frame. It can be observed that number 17, the phoneme "j" of "金", spans 4 frames; number 23, "in", spans 7 frames; number 18, the phoneme "r" of "融", spans 3 frames; and number 27, "ong", spans 6 frames. The first frame of "j" is the 63rd frame of the whole phoneme recognition result, so the start time point of "金" in the audio data is 62 × 20 ms = 1.24 s; "金" lasts 11 frames in total, so its duration is 11 × 20 ms = 0.22 s. Likewise, the start time point of "融" in the audio data is 1.24 s + 0.22 s + 20 ms = 1.48 s, and "融" lasts 9 frames in the phoneme recognition result, a duration of 9 × 20 ms = 0.18 s. The total duration of "金融" in the audio data of the target speech sample is therefore 0.22 s + 20 ms + 0.18 s = 0.42 s, so the start time point of the keyword "金融" in the audio data of the target speech sample is 1.24 s and the end time point is 1.66 s.
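The frame-counting arithmetic of this example can be expressed as the following Python sketch, which is an illustration rather than part of the original disclosure. The per-frame phoneme IDs would in practice come from a forced alignment of the audio against its transcript; the IDs (1 = silence, 17 = j, 23 = in, 18 = r, 27 = ong) and the 20 ms frame length follow the example above.

```python
def keyword_time_span(frame_ids, keyword_ids, silence_id=1, frame_ms=20):
    """Return (start_s, end_s) of the first run of the keyword's phonemes, in order,
    allowing silence frames between them; None if the keyword is not found."""
    n = len(frame_ids)
    for i in range(n):
        if frame_ids[i] != keyword_ids[0]:
            continue
        j, matched = i, True
        for pid in keyword_ids:
            while j < n and frame_ids[j] == silence_id:   # skip silence between characters
                j += 1
            if j >= n or frame_ids[j] != pid:
                matched = False
                break
            while j < n and frame_ids[j] == pid:          # consume this phoneme's frames
                j += 1
        if matched:
            return i * frame_ms / 1000.0, j * frame_ms / 1000.0
    return None

# Frame sequence of the example: 62 silence frames, then j(4), in(7), silence(1), r(3), ong(6).
frames = [1] * 62 + [17] * 4 + [23] * 7 + [1] + [18] * 3 + [27] * 6 + [1] * 17
print(keyword_time_span(frames, [17, 23, 18, 27]))  # -> (1.24, 1.66)
```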
S240: Intercept the audio data between the start time point and the end time point according to the start time point and the end time point to obtain the keyword speech segment.
Optionally, once the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample have been determined, the audio data segment lying between the start time point and the end time point can be cut out of the audio data. In the audio data corresponding to the target speech sample "财政金融政策紧随其后而来" above, this means cutting out the audio data segment between 1.24 s and 1.66 s, or equivalently cutting out, starting at 1.24 s, a segment lasting 0.42 s, and using it as the keyword speech segment of this embodiment; at this point the keyword speech segment contains only the speech information of the keyword "金融".
S250: Fill silence data of a preset length before the start time point and after the end time point of the keyword speech segment to obtain the keyword sample.
Optionally, when the corresponding keyword speech segment is obtained, to guarantee the independence of the keyword sample, silence data of a preset length can be filled in before and after the obtained keyword speech segment. In this embodiment the silence data may be data of value "0" with a preset speech frame length, so that an independent keyword sample is obtained that can easily be distinguished from other speech samples later.
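A minimal sketch of the interception and padding steps follows, again only as an illustration: it assumes the audio has already been decoded into a one-dimensional array of PCM samples at a known sample rate, and the 100 ms padding length stands in for the unspecified "preset length" of silence.

```python
import numpy as np

def cut_keyword_sample(audio, sample_rate, start_s, end_s, pad_s=0.1):
    """Cut audio[start_s:end_s] and pad zero-valued (silent) samples on both sides."""
    start = int(round(start_s * sample_rate))
    end = int(round(end_s * sample_rate))
    segment = audio[start:end]
    pad = np.zeros(int(round(pad_s * sample_rate)), dtype=audio.dtype)
    return np.concatenate([pad, segment, pad])

# Illustrative usage with the 1.24 s to 1.66 s span of "金融" on 16 kHz audio.
sr = 16000
audio = np.zeros(sr * 3, dtype=np.int16)   # placeholder waveform
keyword_sample = cut_keyword_sample(audio, sr, 1.24, 1.66)
print(len(keyword_sample) / sr)            # 0.42 s of keyword plus 0.2 s of silence
```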
Taking the aishell speech recognition sample library as an example, which contains 178 hours of speech samples from 400 speakers across multiple domains, a total of 610 target speech samples containing the keyword "金融" can be found. Performing keyword interception on these 610 target speech samples with the keyword sample determining method of this embodiment yields 610 keyword samples for the keyword "金融", and thus a diversified keyword sample set, which creates favorable conditions for the subsequent training of the keyword recognition model.
In the technical solution provided by this embodiment, the start time point and end time point of the keyword's phonemes in the audio data of the target speech sample are determined, and the keyword speech segment lying between them is cut out of the audio data to obtain a keyword sample. This guarantees the diversity of the determined keyword samples without specially and repeatedly recording the keyword speech of multiple users in multiple scenes, which reduces the acquisition cost of keyword samples and improves the comprehensiveness and accuracy of keyword sample determination.
Embodiment 3
FIG. 3A is a flowchart of a speech recognition method provided in Embodiment 3 of this application. This embodiment can be applied to any situation in which the keyword contained in a user's voice command is to be recognized, and the solution of this embodiment addresses the problem of the cumbersome training process of a keyword recognition model. The speech recognition method provided in this embodiment may be executed by the speech recognition apparatus provided in the embodiments of this application; the apparatus may be implemented in software and/or hardware and integrated in the device that executes the method, which may be any intelligent terminal device such as a laptop, tablet, or desktop computer.
Referring to FIG. 3A, this embodiment may include the following steps:
S310: Acquire a voice command of a user.
In one embodiment, when the user needs to perform an operation, the user utters speech carrying the keyword corresponding to that operation, and the device generates a corresponding voice command carrying the keyword when it receives the speech. In this embodiment, matching relationships between multiple keywords and different operations are preset according to the application scenario: for example, in a short-video application, predefined keywords can be matched to different video effects, while in a live-streaming application predefined keywords can be set to send corresponding gifts in the live room.
S320: Recognize the keyword in the voice command through a keyword recognition model.
Here, the keyword recognition model is trained in advance with keyword samples determined by the keyword sample determining method provided in the embodiments of this application. Illustratively, this embodiment acquires the keyword specified in advance by the user, queries every speech sample contained in the existing speech recognition sample library, and judges whether the annotation data composing each speech sample includes the specified keyword; the speech samples whose annotation data includes the keyword are taken as target speech samples. The start time point and end time point of the keyword's phonemes in the audio data of each target speech sample are then determined from the word's phonemes, and the audio data segment between them is cut out as the keyword speech segment, so that a large number of keyword samples are obtained. In this embodiment, after keyword samples for multiple keywords have been obtained, a corresponding keyword sample library is generated; it contains, for each of the user-specified keywords, keyword samples in different scenes and from different users that contain only the keyword speech.
In one embodiment, as shown in FIG. 3B, after the keyword sample library containing keyword samples for multiple keywords in different scenes is obtained, a preset keyword recognition model can be trained with the large number of keyword samples it contains. The keyword samples corresponding to a keyword are input into the preset keyword recognition model to obtain the keyword recognition results for those samples, and the classification loss of the recognition is evaluated. When the classification loss exceeds a preset loss threshold, the keyword recognition model is corrected according to that loss, further keyword samples for the same keyword are acquired and input into the corrected model for keyword recognition, and this repeats until the classification loss no longer exceeds the preset loss threshold. The keyword samples for the next keyword in the keyword sample library are then used for training in the same way, until the keyword samples of every keyword in the library have been used, yielding the final keyword recognition model, which can then accurately recognize the keywords in arbitrary speech.
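One possible reading of this training loop is sketched below in PyTorch-style Python, purely as an illustration: the model architecture, feature extraction, data loaders, and the 0.05 loss threshold are all assumptions not fixed by this application.

```python
import torch
import torch.nn as nn

def train_keyword_model(model, loaders_by_keyword, loss_threshold=0.05, lr=1e-3, max_rounds=100):
    """Train on one keyword at a time until its mean classification loss drops below the threshold."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for keyword, loader in loaders_by_keyword.items():
        for _ in range(max_rounds):
            total, count = 0.0, 0
            for features, labels in loader:          # keyword samples for this keyword
                optimizer.zero_grad()
                loss = criterion(model(features), labels)
                loss.backward()
                optimizer.step()
                total += loss.item()
                count += 1
            if count and total / count < loss_threshold:
                break                                # move on to the next keyword
    return model
```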
Optionally, when the user's voice command is acquired in this embodiment, it can be input into the pre-trained keyword recognition model, which parses the voice command and accurately recognizes the keyword it carries, so that the corresponding operation can subsequently be performed according to that keyword.
S330: Trigger an operation corresponding to the keyword according to the keyword.
In one embodiment, after the keyword carried in the user's voice command is recognized by the keyword recognition model, the keyword is analyzed to determine the operation matching it, and the execution of that operation is triggered, realizing the corresponding voice interaction control.
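The matching relationship between keywords and operations can be thought of as a simple dispatch table, as in the following Python sketch; the keyword strings and actions are illustrative placeholders only (for example, a video effect in a short-video application or a gift in a live-streaming room).

```python
KEYWORD_ACTIONS = {
    "金融": lambda: print("show the finance-news video effect"),
    "礼物": lambda: print("send the matching gift in the live room"),
}

def handle_voice_command(recognized_keyword):
    """Trigger the operation preset for the recognized keyword, if any."""
    action = KEYWORD_ACTIONS.get(recognized_keyword)
    if action is not None:
        action()
```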
In the technical solution provided by this embodiment, a preset keyword recognition model is trained with keyword samples determined by the keyword sample determining method described above, so that the model can accurately recognize the keyword carried in a voice command and trigger the corresponding operation according to the recognized keyword. This simplifies the cumbersome collection of keyword samples during model training and reduces their acquisition cost, and recognizing the keywords carried in the user's speech with the keyword recognition model trained on those samples improves the accuracy of speech recognition.
Embodiment 4
FIG. 4 is a schematic structural diagram of a keyword sample determining apparatus provided in Embodiment 4 of this application. As shown in FIG. 4, the apparatus may include:
a keyword acquisition module 410, configured to acquire a keyword;
a target speech acquisition module 420, configured to acquire a target speech sample including the keyword from an existing speech recognition sample library;
a keyword sample determining module 430, configured to determine a keyword speech segment in the target speech sample to obtain a keyword sample.
In the technical solution provided by this embodiment, target speech samples containing the keyword are acquired from an existing speech recognition sample library, and the keyword speech segments in those target speech samples are cut out to obtain keyword samples. Because the existing sample library contains a large number of speech samples from many types of users and many scene types, the acquired target speech samples containing the keyword also span multiple speech scene types, and so do the extracted keyword speech segments. Diversified keyword samples are thus obtained without specially and repeatedly recording the keyword speech of multiple users in multiple scenes, which reduces the acquisition cost of keyword samples and improves the comprehensiveness of keyword sample determination.
The keyword sample determining apparatus provided in this embodiment is applicable to the keyword sample determining method provided in any of the above embodiments of this application and has the corresponding functions and effects.
Embodiment 5
FIG. 5 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment 5 of this application. As shown in FIG. 5, the apparatus may include:
a voice command acquisition module 510, configured to acquire a voice command of a user;
a keyword recognition module 520, configured to recognize the keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus provided in the above embodiments;
an operation trigger module 530, configured to trigger an operation corresponding to the keyword according to the keyword.
In the technical solution provided by this embodiment, a preset keyword recognition model is trained with keyword samples determined by the keyword sample determining apparatus described above, so that the model can accurately recognize the keyword carried in a voice command and trigger the corresponding operation according to the recognized keyword. This simplifies the cumbersome collection of keyword samples during model training and reduces their acquisition cost, and recognizing the keywords carried in the user's speech with the keyword recognition model trained on those samples improves the accuracy of speech recognition.
The speech recognition apparatus provided in this embodiment is applicable to the speech recognition method provided in any of the above embodiments of this application and has the corresponding functions and effects.
Embodiment 6
FIG. 6 is a schematic structural diagram of a device provided in Embodiment 6 of this application. As shown in FIG. 6, the device includes a processor 60, a storage apparatus 61, and a communication apparatus 62. The number of processors 60 in the device may be one or more, and one processor 60 is taken as an example in FIG. 6; the processor 60, the storage apparatus 61, and the communication apparatus 62 in the device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 6.
The device provided in this embodiment can be used to execute the keyword sample determining method or the speech recognition method provided in any of the above embodiments and has the corresponding functions and effects.
Embodiment 7
Embodiment 7 of this application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the keyword sample determining method in any of the above embodiments. The method may include:
acquiring a keyword;
acquiring a target speech sample including the keyword from an existing speech recognition sample library;
determining a keyword speech segment in the target speech sample to obtain a keyword sample.
Alternatively, the program implements the speech recognition method in any of the above embodiments, which may include:
acquiring a voice command of a user;
recognizing a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining method provided in any of the above embodiments;
triggering an operation corresponding to the keyword according to the keyword.
In the storage medium containing computer-executable instructions provided by the embodiments of this application, the computer-executable instructions are not limited to the method operations described above and can also execute related operations in the keyword sample determining method or the speech recognition method provided in any embodiment of this application.
This application can be implemented by means of software and the necessary general-purpose hardware, or by hardware. This application can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes at least one instruction for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of this application.
In the above embodiments of the keyword sample determining apparatus and the speech recognition apparatus, the units and modules included are only divided according to functional logic, but the division is not limited to this as long as the corresponding functions can be realized; in addition, the names of the functional units are only intended to distinguish them from one another and are not used to limit the protection scope of this application.

Claims (10)

  1. A keyword sample determining method, comprising:
    acquiring a keyword;
    acquiring a target speech sample including the keyword from an existing speech recognition sample library;
    determining a keyword speech segment in the target speech sample to obtain a keyword sample.
  2. The method according to claim 1, wherein acquiring the target speech sample including the keyword from the existing speech recognition sample library comprises:
    in the existing speech recognition sample library, searching for a speech sample whose annotation data includes the keyword, and taking the found speech sample as the target speech sample.
  3. The method according to claim 1, wherein determining the keyword speech segment in the target speech sample comprises:
    determining a start time point and an end time point of phonemes of the keyword within phonemes of audio data of the target speech sample;
    intercepting audio data between the start time point and the end time point according to the start time point and the end time point to obtain the keyword speech segment.
  4. The method according to any one of claims 1 to 3, wherein obtaining the keyword sample comprises:
    filling silence data of a preset length before a start time point and after an end time point of the keyword speech segment to obtain the keyword sample.
  5. A speech recognition method, comprising:
    acquiring a voice command of a user;
    recognizing a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining method according to any one of claims 1 to 4;
    triggering an operation corresponding to the keyword according to the keyword.
  6. A keyword sample determining apparatus, comprising:
    a keyword acquisition module, configured to acquire a keyword;
    a target speech acquisition module, configured to acquire a target speech sample including the keyword from an existing speech recognition sample library;
    a keyword sample determining module, configured to determine a keyword speech segment in the target speech sample to obtain a keyword sample.
  7. The apparatus according to claim 6, wherein the target speech acquisition module is configured to:
    in the existing speech recognition sample library, search for a speech sample whose annotation data includes the keyword, and take the found speech sample as the target speech sample.
  8. A speech recognition apparatus, comprising:
    a voice command acquisition module, configured to acquire a voice command of a user;
    a keyword recognition module, configured to recognize a keyword in the voice command through a keyword recognition model, the keyword recognition model being trained in advance with keyword samples determined by the keyword sample determining apparatus according to claim 6 or 7;
    an operation trigger module, configured to trigger an operation corresponding to the keyword according to the keyword.
  9. A device, comprising:
    one or more processors; and
    a storage apparatus, configured to store one or more programs,
    wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the keyword sample determining method according to any one of claims 1 to 4, or to implement the speech recognition method according to claim 5.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the keyword sample determining method according to any one of claims 1 to 4, or implements the speech recognition method according to claim 5.
PCT/CN2020/077912 2019-03-13 2020-03-05 Keyword sample determining method, voice recognition method and apparatus, device, and medium WO2020182042A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910189413.1A CN109979440B (en) 2019-03-13 2019-03-13 Keyword sample determination method, voice recognition method, device, equipment and medium
CN201910189413.1 2019-03-13

Publications (1)

Publication Number Publication Date
WO2020182042A1 true WO2020182042A1 (en) 2020-09-17

Family

ID=67078805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077912 WO2020182042A1 (en) 2019-03-13 2020-03-05 Keyword sample determining method, voice recognition method and apparatus, device, and medium

Country Status (2)

Country Link
CN (1) CN109979440B (en)
WO (1) WO2020182042A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109979440B (en) * 2019-03-13 2021-05-11 广州市网星信息技术有限公司 Keyword sample determination method, voice recognition method, device, equipment and medium
CN110689895B (en) * 2019-09-06 2021-04-02 北京捷通华声科技股份有限公司 Voice verification method and device, electronic equipment and readable storage medium
CN110675896B (en) * 2019-09-30 2021-10-22 北京字节跳动网络技术有限公司 Character time alignment method, device and medium for audio and electronic equipment
CN111833856B (en) * 2020-07-15 2023-10-24 厦门熙重电子科技有限公司 Voice key information calibration method based on deep learning
CN113515454A (en) * 2021-07-01 2021-10-19 深圳创维-Rgb电子有限公司 Test case generation method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040002868A1 (en) * 2002-05-08 2004-01-01 Geppert Nicolas Andre Method and system for the processing of voice data and the classification of calls
US20040006464A1 (en) * 2002-05-08 2004-01-08 Geppert Nicolas Andre Method and system for the processing of voice data by means of voice recognition and frequency analysis
CN1889170A (en) * 2005-06-28 2007-01-03 国际商业机器公司 Method and system for generating synthesized speech base on recorded speech template
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN108009303A (en) * 2017-12-30 2018-05-08 北京百度网讯科技有限公司 Searching method, device, electronic equipment and storage medium based on speech recognition
CN108182937A (en) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 Keyword recognition method, device, equipment and storage medium
CN109979440A (en) * 2019-03-13 2019-07-05 广州市网星信息技术有限公司 Keyword sample determines method, audio recognition method, device, equipment and medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1113330C (en) * 1997-08-15 2003-07-02 英业达股份有限公司 Phoneme regulating method for phoneme synthesis
CN104700832B (en) * 2013-12-09 2018-05-25 联发科技股份有限公司 Voiced keyword detecting system and method
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
KR101703214B1 (en) * 2014-08-06 2017-02-06 주식회사 엘지화학 Method for changing contents of character data into transmitter's voice and outputting the transmiter's voice
US9959863B2 (en) * 2014-09-08 2018-05-01 Qualcomm Incorporated Keyword detection using speaker-independent keyword models for user-designated keywords
CN104517605B (en) * 2014-12-04 2017-11-28 北京云知声信息技术有限公司 A kind of sound bite splicing system and method for phonetic synthesis
CN105100460A (en) * 2015-07-09 2015-11-25 上海斐讯数据通信技术有限公司 Method and system for controlling intelligent terminal by use of sound
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN105117384A (en) * 2015-08-19 2015-12-02 小米科技有限责任公司 Classifier training method, and type identification method and apparatus
CN107451131A (en) * 2016-05-30 2017-12-08 贵阳朗玛信息技术股份有限公司 A kind of audio recognition method and device
CN107040452B (en) * 2017-02-08 2020-08-04 浙江翼信科技有限公司 Information processing method and device and computer readable storage medium
US10540961B2 (en) * 2017-03-13 2020-01-21 Baidu Usa Llc Convolutional recurrent neural networks for small-footprint keyword spotting
CN109065046A (en) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium that voice wakes up

Also Published As

Publication number Publication date
CN109979440A (en) 2019-07-05
CN109979440B (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2020182042A1 (en) Keyword sample determining method, voice recognition method and apparatus, device, and medium
CN110322869B (en) Conference character-division speech synthesis method, device, computer equipment and storage medium
CN110517689B (en) Voice data processing method, device and storage medium
CN109616096B (en) Construction method, device, server and medium of multilingual speech decoding graph
US7412383B1 (en) Reducing time for annotating speech data to develop a dialog application
CN105931644A (en) Voice recognition method and mobile terminal
TW201203222A (en) Voice stream augmented note taking
KR20030078388A (en) Apparatus for providing information using voice dialogue interface and method thereof
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110047469B (en) Voice data emotion marking method and device, computer equipment and storage medium
CN111798833A (en) Voice test method, device, equipment and storage medium
CN112435653A (en) Voice recognition method and device and electronic equipment
CN111897511A (en) Voice drawing method, device, equipment and storage medium
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN108877779B (en) Method and device for detecting voice tail point
WO2020233381A1 (en) Speech recognition-based service request method and apparatus, and computer device
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN113343108A (en) Recommendation information processing method, device, equipment and storage medium
Lopez-Otero et al. Efficient query-by-example spoken document retrieval combining phone multigram representation and dynamic time warping
CN115174285A (en) Conference record generation method and device and electronic equipment
Le et al. Automatic quality estimation for speech translation using joint ASR and MT features
CN110264994B (en) Voice synthesis method, electronic equipment and intelligent home system
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
US20230106550A1 (en) Method of processing speech, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20770810

Country of ref document: EP

Kind code of ref document: A1