CN114863915A - Voice wake-up method and system based on semantic preservation

Info

Publication number
CN114863915A
Authority
CN
China
Prior art keywords
frame
voice
streaming
neural network
semantic
Prior art date
2022-07-05
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210780418.3A
Other languages
Chinese (zh)
Inventor
李郡
付冠宇
王啸
尚德龙
周玉梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Intelligent Technology Research Institute
Original Assignee
Zhongke Nanjing Intelligent Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-07-05
Filing date
2022-07-05
Publication date
2022-08-05
Application filed by Zhongke Nanjing Intelligent Technology Research Institute
Priority to CN202210780418.3A
Publication of CN114863915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training; Creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/24 Speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G10L 25/30 Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G06F 40/30 Handling natural language data; Semantic analysis
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/084 Learning methods; Backpropagation, e.g. using gradient descent


Abstract

The invention relates to a voice wake-up method and system based on semantic preservation. The method comprises the following steps: acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length; marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels; training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and using the streaming voice wake-up system neural network to recognize voice data and trigger voice wake-up according to the recognition result. The invention can improve the accuracy and stability of voice wake-up.

Description

Voice wake-up method and system based on semantic preservation
Technical Field
The invention relates to the field of voice wake-up, and in particular to a voice wake-up method and system based on semantic preservation.
Background
With the development of intelligent devices, voice interaction has become widely used, and the voice wake-up system is the key to enabling it. The goal of a voice wake-up system is to detect preset keywords in continuous voice input without any manual operation. To deliver an acceptable user experience, a voice wake-up system must therefore provide both high accuracy and high stability.
Accordingly, to improve the accuracy and stability of voice wake-up, a new voice wake-up method or system is needed.
Disclosure of Invention
The invention aims to provide a voice wake-up method and system based on semantic preservation that can improve the accuracy and stability of voice wake-up.
To achieve this purpose, the invention provides the following scheme:
A voice wake-up method based on semantic preservation comprises the following steps:
acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and
using the streaming voice wake-up system neural network to recognize voice data and triggering voice wake-up according to the recognition result.
Optionally, marking the continuous acoustic feature frames with keywords to determine the streaming frame-level labels specifically includes:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
Optionally, before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
judging whether the continuous acoustic feature frames reach a set frame number, the set frame number covering the length of every keyword in the voice sample data; and
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Optionally, before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
performing data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
Optionally, training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network specifically includes:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
A voice wake-up system based on semantic preservation comprises:
a voice sample data acquisition module, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain a streaming voice wake-up system neural network; and
a voice wake-up module, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
Optionally, the streaming frame-level label determination module specifically includes:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
Optionally, the system further comprises:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Optionally, the system further comprises:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
Optionally, the streaming voice wake-up system neural network determination module specifically includes:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
In the voice wake-up method and system based on semantic preservation provided by the invention, the continuous acoustic feature frames are marked with keywords to determine streaming frame-level labels, and the output frames whose label semantics are preserved are used to train the streaming voice wake-up system neural network for recognition. As a result, once a keyword appears, a stable wake-up state can be maintained for a certain time, which effectively reduces false wake-ups and improves the overall stability and accuracy of the voice wake-up system.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of the voice wake-up method based on semantic preservation according to the present invention;
Fig. 2 is a schematic structural diagram of the voice wake-up system based on semantic preservation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the protection scope of the present invention.
The invention aims to provide a voice wake-up method and system based on semantic preservation that can improve the accuracy and stability of voice wake-up.
To make the above objects, features and advantages of the present invention more comprehensible, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic flow chart of the voice wake-up method based on semantic preservation according to the present invention. As shown in Fig. 1, the method includes:
S101, acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length; a minimal sketch of this step follows.
s102, marking the continuous acoustic characteristic frames by using keywords, and determining a streaming frame level label; the streaming frame level tag includes: keyword semantic frame tags and non-keyword semantic frame tags; and correspondingly marking the non-keyword speech segments containing the keywords as non-keyword semantic frame labels.
S102 specifically includes:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
If the data set does not specify which phonemes each utterance contains and at which time each phoneme is located in the speech, this information can be obtained with the Montreal Forced Aligner tool. For keyword data, the segment starting at the 2/3 point of the last phoneme's duration and ending at the point where the last phoneme extends backward by 1/2 of its duration is labeled as the keyword semantic segment, and the other segments are labeled as non-keyword semantic segments. For non-keyword speech, all times are labeled as non-keyword semantic segments.
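The labeling rule above can be expressed as a small helper that converts a forced alignment into frame-level labels. This is a sketch under assumptions: the Montreal Forced Aligner output is taken as (phoneme, start_sec, end_sec) tuples, the frame shift is 10 ms, and labels 0 and 1 denote non-keyword and keyword semantic frames; the helper name and data format are illustrative, not from the source.

def frame_level_labels(alignment, num_frames, is_keyword, shift_s=0.010):
    """Label each feature frame as keyword-semantic (1) or non-keyword (0)."""
    labels = [0] * num_frames                # non-keyword by default
    if not is_keyword or not alignment:
        return labels                        # non-keyword speech: all zeros
    _, p_start, p_end = alignment[-1]        # last phoneme of the keyword
    dur = p_end - p_start
    seg_start = p_start + (2.0 / 3.0) * dur  # 2/3 of the way into the last phoneme
    seg_end = p_end + 0.5 * dur              # extended backward by 1/2 its duration
    for t in range(num_frames):
        if seg_start <= t * shift_s <= seg_end:
            labels[t] = 1                    # keyword semantic frame
    return labels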
S103, training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network. The input of the neural network is a two-dimensional feature formed by stacking continuous acoustic feature frames in time order, and the total length of the time frames can cover every keyword sample in the training data set.
The neural network is formed by stacking several convolutional layers, followed by a fully connected layer and a softmax, as sketched below.
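A minimal PyTorch sketch of this architecture follows. The number of convolutional layers, channel counts and kernel sizes are assumptions for illustration; the source fixes only the overall structure (stacked convolutions, a fully connected layer and a softmax over 1 + n classes).

import torch
import torch.nn as nn

class StreamingKWSNet(nn.Module):
    """Classifies one T x F window of acoustic feature frames into 1 + n classes."""
    def __init__(self, n_feats, n_keywords):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction module
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, n_feats)),        # -> (B, 32, 1, F)
        )
        self.classifier = nn.Sequential(               # classification layer
            nn.Flatten(),                              # flattened size F_e = 32 * F
            nn.Linear(32 * n_feats, 1 + n_keywords),   # 1 non-keyword + n keywords
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):                              # x: (B, T, F)
        return self.classifier(self.features(x.unsqueeze(1)))  # (B, 1 + n)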
Before S103, the method further includes:
judging whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data;
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Data enhancement is applied to the continuous acoustic feature frames and the corresponding streaming frame-level labels; the data enhancement includes noise addition, as sketched below.
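A sketch of the noise-addition step, assuming additive Gaussian noise mixed at a random signal-to-noise ratio; the SNR range is an illustrative assumption, since the source only names noise addition. The frame-level labels are left untouched by this operation.

import numpy as np

def add_noise(frames, snr_db_range=(5, 20), rng=None):
    """Return a noisy copy of the continuous acoustic feature frames."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    sig_power = np.mean(frames ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    return frames + rng.normal(0.0, np.sqrt(noise_power), size=frames.shape)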
During training, to guarantee streaming output of the neural network, if the acoustic feature frame lengths of the samples within one batch are inconsistent, an integer multiple of the longest acoustic feature frame count is taken as the standard frame number T_n of the whole batch, and zeros are padded after the acoustic feature frames of each sample to reach the standard frame number. Except for the frame-level keyword semantic labels already determined in S102, the labels corresponding to the remaining back-padded frames of each sample are all marked as frame-level non-keyword semantic frames. Meanwhile, T - 1 frames of zeros are padded in front of the acoustic feature frames, and these are likewise marked with frame-level non-keyword semantic labels. A sketch of this batch padding follows.
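The batch assembly just described can be sketched as follows, assuming NumPy arrays, label 0 for non-keyword semantic frames, and a helper name (pad_batch) introduced purely for illustration.

import numpy as np

def pad_batch(feats_list, labels_list, T, multiple=1):
    """Pad every sample to T_n frames at the back and T - 1 zero frames at the front."""
    longest = max(f.shape[0] for f in feats_list)
    T_n = multiple * longest                          # standard frame number of batch
    xs, ys = [], []
    for f, y in zip(feats_list, labels_list):
        back = T_n - f.shape[0]
        xs.append(np.pad(f, ((T - 1, back), (0, 0))))  # front and back zero frames
        ys.append([0] * (T - 1) + list(y) + [0] * back)  # padded frames -> label 0
    return np.stack(xs), np.array(ys)   # shapes (B, T_n + T - 1, F), (B, T_n + T - 1)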
In training, the input size of each sample should be (T_n + T - 1) × F, where (T_n + T - 1) is the number of time frames and F is the number of features per frame, and the streaming frame-level label length of each sample is T_n + T - 1. In time-frame order, features of size T × F are taken in turn as the input of the feature extraction module, finally yielding T_n output frames; after flattening, each output frame has F_e features, so the output size of the feature extraction module for each sample is T_n × F_e.
After feature extraction, the neural network is a combination of a fully connected layer and a softmax that serves as the classification layer and outputs 1 + n classes: 1 non-keyword class and n keyword classes. The input of the classification layer is the output of the feature extraction module, and for each sample the output size of the classification layer is T_n × (1 + n).
S103 specifically includes:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
The streaming frame-level label length of each sample is T_n + T - 1, while the classification layer output of each sample is T_n × (1 + n), so the last T_n frame-level labels of the streaming frame-level labels can be used for back propagation. For keyword samples, all frames marked as keyword semantics in the frame-level labels are selected, and some or all of the frames marked as non-keyword semantics are used for back propagation; for non-keyword samples, the last T_n non-keyword semantic frames in the streaming frame-level labels are selected for back propagation. A masked-loss sketch of this selection follows.
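The frame-selection rule can be realized as a masked loss, sketched below under assumptions: the classification layer emits log-probabilities of shape (B, T_n, 1 + n), the labels passed in are the last T_n streaming frame-level labels with class 0 as the non-keyword class, and keep_nonkw_frac controls how many non-keyword frames of keyword samples join the back propagation ("some or all"). These names are illustrative, not from the source.

import torch
import torch.nn.functional as F

def masked_nll_loss(log_probs, labels, is_keyword, keep_nonkw_frac=1.0):
    """log_probs: (B, T_n, 1+n); labels: (B, T_n); is_keyword: (B,) bool."""
    mask = torch.ones_like(labels, dtype=torch.bool)
    if keep_nonkw_frac < 1.0:                       # thin out non-keyword frames
        drop = torch.rand(labels.shape, device=labels.device) > keep_nonkw_frac
        mask &= ~((labels == 0) & drop & is_keyword[:, None])
    nll = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                     labels.reshape(-1), reduction="none").reshape_as(labels)
    return (nll * mask).sum() / mask.sum()          # only selected frames backprop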
S104, recognizing voice data with the streaming voice wake-up system neural network and triggering voice wake-up according to the recognition result.
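For deployment, S104 can be sketched as a sliding-window loop over the incoming feature frames, reusing the StreamingKWSNet sketch above. The posterior threshold is an illustrative assumption; the source does not specify deployment parameters.

import torch
import torch.nn.functional as F

def streaming_wakeup(model, feats, T, threshold=0.8):
    """feats: (num_frames, F) continuous acoustic feature frames; returns
    (frame_index, keyword_class) on wake-up, or (None, 0) otherwise."""
    x = F.pad(torch.as_tensor(feats, dtype=torch.float32), (0, 0, T - 1, 0))
    model.eval()
    with torch.no_grad():
        for t in range(feats.shape[0]):
            window = x[t:t + T].unsqueeze(0)          # (1, T, F)
            probs = model(window).exp().squeeze(0)    # [non-kw, kw_1, ..., kw_n]
            kw_prob, kw_idx = probs[1:].max(dim=0)
            if kw_prob.item() >= threshold:
                return t, int(kw_idx) + 1             # stable wake state begins here
    return None, 0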
Compared with existing methods, for example those that use time-shift data augmentation to recognize speech at different time offsets, the semantic-preservation-based streaming voice wake-up model training method directly uses all output frames whose label semantics are preserved to participate in training. Therefore, once a keyword appears when the voice wake-up system is actually deployed, a stable wake-up state can be maintained for a certain time, improving the overall stability and accuracy of the voice wake-up system.
Fig. 2 is a schematic structural diagram of the voice wake-up system based on semantic preservation according to the present invention. As shown in Fig. 2, the system includes:
a voice sample data acquisition module 201, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module 202, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module 203, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain the streaming voice wake-up system neural network; and
a voice wake-up module 204, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
The streaming frame-level label determination module 202 specifically includes:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
The voice wake-up system based on semantic preservation provided by the invention further comprises:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
The voice wake-up system based on semantic preservation provided by the invention further comprises:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
The streaming voice wake-up system neural network determination module 203 specifically includes:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples are used herein to explain the principles and embodiments of the present invention, and the above description is intended only to help understand the method and core concept of the invention. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice wake-up method based on semantic preservation, characterized by comprising:
acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and
using the streaming voice wake-up system neural network to recognize voice data and triggering voice wake-up according to the recognition result.
2. The voice wake-up method based on semantic preservation according to claim 1, wherein marking the continuous acoustic feature frames with keywords to determine the streaming frame-level labels specifically comprises:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
3. The voice wake-up method based on semantic preservation according to claim 1, wherein before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
judging whether the continuous acoustic feature frames reach a set frame number, the set frame number covering the length of every keyword in the voice sample data; and
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
4. The voice wake-up method based on semantic preservation according to claim 1, wherein before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
performing data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
5. The voice wake-up method based on semantic preservation according to claim 1, wherein training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network specifically comprises:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
6. A voice wake-up system based on semantic preservation, characterized by comprising:
a voice sample data acquisition module, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain a streaming voice wake-up system neural network; and
a voice wake-up module, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
7. The voice wake-up system based on semantic preservation according to claim 6, wherein the streaming frame-level label determination module specifically comprises:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
8. The voice wake-up system based on semantic preservation according to claim 6, further comprising:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
9. The voice wake-up system based on semantic preservation according to claim 6, further comprising:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
10. The voice wake-up system based on semantic preservation according to claim 6, wherein the streaming voice wake-up system neural network determination module specifically comprises:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
CN202210780418.3A 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation Pending CN114863915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210780418.3A CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210780418.3A CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Publications (1)

Publication Number Publication Date
CN114863915A true CN114863915A (en) 2022-08-05

Family

ID=82627042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210780418.3A Pending CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Country Status (1)

Country Link
CN (1) CN114863915A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180151199A1 (en) * 2016-11-29 2018-05-31 Beijing Xiaomi Mobile Software Co., Ltd. Method, Device and Computer-Readable Medium for Adjusting Video Playing Progress
US20200020322A1 (en) * 2018-07-13 2020-01-16 Google Llc End-to-End Streaming Keyword Spotting
CN109862408A (en) * 2018-12-29 2019-06-07 江苏爱仕达电子有限公司 A kind of user speech identification control method for smart television voice remote controller
CN111429887A (en) * 2020-04-20 2020-07-17 合肥讯飞数码科技有限公司 End-to-end-based speech keyword recognition method, device and equipment
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113782009A (en) * 2021-11-10 2021-12-10 中科南京智能技术研究院 Voice awakening system based on Savitzky-Golay filter smoothing method
CN114566156A (en) * 2022-02-28 2022-05-31 恒玄科技(上海)股份有限公司 Keyword speech recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
勒蕃 (Le Fan), "Neural Network Theory and Application Research" (《神经网络理论与应用研究》), 30 October 1996 *
黄德双 (Huang Deshuang), "Modern Information Technology Theory and Applications" (《现代信息技术理论与应用》), 30 August 2002 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012206A (en) * 2023-10-07 2023-11-07 山东省智能机器人应用技术研究院 Man-machine voice interaction system
CN117012206B (en) * 2023-10-07 2024-01-16 山东省智能机器人应用技术研究院 Man-machine voice interaction system

Similar Documents

Publication Publication Date Title
CN108305634B (en) Decoding method, decoder and storage medium
CN111583909B (en) Voice recognition method, device, equipment and storage medium
CN109523986B (en) Speech synthesis method, apparatus, device and storage medium
WO2017076222A1 (en) Speech recognition method and apparatus
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109036471B (en) Voice endpoint detection method and device
JPH0772839B2 (en) Method and apparatus for grouping phoneme pronunciations into phonetic similarity-based context-dependent categories for automatic speech recognition
CN113035231B (en) Keyword detection method and device
CN111341305A (en) Audio data labeling method, device and system
CN108922521A (en) A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN111833844A (en) Training method and system of mixed model for speech recognition and language classification
CN111724766B (en) Language identification method, related equipment and readable storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN112614514A (en) Valid voice segment detection method, related device and readable storage medium
CN114863915A (en) Voice awakening method and system based on semantic preservation
CN113850291A (en) Text processing and model training method, device, equipment and storage medium
CN113593522A (en) Voice data labeling method and device
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN108538292A (en) A kind of audio recognition method, device, equipment and readable storage medium storing program for executing
CN115910046A (en) Voice recognition method and device, electronic equipment and storage medium
CN114141271B (en) Psychological state detection method and system
CN113838462B (en) Voice wakeup method, voice wakeup device, electronic equipment and computer readable storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220805