CN114863915A - Voice wake-up method and system based on semantic preservation

Info

Publication number
CN114863915A
Authority
CN
China
Prior art keywords
frame
voice
streaming
neural network
semantic
Prior art date
2022-07-05
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210780418.3A
Other languages
Chinese (zh)
Inventor
李郡
付冠宇
王啸
尚德龙
周玉梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Nanjing Intelligent Technology Research Institute
Original Assignee
Zhongke Nanjing Intelligent Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-07-05
Filing date
2022-07-05
Publication date
2022-08-05
Application filed by Zhongke Nanjing Intelligent Technology Research Institute
Priority to CN202210780418.3A
Publication of CN114863915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063 Training; Creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/24 Speech or voice analysis characterised by the extracted parameters being the cepstrum
    • G10L 25/30 Speech or voice analysis characterised by the analysis technique using neural networks
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G06F 40/30 Handling natural language data; Semantic analysis
    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/084 Learning methods; Backpropagation, e.g. using gradient descent


Abstract

The invention relates to a voice wake-up method and system based on semantic preservation. The method comprises the following steps: acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length; marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels; training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and using the streaming voice wake-up system neural network to recognize voice data and trigger voice wake-up according to the recognition result. The invention can improve the accuracy and stability of voice wake-up.

Description

Voice wake-up method and system based on semantic preservation
Technical Field
The invention relates to the field of voice wake-up, and in particular to a voice wake-up method and system based on semantic preservation.
Background
With the development of intelligent devices, voice interaction has become widely used, and the voice wake-up system is the key to enabling it. The goal of a voice wake-up system is to detect preset keywords in continuous voice input without any manual operation. To deliver an acceptable user experience, a voice wake-up system must therefore provide both high accuracy and high stability.
Accordingly, to improve the accuracy and stability of voice wake-up, a new voice wake-up method or system is needed.
Disclosure of Invention
The invention aims to provide a voice wake-up method and system based on semantic preservation that can improve the accuracy and stability of voice wake-up.
To achieve this purpose, the invention provides the following scheme:
A voice wake-up method based on semantic preservation comprises the following steps:
acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and
using the streaming voice wake-up system neural network to recognize voice data and triggering voice wake-up according to the recognition result.
Optionally, marking the continuous acoustic feature frames with keywords to determine the streaming frame-level labels specifically includes:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
Optionally, before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
judging whether the continuous acoustic feature frames reach a set frame number, the set frame number covering the length of every keyword in the voice sample data; and
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Optionally, before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
performing data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
Optionally, training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network specifically includes:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
A voice wake-up system based on semantic preservation comprises:
a voice sample data acquisition module, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain a streaming voice wake-up system neural network; and
a voice wake-up module, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
Optionally, the streaming frame-level label determination module specifically includes:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
Optionally, the system further comprises:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Optionally, the system further comprises:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
Optionally, the streaming voice wake-up system neural network determination module specifically includes:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
In the voice wake-up method and system based on semantic preservation provided by the invention, the continuous acoustic feature frames are marked with keywords to determine streaming frame-level labels, and the output frames whose label semantics are preserved are used to train the streaming voice wake-up system neural network for recognition. As a result, once a keyword appears, a stable wake-up state can be maintained for a certain time, which effectively reduces false wake-ups and improves the overall stability and accuracy of the voice wake-up system.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of the voice wake-up method based on semantic preservation according to the present invention;
Fig. 2 is a schematic structural diagram of the voice wake-up system based on semantic preservation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the protection scope of the present invention.
The invention aims to provide a voice wake-up method and system based on semantic preservation that can improve the accuracy and stability of voice wake-up.
To make the above objects, features and advantages of the present invention more comprehensible, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic flow chart of the voice wake-up method based on semantic preservation according to the present invention. As shown in Fig. 1, the method includes:
S101, acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length; a minimal sketch of this step follows.
s102, marking the continuous acoustic characteristic frames by using keywords, and determining a streaming frame level label; the streaming frame level tag includes: keyword semantic frame tags and non-keyword semantic frame tags; and correspondingly marking the non-keyword speech segments containing the keywords as non-keyword semantic frame labels.
S102 specifically includes:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
If the data set does not specify which phonemes each utterance contains and at which time each phoneme is located in the speech, this information can be obtained with the Montreal Forced Aligner tool. For keyword data, the segment starting at the 2/3 point of the last phoneme's duration and ending at the point where the last phoneme extends backward by 1/2 of its duration is labeled as the keyword semantic segment, and the other segments are labeled as non-keyword semantic segments. For non-keyword speech, all times are labeled as non-keyword semantic segments.
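The labeling rule above can be expressed as a small helper that converts a forced alignment into frame-level labels. This is a sketch under assumptions: the Montreal Forced Aligner output is taken as (phoneme, start_sec, end_sec) tuples, the frame shift is 10 ms, and labels 0 and 1 denote non-keyword and keyword semantic frames; the helper name and data format are illustrative, not from the source.

def frame_level_labels(alignment, num_frames, is_keyword, shift_s=0.010):
    """Label each feature frame as keyword-semantic (1) or non-keyword (0)."""
    labels = [0] * num_frames                # non-keyword by default
    if not is_keyword or not alignment:
        return labels                        # non-keyword speech: all zeros
    _, p_start, p_end = alignment[-1]        # last phoneme of the keyword
    dur = p_end - p_start
    seg_start = p_start + (2.0 / 3.0) * dur  # 2/3 of the way into the last phoneme
    seg_end = p_end + 0.5 * dur              # extended backward by 1/2 its duration
    for t in range(num_frames):
        if seg_start <= t * shift_s <= seg_end:
            labels[t] = 1                    # keyword semantic frame
    return labels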
S103, training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network. The input of the neural network is a two-dimensional feature formed by stacking continuous acoustic feature frames in time order, and the total length of the time frames can cover every keyword sample in the training data set.
The neural network is formed by stacking several convolutional layers, followed by a fully connected layer and a softmax, as sketched below.
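A minimal PyTorch sketch of this architecture follows. The number of convolutional layers, channel counts and kernel sizes are assumptions for illustration; the source fixes only the overall structure (stacked convolutions, a fully connected layer and a softmax over 1 + n classes).

import torch
import torch.nn as nn

class StreamingKWSNet(nn.Module):
    """Classifies one T x F window of acoustic feature frames into 1 + n classes."""
    def __init__(self, n_feats, n_keywords):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction module
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, n_feats)),        # -> (B, 32, 1, F)
        )
        self.classifier = nn.Sequential(               # classification layer
            nn.Flatten(),                              # flattened size F_e = 32 * F
            nn.Linear(32 * n_feats, 1 + n_keywords),   # 1 non-keyword + n keywords
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):                              # x: (B, T, F)
        return self.classifier(self.features(x.unsqueeze(1)))  # (B, 1 + n)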
Before S103, the method further includes:
judging whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data;
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
Data enhancement is applied to the continuous acoustic feature frames and the corresponding streaming frame-level labels; the data enhancement includes noise addition, as sketched below.
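A sketch of the noise-addition step, assuming additive Gaussian noise mixed at a random signal-to-noise ratio; the SNR range is an illustrative assumption, since the source only names noise addition. The frame-level labels are left untouched by this operation.

import numpy as np

def add_noise(frames, snr_db_range=(5, 20), rng=None):
    """Return a noisy copy of the continuous acoustic feature frames."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    sig_power = np.mean(frames ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    return frames + rng.normal(0.0, np.sqrt(noise_power), size=frames.shape)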
During training, to guarantee streaming output of the neural network, if the acoustic feature frame lengths of the samples within one batch are inconsistent, an integer multiple of the longest acoustic feature frame count is taken as the standard frame number T_n of the whole batch, and zeros are padded after the acoustic feature frames of each sample to reach the standard frame number. Except for the frame-level keyword semantic labels already determined in S102, the labels corresponding to the remaining back-padded frames of each sample are all marked as frame-level non-keyword semantic frames. Meanwhile, T - 1 frames of zeros are padded in front of the acoustic feature frames, and these are likewise marked with frame-level non-keyword semantic labels. A sketch of this batch padding follows.
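The batch assembly just described can be sketched as follows, assuming NumPy arrays, label 0 for non-keyword semantic frames, and a helper name (pad_batch) introduced purely for illustration.

import numpy as np

def pad_batch(feats_list, labels_list, T, multiple=1):
    """Pad every sample to T_n frames at the back and T - 1 zero frames at the front."""
    longest = max(f.shape[0] for f in feats_list)
    T_n = multiple * longest                          # standard frame number of batch
    xs, ys = [], []
    for f, y in zip(feats_list, labels_list):
        back = T_n - f.shape[0]
        xs.append(np.pad(f, ((T - 1, back), (0, 0))))  # front and back zero frames
        ys.append([0] * (T - 1) + list(y) + [0] * back)  # padded frames -> label 0
    return np.stack(xs), np.array(ys)   # shapes (B, T_n + T - 1, F), (B, T_n + T - 1)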
In training, the input size of each sample should be (T_n + T - 1) × F, where (T_n + T - 1) is the number of time frames and F is the number of features per frame, and the streaming frame-level label length of each sample is T_n + T - 1. In time-frame order, features of size T × F are taken in turn as the input of the feature extraction module, finally yielding T_n output frames; after flattening, each output frame has F_e features, so the output size of the feature extraction module for each sample is T_n × F_e.
After feature extraction, the neural network is a combination of a fully connected layer and a softmax that serves as the classification layer and outputs 1 + n classes: 1 non-keyword class and n keyword classes. The input of the classification layer is the output of the feature extraction module, and for each sample the output size of the classification layer is T_n × (1 + n).
S103 specifically includes:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
The streaming frame-level label length of each sample is T_n + T - 1, while the classification layer output of each sample is T_n × (1 + n), so the last T_n frame-level labels of the streaming frame-level labels can be used for back propagation. For keyword samples, all frames marked as keyword semantics in the frame-level labels are selected, and some or all of the frames marked as non-keyword semantics are used for back propagation; for non-keyword samples, the last T_n non-keyword semantic frames in the streaming frame-level labels are selected for back propagation. A masked-loss sketch of this selection follows.
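The frame-selection rule can be realized as a masked loss, sketched below under assumptions: the classification layer emits log-probabilities of shape (B, T_n, 1 + n), the labels passed in are the last T_n streaming frame-level labels with class 0 as the non-keyword class, and keep_nonkw_frac controls how many non-keyword frames of keyword samples join the back propagation ("some or all"). These names are illustrative, not from the source.

import torch
import torch.nn.functional as F

def masked_nll_loss(log_probs, labels, is_keyword, keep_nonkw_frac=1.0):
    """log_probs: (B, T_n, 1+n); labels: (B, T_n); is_keyword: (B,) bool."""
    mask = torch.ones_like(labels, dtype=torch.bool)
    if keep_nonkw_frac < 1.0:                       # thin out non-keyword frames
        drop = torch.rand(labels.shape, device=labels.device) > keep_nonkw_frac
        mask &= ~((labels == 0) & drop & is_keyword[:, None])
    nll = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                     labels.reshape(-1), reduction="none").reshape_as(labels)
    return (nll * mask).sum() / mask.sum()          # only selected frames backprop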
S104, recognizing voice data with the streaming voice wake-up system neural network and triggering voice wake-up according to the recognition result.
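For deployment, S104 can be sketched as a sliding-window loop over the incoming feature frames, reusing the StreamingKWSNet sketch above. The posterior threshold is an illustrative assumption; the source does not specify deployment parameters.

import torch
import torch.nn.functional as F

def streaming_wakeup(model, feats, T, threshold=0.8):
    """feats: (num_frames, F) continuous acoustic feature frames; returns
    (frame_index, keyword_class) on wake-up, or (None, 0) otherwise."""
    x = F.pad(torch.as_tensor(feats, dtype=torch.float32), (0, 0, T - 1, 0))
    model.eval()
    with torch.no_grad():
        for t in range(feats.shape[0]):
            window = x[t:t + T].unsqueeze(0)          # (1, T, F)
            probs = model(window).exp().squeeze(0)    # [non-kw, kw_1, ..., kw_n]
            kw_prob, kw_idx = probs[1:].max(dim=0)
            if kw_prob.item() >= threshold:
                return t, int(kw_idx) + 1             # stable wake state begins here
    return None, 0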
Compared with existing methods, for example those that use time-shift data augmentation to recognize speech at different time offsets, the semantic-preservation-based streaming voice wake-up model training method directly uses all output frames whose label semantics are preserved to participate in training. Therefore, once a keyword appears when the voice wake-up system is actually deployed, a stable wake-up state can be maintained for a certain time, improving the overall stability and accuracy of the voice wake-up system.
Fig. 2 is a schematic structural diagram of the voice wake-up system based on semantic preservation according to the present invention. As shown in Fig. 2, the system includes:
a voice sample data acquisition module 201, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module 202, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module 203, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain the streaming voice wake-up system neural network; and
a voice wake-up module 204, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
The streaming frame-level label determination module 202 specifically includes:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
The voice wake-up system based on semantic preservation provided by the invention further comprises:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
The voice wake-up system based on semantic preservation provided by the invention further comprises:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
The streaming voice wake-up system neural network determination module 203 specifically includes:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples are used herein to explain the principles and embodiments of the present invention, and the above description is intended only to help understand the method and core concept of the invention. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice wake-up method based on semantic preservation, characterized by comprising:
acquiring voice sample data, performing feature extraction on it, and determining continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
marking the continuous acoustic feature frames with keywords to determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
training a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain a streaming voice wake-up system neural network; and
using the streaming voice wake-up system neural network to recognize voice data and triggering voice wake-up according to the recognition result.
2. The voice wake-up method based on semantic preservation according to claim 1, wherein marking the continuous acoustic feature frames with keywords to determine the streaming frame-level labels specifically comprises:
marking the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
converting the semantic-preserving phoneme-level labels into streaming frame-level labels.
3. The voice wake-up method based on semantic preservation according to claim 1, wherein before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
judging whether the continuous acoustic feature frames reach a set frame number, the set frame number covering the length of every keyword in the voice sample data; and
if not, zero-padding in front of the continuous acoustic feature frames to reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
4. The voice wake-up method based on semantic preservation according to claim 1, wherein before training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network, the method further comprises:
performing data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
5. The voice wake-up method based on semantic preservation according to claim 1, wherein training the neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels to obtain the streaming voice wake-up system neural network specifically comprises:
performing back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
6. A voice wake-up system based on semantic preservation, characterized by comprising:
a voice sample data acquisition module, configured to acquire voice sample data, perform feature extraction on it, and determine continuous acoustic feature frame information, which includes the Mel-frequency cepstral coefficients, the frame shift, and the single-frame length;
a streaming frame-level label determination module, configured to mark the continuous acoustic feature frames with keywords and determine streaming frame-level labels, which include keyword semantic frame labels and non-keyword semantic frame labels;
a streaming voice wake-up system neural network determination module, configured to train a neural network on the continuous acoustic feature frames and the corresponding streaming frame-level labels and obtain a streaming voice wake-up system neural network; and
a voice wake-up module, configured to recognize voice data using the streaming voice wake-up system neural network and trigger voice wake-up according to the recognition result.
7. The voice wake-up system based on semantic preservation according to claim 6, wherein the streaming frame-level label determination module specifically comprises:
a phoneme-level label determination unit, configured to mark the continuous acoustic feature frames of each voice sample with semantic-preserving phoneme-level labels, which comprise keyword semantic segments and non-keyword semantic segments; and
a streaming frame-level label determination unit, configured to convert the semantic-preserving phoneme-level labels into streaming frame-level labels.
8. The voice wake-up system based on semantic preservation according to claim 6, further comprising:
a judgment module, configured to judge whether the continuous acoustic feature frames reach the set frame number, the set frame number covering the length of every keyword in the voice sample data; and
a zero-padding module, configured to zero-pad in front of the continuous acoustic feature frames if they do not reach the set frame number, with each zero-padded position labeled as a non-keyword semantic frame.
9. The voice wake-up system based on semantic preservation according to claim 6, further comprising:
a data enhancement module, configured to perform data enhancement on the continuous acoustic feature frames and the corresponding streaming frame-level labels.
10. The voice wake-up system based on semantic preservation according to claim 6, wherein the streaming voice wake-up system neural network determination module specifically comprises:
a streaming voice wake-up system neural network training unit, configured to perform back propagation according to the recognition results of the neural network to update its parameters, thereby completing the training of the voice wake-up neural network model.
CN202210780418.3A 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation Pending CN114863915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210780418.3A CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210780418.3A CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Publications (1)

Publication Number Publication Date
CN114863915A true CN114863915A (en) 2022-08-05

Family

ID=82627042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210780418.3A Pending CN114863915A (en) 2022-07-05 2022-07-05 Voice awakening method and system based on semantic preservation

Country Status (1)

Country Link
CN (1) CN114863915A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180151199A1 (en) * 2016-11-29 2018-05-31 Beijing Xiaomi Mobile Software Co., Ltd. Method, Device and Computer-Readable Medium for Adjusting Video Playing Progress
US20200020322A1 (en) * 2018-07-13 2020-01-16 Google Llc End-to-End Streaming Keyword Spotting
CN109862408A (en) * 2018-12-29 2019-06-07 江苏爱仕达电子有限公司 A kind of user speech identification control method for smart television voice remote controller
CN111429887A (en) * 2020-04-20 2020-07-17 合肥讯飞数码科技有限公司 End-to-end-based speech keyword recognition method, device and equipment
CN113035231A (en) * 2021-03-18 2021-06-25 三星(中国)半导体有限公司 Keyword detection method and device
CN113782009A (en) * 2021-11-10 2021-12-10 中科南京智能技术研究院 Voice awakening system based on Savitzky-Golay filter smoothing method
CN114566156A (en) * 2022-02-28 2022-05-31 恒玄科技(上海)股份有限公司 Keyword speech recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
勒蕃 (Le Fan), "Neural Network Theory and Application Research" (《神经网络理论与应用研究》), 30 October 1996 *
黄德双 (Huang Deshuang), "Modern Information Technology Theory and Applications" (《现代信息技术理论与应用》), 30 August 2002 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012206A (en) * 2023-10-07 2023-11-07 山东省智能机器人应用技术研究院 Man-machine voice interaction system
CN117012206B (en) * 2023-10-07 2024-01-16 山东省智能机器人应用技术研究院 Man-machine voice interaction system

Similar Documents

Publication Publication Date Title
CN108305634B (en) Decoding method, decoder and storage medium
CN111583909B (en) Voice recognition method, device, equipment and storage medium
CN109523986B (en) Speech synthesis method, apparatus, device and storage medium
WO2017076222A1 (en) Speech recognition method and apparatus
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
CN108305616A (en) A kind of audio scene recognition method and device based on long feature extraction in short-term
CN109036471B (en) Voice endpoint detection method and device
JPH0772839B2 (en) Method and apparatus for grouping phoneme pronunciations into phonetic similarity-based context-dependent categories for automatic speech recognition
CN113035231B (en) Keyword detection method and device
CN111341305A (en) Audio data labeling method, device and system
CN108922521A (en) A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN111833844A (en) Training method and system of mixed model for speech recognition and language classification
CN111724766B (en) Language identification method, related equipment and readable storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN112614514A (en) Valid voice segment detection method, related device and readable storage medium
CN114863915A (en) Voice awakening method and system based on semantic preservation
CN113850291A (en) Text processing and model training method, device, equipment and storage medium
CN113593522A (en) Voice data labeling method and device
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
CN108538292A (en) A kind of audio recognition method, device, equipment and readable storage medium storing program for executing
CN115910046A (en) Voice recognition method and device, electronic equipment and storage medium
CN114141271B (en) Psychological state detection method and system
CN113838462B (en) Voice wakeup method, voice wakeup device, electronic equipment and computer readable storage medium
CN115512692A (en) Voice recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20220805