CN114141244A - Voice recognition technology based on audio media analysis - Google Patents
Voice recognition technology based on audio media analysis
- Publication number
- CN114141244A (application CN202010922652.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- order
- media analysis
- ste
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a voice recognition technology based on audio media analysis, which classifies and recognizes the current audio/video information stream through audio feature extraction, heuristic rules, and a speech recognition system. Audio features are extracted as 13-order MFCCs, classification is performed by an SVM-based classifier, and the speech is finally recognized and output by a dedicated speech recognition engine.
Description
Technical Field
The invention relates to the field of computer voice recognition, in particular to a method for recognizing voice in audio and video.
Background
Nowadays, with the rapid development of computer networks and multimedia technology, the amount of multimedia information keeps growing. As the main multimedia form, video plays an important role in our daily life, and content-based video retrieval has attracted increasing research attention. Early research mostly focused on visual information, but such low-level features can hardly express high-level semantic concepts. Multimodal video content analysis is therefore a challenging research area. As an important component of video, audio can provide useful information for video content analysis, and it also offers important clues in particular domains such as news and sports video.
The present invention introduces our audio content analysis framework, which analyzes video content through automatic audio classification and speech recognition, and performs speech-to-text conversion.
Disclosure of Invention
The problem to be solved by the invention is how to apply audio analysis and speech recognition to video content analysis.
In order to achieve the aim of video content analysis, the invention adopts the technical scheme that:
first, we extract audio data from the video stream. Second, the audio data is divided into homogeneous segments. In this section, audio features based on the audio frame are extracted, including Mel Frequency Cepstral Coefficients (MFCCs), short-term energy, zero-crossing rate, and subband energy ratio. The frames are then classified into silence, music, and speech using a classifier based on rules and SVM. Four heuristic rules are then applied in the final audio segmentation. Finally, based on the audio segmentation information, speech segments and MFCC features to be used for speech recognition are extracted.
The method comprises the following steps:
1. audio feature extraction
The frame-level features we extract include the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio. These features are defined as follows:
1) Mel Frequency Cepstrum Coefficient (MFCC)

$$c_l = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} \log(s_k)\cos\!\left(\frac{l\,(k-0.5)\,\pi}{K}\right), \qquad l = 1, \dots, L$$

where $K$ is the number of band-pass filters, $s_k$ is the mel-weighted spectrum after passing through the $k$-th triangular band-pass filter, and $L$ is the order of the cepstrum.
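A minimal sketch of the cepstral step described above, assuming the mel-weighted spectrum $s_k$ has already been produced by a $K$-filter triangular filterbank (the filterbank itself is not shown, and the toy input values are hypothetical):

```python
import numpy as np

def mfcc_from_mel_spectrum(s, L=13):
    """Cepstral coefficients c_l (l = 1..L) from a mel-weighted
    spectrum s_k produced by K triangular band-pass filters."""
    s = np.asarray(s, dtype=float)
    K = len(s)
    k = np.arange(1, K + 1)
    return np.array([
        np.sqrt(2.0 / K) * np.sum(np.log(s) * np.cos(l * (k - 0.5) * np.pi / K))
        for l in range(1, L + 1)
    ])

# toy mel spectrum from K = 24 filters (hypothetical values)
c = mfcc_from_mel_spectrum(np.linspace(0.5, 2.0, 24))
print(c.shape)  # (13,)
```

A real front end would frame the signal, window it, and apply the mel filterbank to the FFT magnitudes before this step.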
2) Zero Crossing Rate (ZCR)
A zero crossing occurs when the time-domain signs of consecutive samples differ. The ZCR is defined as the number of zero crossings in a frame:

$$Z_i = \frac{1}{2}\sum_{m=1}^{N-1}\left|\operatorname{sgn}(x_i(m)) - \operatorname{sgn}(x_i(m-1))\right|$$

where $\operatorname{sgn}(\cdot)$ is the sign function, $x_i(m)$ is the $i$-th frame of the discrete audio signal, and $N$ is the frame length.
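A direct sketch of this definition, counting sign changes between consecutive samples (treating a zero sample as positive is an assumption, since the text leaves that case open):

```python
import numpy as np

def zero_crossing_rate(x):
    """Count zero crossings in one frame: a crossing occurs whenever
    the signs of two consecutive samples differ."""
    s = np.sign(np.asarray(x, dtype=float))
    s[s == 0] = 1.0  # assumption: treat exact zeros as positive
    # |sgn(x(m)) - sgn(x(m-1))| is 2 at a crossing, 0 otherwise
    return int(np.sum(np.abs(np.diff(s)) / 2))

print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))  # 3
```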
3) Short-term energy (STE)
STE is the total energy of the samples in an audio frame:

$$E_i = \sum_{m=0}^{N-1} x_i(m)^2$$

where $x_i(m)$ is the $i$-th frame of the discrete audio signal. STE can be used to distinguish silence from non-silence.
4) Sub-band energy ratio
The Frequency Energy (FE) is the total spectral power in a frame:

$$FE = \int_{0}^{\omega_0} |F(\omega)|^2 \, d\omega$$

where $F(\omega)$ is the FFT coefficient, $|F(\omega)|^2$ is the power at frequency $\omega$, and $\omega_0$ is half the sampling frequency. The ratio of the power of sub-band $i$ to the total power in a frame is defined as:

$$D_i = \frac{1}{FE}\int_{L_i}^{H_i} |F(\omega)|^2 \, d\omega$$

where $H_i$ and $L_i$ are the upper and lower limits of sub-band $i$, respectively. We divide the frequency domain of the audio signal into four sub-bands over the intervals $[0, \omega_0/8]$, $[\omega_0/8, \omega_0/4]$, $[\omega_0/4, \omega_0/2]$, and $[\omega_0/2, \omega_0]$.
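A discrete sketch of the four sub-band ratios, replacing the integrals with sums over FFT power bins (the test signal and sampling rate are illustrative):

```python
import numpy as np

def subband_energy_ratios(x, fs):
    """Ratio of power in the four sub-bands [0, w0/8], [w0/8, w0/4],
    [w0/4, w0/2], [w0/2, w0] to the total frame power, with w0 = fs/2."""
    power = np.abs(np.fft.rfft(x)) ** 2               # |F(w)|^2 up to fs/2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    w0 = fs / 2.0
    edges = [0.0, w0 / 8, w0 / 4, w0 / 2, w0]
    masks = [(freqs >= lo) & (freqs < hi)
             for lo, hi in zip(edges[:-1], edges[1:])]
    masks[-1] |= (freqs == w0)                        # keep the Nyquist bin
    total = power.sum()
    return [float(power[m].sum() / total) for m in masks]

# a 100 Hz sine at fs = 8 kHz puts almost all power in the lowest band
r = subband_energy_ratios(np.sin(2 * np.pi * 100 * np.arange(1024) / 8000), 8000)
```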
2. Classifier
1)SVM
Given a set of training vectors belonging to two classes, $(x_1, y_1), \dots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$ is a feature vector and $y_i \in \{+1, -1\}$ is the class label, the SVM finds the optimal decision function over this training set:

$$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b\right)$$

where $\alpha_i$ and $b$ are the classifier parameters and $K(\cdot, \cdot)$ is the kernel function. The training vectors $x_i$ for which $\alpha_i$ is non-zero are called support vectors.
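Once the support vectors and their multipliers are known, the decision function can be evaluated directly. A sketch with an RBF kernel and made-up toy parameters (a real system would obtain the support vectors, $\alpha_i$, and $b$ from SVM training; the kernel choice here is an assumption, since the patent does not name one):

```python
import numpy as np

def svm_decide(x, support_vectors, alphas, labels, b, gamma=1.0):
    """Evaluate f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b)
    with an RBF kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))
    return int(np.sign(np.dot(alphas * labels, k) + b))

# toy support vectors, one per class (hypothetical trained values)
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([1.0, -1.0])
print(svm_decide(np.array([0.1, 0.0]), sv, alphas, labels, b=0.0))  # 1
```

Points near the first support vector are assigned class +1, points near the second class -1.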
2) Audio classification scheme
In the present invention, audio frames are divided into three categories: silence, music, and speech. Silence is distinguished by heuristic rules based on ZCR and STE; the remaining non-silence frames are then classified into music and speech by an SVM-based classifier. The detailed audio classification system is shown in Fig. 1.
Based on experimental observations, we propose a dual-threshold detection rule. The rule for detecting silence is defined as:

$$STE < STE_{threshold}, \qquad ZCR < ZCR_{threshold}$$

If both the STE and the ZCR of a frame are below their predefined thresholds, the frame is marked as silence.
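A sketch of the dual-threshold rule; the threshold values here are hypothetical placeholders, since the patent only says the thresholds are chosen empirically:

```python
import numpy as np

def is_silence(frame, ste_threshold=0.01, zcr_threshold=10):
    """Dual-threshold rule: a frame is silence when both its short-time
    energy and zero-crossing rate fall below their thresholds
    (threshold values are illustrative, not from the patent)."""
    frame = np.asarray(frame, dtype=float)
    ste = float(np.sum(frame ** 2))          # short-time energy
    s = np.sign(frame)
    s[s == 0] = 1.0
    zcr = int(np.sum(np.abs(np.diff(s)) / 2))  # zero-crossing count
    return ste < ste_threshold and zcr < zcr_threshold
```

In practice the thresholds would be tuned on labeled frames so that quiet, low-activity frames are flagged while speech and music pass through.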
3. Audio segmentation
Audio segmentation aligns the audio class clusters on the timeline, with the aim of dividing the audio data into homogeneous segments. Story-unit segmentation is then performed using the segmented categories. Fig. 2 shows the audio segmentation process.
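The alignment of per-frame class labels into homogeneous segments can be sketched as a simple run-length merge (the story-unit step built on top of these segments is not shown):

```python
def merge_labels(frame_labels):
    """Merge consecutive frames carrying the same class label into
    homogeneous segments, returned as (label, start, end) with
    inclusive frame indices."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # close the current run at the end of input or on a label change
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i - 1))
            start = i
    return segments

print(merge_labels(["speech", "speech", "music", "silence", "silence"]))
# [('speech', 0, 1), ('music', 2, 2), ('silence', 3, 4)]
```

A production segmenter would typically also smooth the label sequence first to suppress single-frame classification glitches.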
4. Speech recognition
The speech recognition system comprises speech endpoint detection, feature extraction, acoustic and language models, and a recognition engine. The diagram in Fig. 3 shows the entire process.
Description of the drawings:
FIG. 1 is a flow chart of audio classification according to the present invention.
FIG. 2 is a flow chart of audio segmentation according to the present invention.
FIG. 3 is a diagram of a speech recognition system according to the present invention.
Detailed Description
The present invention is described in further detail below with reference to embodiments, so that those skilled in the art can practice the invention with reference to this description.
It should be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or combinations thereof.
The method for recognizing speech in audio and video comprises the following steps:
1) Extract audio data from the video stream.
2) Extract the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio, as defined above.
3) Classify the frames with the SVM classifier.
4) Distinguish silence with the dual-threshold detection rule: $STE < STE_{threshold}$, $ZCR < ZCR_{threshold}$.
5) Perform audio segmentation by aligning the audio class clusters on the timeline.
6) Recognize and output the speech through the speech recognition model.
While embodiments of the invention have been disclosed above, it is not limited to the applications listed in the description and the embodiments, which are fully applicable in all kinds of fields of application of the invention, and further modifications may readily be effected by those skilled in the art, so that the invention is not limited to the specific details without departing from the general concept defined by the claims and the scope of equivalents.
Claims (3)
1. A speech recognition technique based on audio media analysis, comprising the steps of:
1) extracting frame-level features, including the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio;
2) classifying the frames using an SVM-based classifier;
3) performing story unit segmentation using the segmented categories;
4) recognition is performed using a speech recognition system that includes speech endpoint detection, feature extraction, acoustic and language models, and a recognition engine.
2. The speech recognition technique based on audio media analysis of claim 1, wherein the frame-level features, including the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio, are determined by:
1) the Mel Frequency Cepstrum Coefficient (MFCC);
2) the Zero Crossing Rate (ZCR);
3) the Short-Term Energy (STE);
4) the sub-band energy ratio;
5) the ratio of the sub-band power to the total power in a frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010922652.6A CN114141244A (en) | 2020-09-04 | 2020-09-04 | Voice recognition technology based on audio media analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114141244A true CN114141244A (en) | 2022-03-04 |
Family
ID=80438767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010922652.6A Pending CN114141244A (en) | 2020-09-04 | 2020-09-04 | Voice recognition technology based on audio media analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141244A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751692A (en) * | 2009-12-24 | 2010-06-23 | 四川大学 | Method for voice-driven lip animation |
CN102426835A (en) * | 2011-08-30 | 2012-04-25 | 华南理工大学 | Method for identifying local discharge signals of switchboard based on support vector machine model |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN106407960A (en) * | 2016-11-09 | 2017-02-15 | 浙江师范大学 | Multi-feature-based classification method and system for music genres |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN108416592A (en) * | 2018-03-19 | 2018-08-17 | 成都信达智胜科技有限公司 | A kind of high speed voice recognition methods |
CN108806668A (en) * | 2018-06-08 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | A kind of audio and video various dimensions mark and model optimization method |
CN109766929A (en) * | 2018-12-24 | 2019-05-17 | 重庆第二师范学院 | A kind of audio frequency classification method and system based on SVM |
CN111145786A (en) * | 2019-12-17 | 2020-05-12 | 深圳追一科技有限公司 | Speech emotion recognition method and device, server and computer readable storage medium |
Non-Patent Citations (4)
Title |
---|
LI S Z: "Content-Based Audio Classification and Retrieval Using SVM Learning" *
孙文静; 李士强: "Design and Implementation of an SVM-Based Audio Classification System" *
王骞; 何培宇; 徐自励: "Deep Neural Network Speech Enhancement Method Using Singular Spectrum Analysis" *
陈戈珩; 胡明辉; 吴天华: "Audio Signal Classification Algorithm Based on Support Vector Machine and HMM" *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Theodorou et al. | An overview of automatic audio segmentation | |
Kos et al. | Acoustic classification and segmentation using modified spectral roll-off and variance-based features | |
Dhanalakshmi et al. | Classification of audio signals using AANN and GMM | |
CN102486920A (en) | Audio event detection method and device | |
JP2004229283A (en) | Method for identifying transition of news presenter in news video | |
Kwon et al. | Speaker change detection using a new weighted distance measure | |
Jothilakshmi et al. | Unsupervised speaker segmentation with residual phase and MFCC features | |
Khan et al. | Machine-learning based classification of speech and music | |
Kim et al. | Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation | |
EP1520238A2 (en) | Method for segmenting a compressed video | |
Vavrek et al. | Broadcast news audio classification using SVM binary trees | |
Ghosal et al. | Automatic male-female voice discrimination | |
Nyodu et al. | Automatic identification of Arunachal language using K-nearest neighbor algorithm | |
Li et al. | Instructional video content analysis using audio information | |
Krishnamoorthy et al. | Hierarchical audio content classification system using an optimal feature selection algorithm | |
Oger et al. | Transcription-based video genre classification | |
CN114141244A (en) | Voice recognition technology based on audio media analysis | |
Abburi et al. | Improved multimodal sentiment detection using stressed regions of audio | |
Feki et al. | Audio stream analysis for environmental sound classification | |
Feng et al. | An audio classification and speech recognition system for video content analysis | |
Barbedo et al. | A robust and computationally efficient speech/music discriminator | |
Xie | Discovering salient prosodic cues and their interactions for automatic story segmentation in Mandarin broadcast news | |
Liu et al. | Classification of music and speech in mandarin news broadcasts | |
Li | Nonexclusive audio segmentation and indexing as a pre-processor for audio information mining | |
Roy | Speaker indexing using neural network clustering of vowel spectra |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20220304 |