CN114141244A - Voice recognition technology based on audio media analysis - Google Patents


Info

Publication number
CN114141244A
CN114141244A
Authority
CN
China
Prior art keywords
audio
order
media analysis
ste
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010922652.6A
Other languages
Chinese (zh)
Inventor
余城诚
彭德中
王骞
刘杰
张利君
银大伟
蒋瑞
付俊英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202010922652.6A
Publication of CN114141244A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention discloses a voice recognition technique based on audio media analysis, which classifies and recognizes the current audio and video information stream through audio feature extraction, heuristic rules, and a speech recognition system. Audio features are extracted as 13-order MFCCs, and classification is realized by SVM-based rules. Finally, the speech is recognized and output through a dedicated speech recognition engine.

Description

Voice recognition technology based on audio media analysis
Technical Field
The invention relates to the field of computer voice recognition, in particular to a method for recognizing voice in audio and video.
Background
Nowadays, with the rapid development of computer networks and multimedia technologies, multimedia information is constantly increasing. As the dominant form of multimedia, video plays an important role in our daily life, and content-based video retrieval has attracted increasing research attention. Early research mostly focused on visual information, but these low-level features can hardly express high-level semantic concepts. Multi-modal video content analysis is therefore a challenging area of research. As an important component of video, audio can provide useful information for video content analysis; it can also provide important clues in particular domains such as news video and sports video.
The present invention introduces our audio content analysis framework, which analyzes video content through automatic audio classification and speech recognition and performs speech-to-text conversion.
Disclosure of Invention
The problem to be solved by the invention is how to apply audio analysis and speech recognition to video content analysis.
In order to achieve the aim of video content analysis, the invention adopts the technical scheme that:
First, we extract audio data from the video stream. Second, the audio data is divided into homogeneous segments. In this step, frame-level audio features are extracted, including Mel Frequency Cepstral Coefficients (MFCCs), short-time energy, zero-crossing rate, and subband energy ratio. The frames are then classified into silence, music, and speech using a rule- and SVM-based classifier. Four heuristic rules are then applied in the final audio segmentation. Finally, based on the audio segmentation information, the speech segments and the MFCC features to be used for speech recognition are extracted.
The method comprises the following steps:
1. audio feature extraction
The frame-level features we extract include: 13-order MFCCs, 13-order first-order difference MFCCs, zero-crossing rate, short-time energy, and subband energy ratio. These features are defined as follows:
1) Mel Frequency Cepstrum Coefficient (MFCC)

c_n = \sqrt{2/K} \sum_{k=1}^{K} \log(s_k) \cos\left[ \frac{n(k-0.5)\pi}{K} \right], \quad n = 1, 2, \dots, L

where K is the number of band-pass filters, s_k is the mel-weighted spectrum after passing through the kth triangular band-pass filter, and L is the order of the cepstrum.
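As an illustration, the cepstral sum above is a short NumPy computation; this is a minimal sketch under the definition just given (the function name and the 26-filter usage below are our assumptions, not part of the invention):

```python
import numpy as np

def mfcc_from_mel_spectrum(s, L=13):
    """Cepstral coefficients c_1..c_L from a mel-weighted spectrum s
    (one value per triangular band-pass filter, K = len(s)), following
    the cosine-sum MFCC definition above."""
    K = len(s)
    n = np.arange(1, L + 1)[:, None]            # cepstrum orders 1..L
    k = np.arange(1, K + 1)[None, :]            # filter indices 1..K
    basis = np.cos(n * (k - 0.5) * np.pi / K)   # cosine basis of the sum
    return np.sqrt(2.0 / K) * (basis @ np.log(s))
```

A flat mel spectrum (all filters equal to 1) has zero log energy in every band, so all cepstral coefficients vanish, which is a quick sanity check on the implementation.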
2) Zero Crossing Rate (ZCR)
A zero crossing occurs when consecutive samples have different signs in the time domain. ZCR is defined as the number of zero crossings in a frame:

Z_i = \frac{1}{2} \sum_{m} \left| \mathrm{sgn}[x_i(m)] - \mathrm{sgn}[x_i(m-1)] \right|

where sgn(·) is the sign function and x_i(m) is the ith frame of the discrete audio signal.
3) Short-term energy (STE)
STE is the total energy of the samples in an audio frame. It is defined as:

E_i = \sum_{m} x_i^2(m)

where x_i(m) is the ith frame of the discrete audio signal. STE can be used to distinguish silence from non-silence.
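Both frame-level measures above are one-liners in NumPy; a minimal sketch (function names are ours):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Count of sign changes between consecutive samples in one frame:
    (1/2) * sum |sgn(x[m]) - sgn(x[m-1])|."""
    return 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))

def short_time_energy(frame):
    """Total energy of the samples in one frame: sum of x[m]^2."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))
```

A frame of alternating signs maximizes the ZCR, while a near-zero frame drives both values toward zero, which is exactly the behavior the silence rule later in the description exploits.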
4) Sub-band energy ratio
The frequency energy (FE) is the spectral power in a frame, defined as:

FE = \int_0^{\omega_0} |F(\omega)|^2 \, d\omega

where F(ω) is the FFT coefficient, |F(ω)|^2 is the power at frequency ω, and ω_0 is half the sampling frequency. The ratio of the sub-band power to the total power in a frame is defined as:

ER_i = \frac{1}{FE} \int_{L_i}^{H_i} |F(\omega)|^2 \, d\omega

where H_i and L_i are the upper and lower limits of sub-band i, respectively. We divide the frequency domain of the audio signal into four sub-bands with intervals [0, ω_0/8], [ω_0/8, ω_0/4], [ω_0/4, ω_0/2], and [ω_0/2, ω_0].
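In discrete terms, the integrals become sums over FFT power bins. A sketch under that assumption (the function name and bin-edge handling are ours):

```python
import numpy as np

def subband_energy_ratios(frame, fs):
    """Energy ratio of the four sub-bands [0, w0/8], [w0/8, w0/4],
    [w0/4, w0/2], [w0/2, w0], with w0 = fs/2, computed from the
    FFT power spectrum |F(w)|^2."""
    power = np.abs(np.fft.rfft(frame)) ** 2           # |F(w)|^2 up to fs/2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    w0 = fs / 2.0
    edges = [0.0, w0 / 8, w0 / 4, w0 / 2, w0 + 1.0]   # last edge opened to include the Nyquist bin
    total = power.sum()
    return [float(power[(freqs >= lo) & (freqs < hi)].sum() / total)
            for lo, hi in zip(edges[:-1], edges[1:])]
```

For a pure tone well inside the lowest sub-band, nearly all of the energy falls into the first ratio and the four ratios sum to one, matching the definition of ER_i as a partition of the total power.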
2. Classifier
1) SVM
Given a set of training vectors belonging to two classes, (x_1, y_1), …, (x_l, y_l), where x_i ∈ R^n is a feature vector and y_i ∈ {+1, -1} is a class label, training finds the optimal decision function:

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)

where α_i and b are the parameters of the classifier and K(·,·) is the kernel function. The training vectors x_i whose α_i are non-zero are called support vectors.
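Evaluating the decision function only requires the support vectors and their coefficients. A minimal sketch, assuming an RBF kernel (the description does not fix the kernel; the function name and default gamma are ours):

```python
import numpy as np

def svm_decide(x, sv, alpha, y, b, gamma=1.0):
    """Evaluate f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b ) over
    the support vectors sv (the training points with alpha_i != 0).
    K is taken here as the RBF kernel exp(-gamma * ||x_i - x||^2)."""
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))  # K(x_i, x) for all i
    return int(np.sign(np.dot(alpha * y, k) + b))
```

With two 1-D support vectors of opposite labels, points nearer the positive vector are classified +1 and points nearer the negative one -1, which is the expected behavior of the decision boundary.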
2) Audio classification scheme
In the present invention, audio frames are divided into three categories: silence, music, and speech. Silence is distinguished according to heuristic rules based on ZCR and STE. Music and speech are then separated among the non-silent frames using an SVM-based classifier. The detailed audio classification system is shown in fig. 1.
Based on experimental observations, we propose a dual-threshold detection rule. The rule for detecting silence is defined as:

STE < STE_threshold, ZCR < ZCR_threshold

If both the STE and the ZCR of a frame are below their predefined thresholds, the frame is marked as silence.
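The dual-threshold rule is a single conjunction per frame. A minimal sketch (the default threshold values are illustrative assumptions; the description leaves them to be predefined experimentally):

```python
import numpy as np

def is_silence(frame, ste_thr=0.01, zcr_thr=5):
    """Dual-threshold rule: a frame is silence iff BOTH its short-time
    energy and its zero-crossing rate fall below their thresholds."""
    ste = float(np.sum(frame ** 2))
    zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))
    return ste < ste_thr and zcr < zcr_thr
```

Requiring both conditions avoids labeling low-energy but rapidly oscillating frames (e.g. faint fricatives) as silence on energy alone.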
3. Audio segmentation
Audio segmentation is the alignment of audio class clusters on a timeline. It aims at dividing the audio data into homogeneous segments. Story unit segmentation is performed using the segmented categories. Fig. 2 shows the process of audio segmentation.
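One simple way to read the "alignment of audio class clusters on a timeline" is a run-length merge of per-frame labels into homogeneous segments; a sketch under that interpretation (the function name is ours):

```python
def merge_into_segments(frame_labels):
    """Merge consecutive identical frame labels into
    (label, start_frame, end_frame) segments."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # close the current run at the end of the list or on a label change
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments
```

The resulting segment boundaries are what the story-unit segmentation of fig. 2 operates on.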
4. Speech recognition
The speech recognition system comprises speech endpoint detection, feature extraction, acoustic and language models and a recognition engine. The diagram in fig. 3 shows the entire process.
Description of the drawings:
FIG. 1 is a flow chart of audio classification according to the present invention.
FIG. 2 is a flow chart of audio segmentation according to the present invention.
FIG. 3 is a diagram of a speech recognition system according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to examples to enable those skilled in the art to practice the invention with reference to the description.
It should be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or combinations thereof.
The method for recognizing the voice in the audio and video comprises the following steps:
1) audio data is extracted from the video stream.
2) Extracting 13-order MFCC, 13-order first-order difference MFCC, zero crossing rate, short-time energy and subband energy ratio:
Mel-frequency cepstrum coefficients (MFCC):

c_n = \sqrt{2/K} \sum_{k=1}^{K} \log(s_k) \cos\left[ \frac{n(k-0.5)\pi}{K} \right], \quad n = 1, 2, \dots, L

Zero-crossing rate (ZCR):

Z_i = \frac{1}{2} \sum_{m} \left| \mathrm{sgn}[x_i(m)] - \mathrm{sgn}[x_i(m-1)] \right|

Short-time energy (STE):

E_i = \sum_{m} x_i^2(m)

Sub-band energy ratio:

ER_i = \frac{1}{FE} \int_{L_i}^{H_i} |F(\omega)|^2 \, d\omega, \quad FE = \int_0^{\omega_0} |F(\omega)|^2 \, d\omega

3) Classify the non-silent frames with an SVM classifier:

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)

4) Distinguish silence by the dual-threshold detection rule:

STE < STE_threshold, ZCR < ZCR_threshold
5) Perform audio segmentation by aligning the audio class clusters on the timeline.
6) Recognize and output the speech through a speech recognition system.
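The steps above can be sketched end to end in miniature. This is an illustrative assembly, not the invention's implementation: the classifier is stubbed (the invention uses the SVM described earlier), and all numeric defaults are our assumptions:

```python
import numpy as np

def label_stream(signal, fs, frame_len=400, ste_thr=0.01, zcr_thr=20,
                 classify=lambda frame: "speech"):
    """Frame the audio, apply the dual-threshold silence rule, hand
    non-silent frames to a classifier, and run-length merge the
    per-frame labels into homogeneous (label, start, end) segments."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        ste = float(np.sum(frame ** 2))
        zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frame))))
        labels.append("silence" if ste < ste_thr and zcr < zcr_thr
                      else classify(frame))
    out, s = [], 0  # merge consecutive identical labels into segments
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[s]:
            out.append((labels[s], s, i))
            s = i
    return out
```

The speech segments produced this way are what would be handed to the recognition engine of step 6.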
While embodiments of the invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments; it is applicable in all fields suited to it, and further modifications may readily be effected by those skilled in the art. The invention is therefore not limited to the specific details shown without departing from the general concept defined by the claims and their scope of equivalents.

Claims (3)

1. A speech recognition technique based on audio media analysis, comprising the steps of:
1) extracting frame-level features, including 13-order MFCCs, 13-order first-order difference MFCCs, zero-crossing rate, short-time energy, and subband energy ratio;
2) classifying the frames using an SVM-based classifier;
3) performing story unit segmentation using the segmented categories;
4) recognition is performed using a speech recognition system that includes speech endpoint detection, feature extraction, acoustic and language models, and a recognition engine.
2. The speech recognition technique for audio media analysis of claim 1, wherein the frame-level features, comprising 13-order MFCCs, 13-order first-order difference MFCCs, zero-crossing rate, short-time energy, and subband energy ratio, are determined as follows:
1) Mel Frequency Cepstrum Coefficient (MFCC)

c_n = \sqrt{2/K} \sum_{k=1}^{K} \log(s_k) \cos\left[ \frac{n(k-0.5)\pi}{K} \right], \quad n = 1, 2, \dots, L   (1)

2) Zero Crossing Rate (ZCR)

Z_i = \frac{1}{2} \sum_{m} \left| \mathrm{sgn}[x_i(m)] - \mathrm{sgn}[x_i(m-1)] \right|   (2)

3) Short-time energy (STE)

E_i = \sum_{m} x_i^2(m)   (3)

4) Sub-band energy ratio

FE = \int_0^{\omega_0} |F(\omega)|^2 \, d\omega   (4)

5) The ratio of the sub-band power to the total power in a frame:

ER_i = \frac{1}{FE} \int_{L_i}^{H_i} |F(\omega)|^2 \, d\omega   (5)
3. The speech recognition technique for audio media analysis according to claim 1 or 2, wherein the classification of the audio is determined by the following equations:
1) SVM

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)   (6)

2) Audio classification scheme

STE < STE_threshold, ZCR < ZCR_threshold   (7)
CN202010922652.6A 2020-09-04 2020-09-04 Voice recognition technology based on audio media analysis Pending CN114141244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010922652.6A CN114141244A (en) 2020-09-04 2020-09-04 Voice recognition technology based on audio media analysis


Publications (1)

Publication Number Publication Date
CN114141244A true CN114141244A (en) 2022-03-04

Family

ID=80438767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010922652.6A Pending CN114141244A (en) 2020-09-04 2020-09-04 Voice recognition technology based on audio media analysis

Country Status (1)

Country Link
CN (1) CN114141244A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN102426835A (en) * 2011-08-30 2012-04-25 华南理工大学 Method for identifying local discharge signals of switchboard based on support vector machine model
CN105872855A (en) * 2016-05-26 2016-08-17 广州酷狗计算机科技有限公司 Labeling method and device for video files
CN106407960A (en) * 2016-11-09 2017-02-15 浙江师范大学 Multi-feature-based classification method and system for music genres
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN108416592A (en) * 2018-03-19 2018-08-17 成都信达智胜科技有限公司 A kind of high speed voice recognition methods
CN108806668A (en) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 A kind of audio and video various dimensions mark and model optimization method
CN109766929A (en) * 2018-12-24 2019-05-17 重庆第二师范学院 A kind of audio frequency classification method and system based on SVM
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI S Z: "Content-based audio classification and retrieval using SVM learning" *
孙文静; 李士强: "Design and implementation of an SVM-based audio classification system" (基于SVM的音频分类系统设计及实现) *
王骞; 何培宇; 徐自励: "Deep neural network speech enhancement method using singular spectrum analysis" (利用奇异谱分析的深度神经网络语音增强方法) *
陈戈珩; 胡明辉; 吴天华: "Audio signal classification algorithm based on support vector machine and HMM" (基于支持向量机和HMM的音频信号分类算法) *

Similar Documents

Publication Publication Date Title
Theodorou et al. An overview of automatic audio segmentation
Kos et al. Acoustic classification and segmentation using modified spectral roll-off and variance-based features
Dhanalakshmi et al. Classification of audio signals using AANN and GMM
CN102486920A (en) Audio event detection method and device
JP2004229283A (en) Method for identifying transition of news presenter in news video
Kwon et al. Speaker change detection using a new weighted distance measure
Jothilakshmi et al. Unsupervised speaker segmentation with residual phase and MFCC features
Khan et al. Machine-learning based classification of speech and music
Kim et al. Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation
EP1520238A2 (en) Method for segmenting a compressed video
Vavrek et al. Broadcast news audio classification using SVM binary trees
Ghosal et al. Automatic male-female voice discrimination
Nyodu et al. Automatic identification of Arunachal language using K-nearest neighbor algorithm
Li et al. Instructional video content analysis using audio information
Krishnamoorthy et al. Hierarchical audio content classification system using an optimal feature selection algorithm
Oger et al. Transcription-based video genre classification
CN114141244A (en) Voice recognition technology based on audio media analysis
Abburi et al. Improved multimodal sentiment detection using stressed regions of audio
Feki et al. Audio stream analysis for environmental sound classification
Feng et al. An audio classification and speech recognition system for video content analysis
Barbedo et al. A robust and computationally efficient speech/music discriminator
Xie Discovering salient prosodic cues and their interactions for automatic story segmentation in Mandarin broadcast news
Liu et al. Classification of music and speech in mandarin news broadcasts
Li Nonexclusive audio segmentation and indexing as a pre-processor for audio information mining
Roy Speaker indexing using neural network clustering of vowel spectra

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220304