CN114141244A - Voice recognition technology based on audio media analysis - Google Patents
Voice recognition technology based on audio media analysis
- Publication number
- CN114141244A (application CN202010922652.6A)
- Authority
- CN
- China
- Prior art keywords
- audio
- order
- media analysis
- ste
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a voice recognition technology based on audio media analysis, which classifies and recognizes the current audio/video information stream through audio feature extraction, heuristic rules, and a speech recognition system. Audio features are extracted as 13-order MFCCs, classification is performed by an SVM-based classifier, and the speech is finally recognized and output by a dedicated speech recognition engine.
Description
Technical Field
The invention relates to the field of computer voice recognition, in particular to a method for recognizing voice in audio and video.
Background
Nowadays, with the rapid development of computer networks and multimedia technology, the amount of multimedia information keeps growing. As the main multimedia form, video plays an important role in our daily life, and content-based video retrieval has attracted increasing research attention. Early research mostly focused on visual information, but such low-level features can hardly express high-level semantic concepts. Multimodal video content analysis is therefore a challenging research area. As an important component of video, audio can provide useful information for video content analysis, and it also offers important clues in particular domains such as news and sports video.
The present invention introduces our audio content analysis framework, which analyzes video content through automatic audio classification and speech recognition, and performs speech-to-text conversion.
Disclosure of Invention
The problem to be solved by the invention is how to apply audio analysis and speech recognition to video content analysis.
In order to achieve the aim of video content analysis, the invention adopts the technical scheme that:
first, we extract audio data from the video stream. Second, the audio data is divided into homogeneous segments. In this section, audio features based on the audio frame are extracted, including Mel Frequency Cepstral Coefficients (MFCCs), short-term energy, zero-crossing rate, and subband energy ratio. The frames are then classified into silence, music, and speech using a classifier based on rules and SVM. Four heuristic rules are then applied in the final audio segmentation. Finally, based on the audio segmentation information, speech segments and MFCC features to be used for speech recognition are extracted.
The method comprises the following steps:
1. audio feature extraction
The frame-level features we extract include the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio. These features are defined as follows:
1) Mel Frequency Cepstrum Coefficient (MFCC)

$$c_l = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} \log(s_k)\cos\!\left(\frac{l\,(k-0.5)\,\pi}{K}\right), \qquad l = 1, \dots, L$$

where $K$ is the number of band-pass filters, $s_k$ is the mel-weighted spectrum after passing through the $k$-th triangular band-pass filter, and $L$ is the order of the cepstrum.
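A minimal sketch of the cepstral step described above, assuming the mel-weighted spectrum $s_k$ has already been produced by a $K$-filter triangular filterbank (the filterbank itself is not shown, and the toy input values are hypothetical):

```python
import numpy as np

def mfcc_from_mel_spectrum(s, L=13):
    """Cepstral coefficients c_l (l = 1..L) from a mel-weighted
    spectrum s_k produced by K triangular band-pass filters."""
    s = np.asarray(s, dtype=float)
    K = len(s)
    k = np.arange(1, K + 1)
    return np.array([
        np.sqrt(2.0 / K) * np.sum(np.log(s) * np.cos(l * (k - 0.5) * np.pi / K))
        for l in range(1, L + 1)
    ])

# toy mel spectrum from K = 24 filters (hypothetical values)
c = mfcc_from_mel_spectrum(np.linspace(0.5, 2.0, 24))
print(c.shape)  # (13,)
```

A real front end would frame the signal, window it, and apply the mel filterbank to the FFT magnitudes before this step.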
2) Zero Crossing Rate (ZCR)
A zero crossing occurs when the time-domain signs of consecutive samples differ. The ZCR is defined as the number of zero crossings in a frame:

$$Z_i = \frac{1}{2}\sum_{m=1}^{N-1}\left|\operatorname{sgn}(x_i(m)) - \operatorname{sgn}(x_i(m-1))\right|$$

where $\operatorname{sgn}(\cdot)$ is the sign function, $x_i(m)$ is the $i$-th frame of the discrete audio signal, and $N$ is the frame length.
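A direct sketch of this definition, counting sign changes between consecutive samples (treating a zero sample as positive is an assumption, since the text leaves that case open):

```python
import numpy as np

def zero_crossing_rate(x):
    """Count zero crossings in one frame: a crossing occurs whenever
    the signs of two consecutive samples differ."""
    s = np.sign(np.asarray(x, dtype=float))
    s[s == 0] = 1.0  # assumption: treat exact zeros as positive
    # |sgn(x(m)) - sgn(x(m-1))| is 2 at a crossing, 0 otherwise
    return int(np.sum(np.abs(np.diff(s)) / 2))

print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))  # 3
```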
3) Short-term energy (STE)
STE is the total energy of the samples in an audio frame:

$$E_i = \sum_{m=0}^{N-1} x_i(m)^2$$

where $x_i(m)$ is the $i$-th frame of the discrete audio signal. STE can be used to distinguish silence from non-silence.
4) Sub-band energy ratio
The Frequency Energy (FE) is the total spectral power in a frame:

$$FE = \int_{0}^{\omega_0} |F(\omega)|^2 \, d\omega$$

where $F(\omega)$ is the FFT coefficient, $|F(\omega)|^2$ is the power at frequency $\omega$, and $\omega_0$ is half the sampling frequency. The ratio of the power of sub-band $i$ to the total power in a frame is defined as:

$$D_i = \frac{1}{FE}\int_{L_i}^{H_i} |F(\omega)|^2 \, d\omega$$

where $H_i$ and $L_i$ are the upper and lower limits of sub-band $i$, respectively. We divide the frequency domain of the audio signal into four sub-bands over the intervals $[0, \omega_0/8]$, $[\omega_0/8, \omega_0/4]$, $[\omega_0/4, \omega_0/2]$, and $[\omega_0/2, \omega_0]$.
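A discrete sketch of the four sub-band ratios, replacing the integrals with sums over FFT power bins (the test signal and sampling rate are illustrative):

```python
import numpy as np

def subband_energy_ratios(x, fs):
    """Ratio of power in the four sub-bands [0, w0/8], [w0/8, w0/4],
    [w0/4, w0/2], [w0/2, w0] to the total frame power, with w0 = fs/2."""
    power = np.abs(np.fft.rfft(x)) ** 2               # |F(w)|^2 up to fs/2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    w0 = fs / 2.0
    edges = [0.0, w0 / 8, w0 / 4, w0 / 2, w0]
    masks = [(freqs >= lo) & (freqs < hi)
             for lo, hi in zip(edges[:-1], edges[1:])]
    masks[-1] |= (freqs == w0)                        # keep the Nyquist bin
    total = power.sum()
    return [float(power[m].sum() / total) for m in masks]

# a 100 Hz sine at fs = 8 kHz puts almost all power in the lowest band
r = subband_energy_ratios(np.sin(2 * np.pi * 100 * np.arange(1024) / 8000), 8000)
```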
2. Classifier
1)SVM
Given a set of training vectors belonging to two classes, $(x_1, y_1), \dots, (x_l, y_l)$, where $x_i \in \mathbb{R}^n$ is a feature vector and $y_i \in \{+1, -1\}$ is the class label, the SVM finds the optimal decision function over this training set:

$$f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b\right)$$

where $\alpha_i$ and $b$ are the classifier parameters and $K(\cdot, \cdot)$ is the kernel function. The training vectors $x_i$ for which $\alpha_i$ is non-zero are called support vectors.
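Once the support vectors and their multipliers are known, the decision function can be evaluated directly. A sketch with an RBF kernel and made-up toy parameters (a real system would obtain the support vectors, $\alpha_i$, and $b$ from SVM training; the kernel choice here is an assumption, since the patent does not name one):

```python
import numpy as np

def svm_decide(x, support_vectors, alphas, labels, b, gamma=1.0):
    """Evaluate f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b)
    with an RBF kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    k = np.exp(-gamma * np.sum((support_vectors - x) ** 2, axis=1))
    return int(np.sign(np.dot(alphas * labels, k) + b))

# toy support vectors, one per class (hypothetical trained values)
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([1.0, -1.0])
print(svm_decide(np.array([0.1, 0.0]), sv, alphas, labels, b=0.0))  # 1
```

Points near the first support vector are assigned class +1, points near the second class -1.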
2) Audio classification scheme
In the present invention, audio frames are divided into three categories: silence, music, and speech. Silence is distinguished by heuristic rules based on ZCR and STE; the remaining non-silence frames are then classified into music and speech by an SVM-based classifier. The detailed audio classification system is shown in Fig. 1.
Based on experimental observations, we propose a dual-threshold detection rule. The rule for detecting silence is defined as:

$$STE < STE_{threshold}, \qquad ZCR < ZCR_{threshold}$$

If both the STE and the ZCR of a frame are below their predefined thresholds, the frame is marked as silence.
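A sketch of the dual-threshold rule; the threshold values here are hypothetical placeholders, since the patent only says the thresholds are chosen empirically:

```python
import numpy as np

def is_silence(frame, ste_threshold=0.01, zcr_threshold=10):
    """Dual-threshold rule: a frame is silence when both its short-time
    energy and zero-crossing rate fall below their thresholds
    (threshold values are illustrative, not from the patent)."""
    frame = np.asarray(frame, dtype=float)
    ste = float(np.sum(frame ** 2))          # short-time energy
    s = np.sign(frame)
    s[s == 0] = 1.0
    zcr = int(np.sum(np.abs(np.diff(s)) / 2))  # zero-crossing count
    return ste < ste_threshold and zcr < zcr_threshold
```

In practice the thresholds would be tuned on labeled frames so that quiet, low-activity frames are flagged while speech and music pass through.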
3. Audio segmentation
Audio segmentation aligns the audio class clusters on the timeline, with the aim of dividing the audio data into homogeneous segments. Story-unit segmentation is then performed using the segmented categories. Fig. 2 shows the audio segmentation process.
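The alignment of per-frame class labels into homogeneous segments can be sketched as a simple run-length merge (the story-unit step built on top of these segments is not shown):

```python
def merge_labels(frame_labels):
    """Merge consecutive frames carrying the same class label into
    homogeneous segments, returned as (label, start, end) with
    inclusive frame indices."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # close the current run at the end of input or on a label change
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i - 1))
            start = i
    return segments

print(merge_labels(["speech", "speech", "music", "silence", "silence"]))
# [('speech', 0, 1), ('music', 2, 2), ('silence', 3, 4)]
```

A production segmenter would typically also smooth the label sequence first to suppress single-frame classification glitches.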
4. Speech recognition
The speech recognition system comprises speech endpoint detection, feature extraction, acoustic and language models, and a recognition engine. The diagram in Fig. 3 shows the entire process.
Description of the drawings:
FIG. 1 is a flow chart of audio classification according to the present invention.
FIG. 2 is a flow chart of audio segmentation according to the present invention.
FIG. 3 is a diagram of a speech recognition system according to the present invention.
Detailed Description
The present invention is described in further detail below with reference to embodiments, so that those skilled in the art can practice the invention with reference to this description.
It should be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or combinations thereof.
The method for recognizing speech in audio and video comprises the following steps:
1) Extract audio data from the video stream.
2) Extract the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio, as defined above.
3) Classify the frames with the SVM classifier.
4) Distinguish silence with the dual-threshold detection rule: $STE < STE_{threshold}$, $ZCR < ZCR_{threshold}$.
5) Perform audio segmentation by aligning the audio class clusters on the timeline.
6) Recognize and output the speech through the speech recognition model.
While embodiments of the invention have been disclosed above, it is not limited to the applications listed in the description and the embodiments, which are fully applicable in all kinds of fields of application of the invention, and further modifications may readily be effected by those skilled in the art, so that the invention is not limited to the specific details without departing from the general concept defined by the claims and the scope of equivalents.
Claims (3)
1. A speech recognition technique based on audio media analysis, comprising the steps of:
1) extracting frame-level features, including the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio;
2) classifying the frames using an SVM-based classifier;
3) performing story unit segmentation using the segmented categories;
4) recognition is performed using a speech recognition system that includes speech endpoint detection, feature extraction, acoustic and language models, and a recognition engine.
2. The speech recognition technique based on audio media analysis of claim 1, wherein the frame-level features, including the 13-order MFCC, 13-order first-order difference MFCC, zero-crossing rate, short-time energy, and sub-band energy ratio, are determined by:
1) the Mel Frequency Cepstrum Coefficient (MFCC);
2) the Zero Crossing Rate (ZCR);
3) the Short-Term Energy (STE);
4) the sub-band energy ratio;
5) the ratio of the sub-band power to the total power in a frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010922652.6A CN114141244A (en) | 2020-09-04 | 2020-09-04 | Voice recognition technology based on audio media analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114141244A true CN114141244A (en) | 2022-03-04 |
Family
ID=80438767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010922652.6A Pending CN114141244A (en) | 2020-09-04 | 2020-09-04 | Voice recognition technology based on audio media analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114141244A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101751692A (en) * | 2009-12-24 | 2010-06-23 | 四川大学 | Method for voice-driven lip animation |
CN102426835A (en) * | 2011-08-30 | 2012-04-25 | 华南理工大学 | Method for identifying local discharge signals of switchboard based on support vector machine model |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
CN106407960A (en) * | 2016-11-09 | 2017-02-15 | 浙江师范大学 | Multi-feature-based classification method and system for music genres |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN108416592A (en) * | 2018-03-19 | 2018-08-17 | 成都信达智胜科技有限公司 | A kind of high speed voice recognition methods |
CN108806668A (en) * | 2018-06-08 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | A kind of audio and video various dimensions mark and model optimization method |
CN109766929A (en) * | 2018-12-24 | 2019-05-17 | 重庆第二师范学院 | A kind of audio frequency classification method and system based on SVM |
CN111145786A (en) * | 2019-12-17 | 2020-05-12 | 深圳追一科技有限公司 | Speech emotion recognition method and device, server and computer readable storage medium |
Non-Patent Citations (4)
Title |
---|
LI S Z: "Content-Based Audio Classification and Retrieval Using SVM Learning" *
孙文静; 李士强: "Design and Implementation of an SVM-Based Audio Classification System" *
王骞; 何培宇; 徐自励: "Deep Neural Network Speech Enhancement Method Using Singular Spectrum Analysis" *
陈戈珩; 胡明辉; 吴天华: "Audio Signal Classification Algorithm Based on Support Vector Machine and HMM" *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Theodorou et al. | An overview of automatic audio segmentation | |
Kos et al. | Acoustic classification and segmentation using modified spectral roll-off and variance-based features | |
Dhanalakshmi et al. | Classification of audio signals using AANN and GMM | |
CN102486920A (en) | Audio event detection method and device | |
JP2004229283A (en) | Method for identifying transition of news presenter in news video | |
Kwon et al. | Speaker change detection using a new weighted distance measure | |
Jothilakshmi et al. | Unsupervised speaker segmentation with residual phase and MFCC features | |
Khan et al. | Machine-learning based classification of speech and music | |
Kim et al. | Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation | |
EP1520238A2 (en) | Method for segmenting a compressed video | |
Vavrek et al. | Broadcast news audio classification using SVM binary trees | |
Ghosal et al. | Automatic male-female voice discrimination | |
Nyodu et al. | Automatic identification of Arunachal language using K-nearest neighbor algorithm | |
Li et al. | Instructional video content analysis using audio information | |
Krishnamoorthy et al. | Hierarchical audio content classification system using an optimal feature selection algorithm | |
Oger et al. | Transcription-based video genre classification | |
CN114141244A (en) | Voice recognition technology based on audio media analysis | |
Abburi et al. | Improved multimodal sentiment detection using stressed regions of audio | |
Feki et al. | Audio stream analysis for environmental sound classification | |
Feng et al. | An audio classification and speech recognition system for video content analysis | |
Barbedo et al. | A robust and computationally efficient speech/music discriminator | |
Xie | Discovering salient prosodic cues and their interactions for automatic story segmentation in Mandarin broadcast news | |
Liu et al. | Classification of music and speech in mandarin news broadcasts | |
Li | Nonexclusive audio segmentation and indexing as a pre-processor for audio information mining | |
Roy | Speaker indexing using neural network clustering of vowel spectra |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20220304 |