CN102014278A - Intelligent video monitoring method based on voice recognition technology - Google Patents

Intelligent video monitoring method based on voice recognition technology

Info

Publication number
CN102014278A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
voice
monitoring
speech
warning
scene
Prior art date
Application number
CN 201010598197
Other languages
Chinese (zh)
Inventor
孙大飞
高勇
黄永华
Original Assignee
四川大学 (Sichuan University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention relates to an intelligent video monitoring method based on speech recognition technology. Speech recognition is introduced into a video monitoring system as an auxiliary monitoring means, so that the system gains a degree of hearing in addition to its visual function, yielding a video monitoring system with active early warning and intelligent switching of monitored pictures. The method comprises the following steps: a sensitive-word bank is established in advance; the speech data in the monitored scene is then processed with speech recognition technology and checked for sensitive words; if a sensitive word is present, a voice warning signal and a picture-switching signal are issued, and a picture-switching device automatically switches the monitored pictures according to the switching signal. The method relieves the operator fatigue and missed alarms caused by relying on human-eye inspection, overcomes the limitation of monitoring with video information alone, improves video monitoring efficiency, and makes video monitoring more accurate, intelligent and user-friendly.

Description

Intelligent video monitoring method based on speech recognition technology

Technical Field

[0001] The present invention belongs to the field of security monitoring, and in particular relates to an intelligent video monitoring method based on speech recognition technology.

Background

[0002] Since the events of September 11, 2001, how to carry out all-weather, automatic, real-time monitoring of important national security departments and sensitive public places has become an issue of great concern around the world. Against this background, security monitoring technology has been widely applied and developed. In China, the application market of the security monitoring industry maintains an annual growth rate of about 20%; this steady growth reflects the importance the country attaches to security monitoring.

[0003] Video monitoring has long been used as an effective monitoring means and is widely applied in the security field: cameras are deployed at distributed locations and networked to record the monitored scenes, and the pictures are displayed centrally. Operators can follow the events occurring in each monitored scene in real time, judge the intentions of the persons in the monitored picture from their behavior, and take measures quickly when an emergency occurs.

[0004] In current video monitoring systems the operators play a crucial role: they watch every video channel in real time with their own eyes. Studies have shown that even a professional operator who concentrates on multiple monitoring screens continuously for more than 20 minutes will see his or her attention drop below the level required for effective monitoring. After long working hours, problems such as operator fatigue and missed alarms gradually appear and greatly weaken the supervisory role of video monitoring. The common practice in existing monitoring systems is to record the camera output; only after an accident has happened do security personnel review the recordings to see what occurred, which is often too late. On the other hand, video monitoring usually processes video information only, and video information alone cannot fully and accurately reflect the actual situation of the monitored scene. It still has limitations: constrained by the viewing angle, video monitoring can do nothing about events that happen outside the coverage of the cameras, and its effectiveness drops sharply under poor lighting or bad weather, especially at night. The drawbacks of human-eye inspection and the inherent limitations of video monitoring systems restrict system performance, reduce monitoring efficiency, and often lead to missed emergencies or even irreparable losses. What is desired today is a monitoring system capable of continuous, intelligent, 24-hour real-time surveillance that can alert security personnel accurately and promptly when an abnormal situation occurs, so that accidents are avoided, while also reducing the investment of manpower, material and financial resources.

[0005] Language is the most important means of human communication: it is natural, convenient, accurate and efficient. In situations such as quarrels, fights or calls for help, the speech produced is particularly rich in information. On this basis, processing the speech data of certain monitored scenes with speech recognition technology can also serve as an important means of security monitoring. Over the past two decades in particular, speech recognition has made remarkable progress and has begun to move from the laboratory to the market. It is expected that within the next ten years speech recognition will enter fields such as industry, home appliances, communications, automotive electronics, medical care, home services and consumer electronics. As a leading direction of intelligent computing research and a key technology for human-machine speech communication, speech recognition has long attracted wide attention from the scientific community worldwide, and with recent research breakthroughs its importance to the development of computing and to social life has become increasingly evident.

Summary of the Invention

[0006] In order to solve the problems of video monitoring, the present invention provides a new security monitoring method in which speech recognition technology is introduced on top of existing video monitoring technology. Audio and video information are processed relatively independently so that each can play to its strengths, and the two monitoring means complement each other, building a new security monitoring system with active early warning and intelligent switching of monitored pictures. The monitoring system thus gains a degree of "hearing" on top of its "vision", overcoming the limitation of relying on video information alone. The invention uses speech recognition to process the speech data of the monitored scene, raises an alert for the sensitive words it contains, realizes active early warning in the monitoring system, and lets the warning signal trigger automatic switching of the monitored picture. This addresses the operator fatigue and missed alarms caused by long working hours, raises the efficiency of video monitoring, and lets the video monitoring system perform better.

[0007] The new security monitoring system described above adds a speech processing and recognition module and a warning decision module on top of a digital video monitoring system. Because audio and video are processed relatively independently, existing video monitoring equipment can easily be upgraded and updated. Before the system is put into operation, a speech template library is built for the sensitive words used in the scenes to be monitored; the contents of the library can be chosen according to the monitored scene, for example words used in quarrels, fights or calls for help such as "help", "save me", "somebody come" and "they're fighting". When the system is running, the video and speech information of each monitored scene is collected through two separate channels. The video information is encoded and converted in format, and the processed video data is sent over a private network or LAN to the monitoring room, where it is displayed and stored; the speech information is sent to the speech processing and recognition module for speech recognition. The warning decision module then examines the recognition result of the speech data from the monitored scene, rejects irrelevant speech, raises an alert for sensitive words contained in the established speech template library, and issues a warning control signal. The warning control signal triggers the picture switching device, and the main monitoring screen switches between scene pictures according to the source of the warning control signal. Active early warning and intelligent picture switching are achieved in this way.
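For orientation only, the two independent channels described in this paragraph can be pictured as two concurrent loops; every function and object name below (including SENSITIVE_WORDS) is a hypothetical placeholder for illustration, not an interface defined by the patent.

```python
import threading

def video_channel(camera, network):
    """Video path: encode each captured frame and forward it over the
    LAN / private network for display and storage in the monitoring room."""
    while True:
        frame = camera.read()
        network.send(encode(frame))          # encode() is a placeholder

def audio_channel(microphone, alarm, switcher, channel_id):
    """Audio path: recognize captured speech and, on a sensitive word,
    raise the alarm and switch the main monitoring screen."""
    while True:
        utterance = microphone.record_utterance()
        word = recognize(utterance)          # recognize() is a placeholder
        if word is not None and word in SENSITIVE_WORDS:
            alarm.trigger(word)              # active early warning
            switcher.show_on_main_screen(channel_id)

# The two channels of one monitored scene run independently, as in Fig. 1:
# threading.Thread(target=video_channel, args=(cam, net)).start()
# threading.Thread(target=audio_channel, args=(mic, alarm, sw, 1)).start()
```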

[0008] The technical solution adopted by the invention comprises the following six steps, executed in a loop:

1. Before the monitoring system starts working, a speech template library is established for the sensitive words that may occur in the monitored scene. Several dozen speakers are required for recording, and speech samples from multiple male and female speakers are collected as training data. The content can be chosen according to the monitored scene, for example signal words used in quarrels, fights or calls for help such as "help", "save me", "somebody come" and "they're fighting". Recording is carried out in stages, and the training corpus consists essentially of isolated words and short sentences. The speech template library is first trained separately from each speaker's samples, yielding multiple reference templates; each reference template is a set of hidden Markov models (HMMs), one per word. Each HMM stores not only the three usual parameters, namely the initial state probabilities, the state transition probability matrix and the observation probability matrix, but also the state transition counts, the state output vector counts and the number of states, six parameters in all. Finally, the reference templates are merged into one by the model merging and re-estimation method, completing the library;

2. The system starts working, and the video and speech information of the monitored scene is collected by the camera device and the sound collection device respectively;

3. The video signal is encoded and converted in format, and the processed video data is sent over a private network or LAN to the monitoring room for display and storage; the speech information is sent to the speech processing and recognition module, where it undergoes a chain of processing before recognition: sampling, quantization, framing, windowing, pre-emphasis, endpoint detection, speech feature extraction, cepstral mean subtraction (CMS) and finally speech recognition;

4. The speech recognition result is sent to the warning decision module for judgment. The decision algorithm of this design uses a rejection method based on anti-word models: for each keyword model a corresponding anti-word model is trained, mainly from speech data that is easily confused with the keyword, and the anti-word model has the same structure as the keyword model. Normal speech occurring in the scene that is not included in the speech library is rejected; for sensitive words that occur in the monitored scene and are contained in the speech library, the warning module generates a warning control signal and raises an alarm, realizing the active early warning function;

5. The warning control signal triggers the picture switching device, and the main monitoring screen switches to the monitored picture that matches the warning control signal, for the operators to analyze; intelligent picture switching is thus achieved;

6. After one detection is completed, steps 2 to 5 are repeated for the next detection.

[0009] The beneficial effects of the invention are as follows. The active early warning based on audio information compensates for the limited viewing angle of the monitoring equipment and for the influence of lighting, weather and other natural conditions on video monitoring. The intelligent picture switching driven by the warning control signal solves the problems of human-eye inspection, prevents the attention loss caused by operators concentrating on multiple screens for long periods, makes accidents less likely to be missed, greatly improves monitoring efficiency, and makes video monitoring more accurate, intelligent and user-friendly, while also reducing the manpower, material and financial resources needed to employ large numbers of operators.

Brief Description of the Drawings

[0010] Fig. 1 is a schematic diagram of the structure of the intelligent video monitoring system based on speech recognition technology.

[0011] Fig. 2 is a block diagram of the speech processing and recognition module shown in Fig. 1.

[0012] Fig. 3 is a block diagram of the preprocessing and feature extraction module shown in Fig. 2.

[0013] Fig. 4 is a workflow diagram of the speech channel in the monitoring system.

[0014] Fig. 5 and Fig. 6 are schematic diagrams of one application of the monitoring method of the invention.

[0015] Fig. 1 is a schematic diagram of the structure of the intelligent video monitoring system based on speech recognition technology provided by the invention. One channel of the monitoring system consists of a camera device (101), a video signal encoding module (102), a monitoring picture display (103), a sound collection device (104), a speech processing and recognition module (105), a warning decision module (106) and a warning indication device (107). In addition, the scene information of each channel is transmitted over a LAN or private network (301), and the shared scene display uses the main monitoring screen (401) and the picture switching device (501).

[0016] Fig. 2 is a block diagram of the speech processing and recognition module (105), the core processing part of the speech channel of the monitoring system. The module consists of two parts: speech library enrollment and pattern recognition. The speech library enrollment part includes the following modules: training data (1051), preprocessing and feature extraction (1052), reference template training (1053) and reference templates (1054). The pattern recognition part includes preprocessing and feature extraction (1056), template matching (1057) and speech recognition (1058), where the preprocessing and feature extraction (1056) is functionally identical to the preprocessing and feature extraction (1052).

[0017] Fig. 3 is a block diagram of the preprocessing and feature extraction shown in Fig. 2. The speech data passes through the following processing in sequence: sampling (1052A), quantization (1052B), framing (1052C), windowing (1052D), pre-emphasis (1052E), endpoint detection (1052F), feature extraction (1052G) and cepstral mean subtraction (1052H).

[0018] The embodiments of the video monitoring method provided by the invention are further described below with reference to the drawings.

Detailed Description

[0019] A speech template library must be built before the system is put into operation; the library-building workflow corresponds to the speech library enrollment part of Fig. 2. Considering the practical applications of the invention, speaker adaptation cannot be used to achieve speaker-independent recognition (that method requires the user to train the system before each use, and once trained the models can only be used by that speaker), so speech samples from a large number of people must be collected as training data. Several dozen speakers are required for recording; speech samples from multiple male and female speakers are collected as training data (1051). The content of the training data (1051) is chosen according to the sensitive words contained in the emergencies of the monitored scene, for example signal words used in quarrels, fights or calls for help such as "help", "save me", "somebody come" and "they're fighting". Recording is carried out in 3 to 5 stages, and the training corpus consists essentially of isolated words and short sentences. Recording in stages is used because speech has a large dynamic range: the speech of different speakers, and even of the same speaker at different times and on different occasions, varies considerably, so the library should contain speech information that is as diverse as possible in order to maintain a high recognition rate. In each stage, each speaker records each word 5 to 10 times.

[0020] As shown in Fig. 3, the preprocessing and feature extraction (1052) processes the training data as follows: sampling (1052A), quantization (1052B), framing (1052C), windowing (1052D), pre-emphasis (1052E), endpoint detection (1052F), feature extraction (1052G) and cepstral mean subtraction (1052H).

[0021] The training data (1051) is first sampled (1052A) and quantized (1052B). The digitized speech signal is a time-varying signal, but it is quasi-stationary over short intervals of 10 ms to 30 ms. To obtain short-time speech signals, a windowing operation (1052D) is applied: the window function slides smoothly along the speech signal and divides it into frames. Framing (1052C) can use contiguous segments or overlapping segments; the overlap is called the frame shift and is usually chosen as half the window length. The Hamming window is chosen as the window function, i.e.

w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and w(n) = 0 otherwise, where N is the window length.

For example, at a sampling rate of 8 kHz the Hamming window length is taken as 128, so the frame length is 128 samples and the frame shift is 64 samples. The speech signal is then pre-emphasized (1052E). Because the average power spectrum of the speech signal is shaped by the glottal excitation and by lip and nostril radiation, the high-frequency end rolls off at about -6 dB/octave above roughly 800 Hz, so pre-emphasis (1052E) is applied during preprocessing. Its purpose is to boost the high frequencies by 6 dB/octave so that the spectrum of the signal becomes flatter, which facilitates spectral analysis and vocal-tract parameter analysis. The pre-emphasis filter is usually first-order, of the form

H(z) = 1 - a·z^(-1),

where a is close to 1, with typical values between 0.94 and 0.97. After pre-emphasis (1052E), endpoint detection (1052F) determines the starting and ending frames of the speech segment, ensuring that the speech data captured by the sound collection device is a complete utterance; feature extraction (1052G) then follows. Mel-frequency cepstral coefficients (MFCCs) are chosen as the speech features. MFCC parameters combine the auditory perception characteristics of the human ear with the speech production mechanism, so they reflect the characteristics of the acoustic signal better and improve the recognition rate and the noise robustness of the system. For each frame, multi-dimensional features such as the frame energy, the MFCCs and their first-order differences can be extracted for training the reference templates (1053); the MFCCs obtained directly are static features, and the corresponding dynamic features are obtained by differencing. After the MFCC features are extracted, they are processed with cepstral mean subtraction (1052H).
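A minimal illustrative sketch of this front end in Python with NumPy (not the patent's implementation): the 8 kHz rate, 128-sample frame and 64-sample shift follow the example above, while the random test signal and the function names are assumptions. A full MFCC computation would then apply a mel filterbank, a logarithm and a DCT to each windowed frame and append the first-order differences.

```python
import numpy as np

def preemphasis(x, a=0.97):
    """First-order pre-emphasis H(z) = 1 - a*z^-1, with a typically 0.94-0.97."""
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(x, frame_len=128, frame_shift=64):
    """Split the signal into overlapping frames and apply a Hamming window
    (8 kHz sampling, 128-sample frames, 64-sample shift as in the text)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * window

signal = np.random.randn(8000)                  # stands in for 1 s of speech
frames = frame_and_window(preemphasis(signal))
print(frames.shape)                             # (n_frames, 128)
```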

[0022] It is well known that the performance of a speech recognition system drops markedly when it is used in a test environment different from its training environment. Although this degradation is usually attributed to additive noise in the environment, channel distortion caused by differences in the transmission channel also strongly affects the short-time spectrum of the speech signal. Since the template matching of a speech recognition system depends, directly or indirectly, on short-time spectral analysis of the speech signal, the recognition system is also clearly affected by transmission-channel distortion, which degrades its performance. In the new security monitoring system provided by the invention, data is transmitted over wired channels; once the transmission line is fixed, the channel parameters are also fixed. To overcome the influence of the different lines and channels of different devices, cepstral mean subtraction (1052H) is used here to resolve the mismatch between the channel of the training environment and that of the actual operating environment, which would otherwise lower the recognition rate. The principle of CMS is explained in detail below:

Let the signal y[n] be the output of a signal x[n] passed through a filter h[n]. In the cepstral domain the effect of the filter is represented by a vector h, each element of which is

h_k = log |H(ω_k)|, k = 1, ..., B,

where B is the number of mel bands and |H(ω_k)| is the magnitude of the frequency response of h[n] in the k-th band. In the time domain the filter acts on the signal by convolution; after transforming to the frequency domain the convolution becomes a multiplication, and after taking the logarithm it further becomes an addition, so for each frame t

y_t = x_t + h.

Therefore the sample mean ȳ is

ȳ = (1/T) Σ_{t=1}^{T} y_t = x̄ + h.

The CMS method subtracts ȳ from every vector y_t, yielding the corrected cepstral vector ŷ_t:

ŷ_t = y_t - ȳ = x_t - x̄.

Thus, after CMS, ŷ_t equals the mean-normalized speech cepstrum x_t - x̄, which shows that CMS suppresses the effect of channel distortion. This completes the preprocessing and feature extraction (1052).
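Because the channel term h is constant over an utterance, CMS amounts to subtracting the per-utterance mean from the cepstral features. A minimal sketch, assuming an (n_frames, n_coeffs) MFCC matrix per utterance:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance cepstral mean, removing the constant
    channel term h from every frame (CMS, step 1052H)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Usage: features = cepstral_mean_subtraction(mfcc_matrix)
```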

[0023] After the preprocessing and feature extraction (1052), the processed training data (1051) can be trained into reference templates (1054); the trained reference template is a set of models for the sensitive words. Training uses continuous-density Gaussian-mixture HMMs with 8 states, each state described by a linear combination of 3 single Gaussian distributions that models the distribution of the frame features in the feature space. After reference template training (1053), the reference templates (1054) are finally obtained.
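A sketch of this per-word training step, assuming the third-party hmmlearn library and diagonal covariance matrices (neither of which is specified by the patent):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_word_model(feature_list):
    """feature_list: list of (n_frames, n_features) MFCC arrays, one per
    recording of the same sensitive word by the training speakers."""
    X = np.concatenate(feature_list)
    lengths = [len(f) for f in feature_list]
    # 8 states, 3 Gaussians per state, as described in paragraph [0023]
    model = GMMHMM(n_components=8, n_mix=3,
                   covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model
```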

[0024] Because building the library requires training with speech samples from a large number of people, a large amount of computation is inevitable; moreover, the samples must be diverse and complete enough to cope with the randomness and variability of speakers and to guarantee the recognition rate in all situations. Building the library therefore takes a long time and requires a certain accumulation of samples. In traditional training algorithms, the straightforward way to make the trained models reflect newly added data is to retrain on the old and new data together, which completely discards the previous training effort and the already trained HMMs; with each successive data expansion this places a heavy burden on library building. To facilitate later data expansion, and so that the monitoring method of the invention can be adapted to different monitoring scenarios by updating the reference templates (1054) easily, the invention uses a model merging and re-estimation method to solve this problem. The principle is as follows:

Assume that the training data D0 is the original data set, and that the HMM trained on D0 with the Baum-Welch algorithm is λ0 = (π, A, B), determined by the three parameters initial state probabilities (π), state transition probability matrix (A) and observation probability matrix (B). From the training process it is clear that λ0 reflects the sample characteristics of D0. If a new training data set D1 is added, its model function is λ1, which likewise reflects the sample characteristics of D1. The state transition counts, the output vector counts and the state counts of the two models are computed separately, and the new model parameters after the iteration are obtained by adding the corresponding numerators and denominators; that is, the Baum-Welch re-estimation formulas are modified by arithmetic on the parameters of the two models, so that the new model λ reflects the characteristics of both D0 and D1. The Baum-Welch re-estimation formulas can then be rewritten, for the transition probabilities for example, as

â_ij = (N⁽⁰⁾_ij + N⁽¹⁾_ij) / (Σ_k N⁽⁰⁾_ik + Σ_k N⁽¹⁾_ik),

with analogous expressions for the initial state probabilities and the observation probabilities, where N⁽⁰⁾ and N⁽¹⁾ are the accumulated counts (the numerators and denominators of the standard re-estimation formulas) obtained from D0 and D1 respectively.

Clearly, when this method is used, the HMM must store not only the parameters λ = (π, A, B) but also the corresponding transition counts, vector counts and state counts, six parameters in all. In this way the HMM gains good extensibility and the ability to learn incrementally: when new training data is added, the model finally produced in this way reflects the newly added data. The model parameter merging and re-estimation method greatly reduces the computation needed for model training when new data is added, and the merging has little effect on the recognition rate. It remedies the drawback of traditional model training, in which replacing the vocabulary requires collecting a large amount of data and retraining from scratch, and greatly increases the flexibility and practicality of the new security monitoring system.
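The merging idea can be sketched as follows: keep the numerator and denominator accumulators of the standard re-estimation formulas for each data set, add them element-wise, and re-normalize. The dictionary layout and array names below are assumptions for illustration only.

```python
import numpy as np

def merge_accumulators(acc0, acc1):
    """Element-wise sum of Baum-Welch accumulators gathered on D0 and D1
    (each dict maps a name such as 'trans_num' to a count array)."""
    return {key: acc0[key] + acc1[key] for key in acc0}

def reestimate_transitions(acc):
    """New transition matrix: merged numerators divided by merged
    denominators, i.e. counts from both data sets, normalized per state."""
    numerator = acc["trans_num"]                # (n_states, n_states)
    denominator = acc["trans_den"][:, None]     # (n_states, 1)
    return numerator / np.maximum(denominator, 1e-12)

# merged = merge_accumulators(acc_D0, acc_D1)
# A_new = reestimate_transitions(merged)
```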

[0025] The reference templates (1054) of the speech library enrollment part are trained separately from each speaker's speech samples, yielding multiple reference templates, each of which is a set of HMMs for the words. Finally the reference templates are merged into one by the model merging and re-estimation method, which both reduces the training computation for new data and achieves speaker-independent recognition. When the speech library enrollment part is finished, the speech library is complete.

[0026] When the system is running, as shown in Fig. 1, the camera device (101) films the monitored scene in real time; the captured analog signal is transmitted over the corresponding channel to the video signal encoding module (102), which first converts the analog signal into a digital signal and then converts the format of the encoded video data. The video data is transmitted over the LAN or private network (301) to the monitoring room; on the one hand the video data describing the remote scene is displayed on the monitoring picture display (103) and, if careful analysis is needed, the picture is switched to the main monitoring screen (401); on the other hand the format-converted video data is stored on a hard disk by a microcomputer as a backup. At the same time, the sound collection device (104) captures the speech data into the system in real time while the camera device (101) is working; at this point the speech signal is still analog. The analog speech signal is transmitted to the speech processing and recognition module (105) over a line separate from the analog video signal captured by the camera device (101).

[0027] The part of the speech processing and recognition module (105) that works at this point is the pattern recognition part. As shown in Fig. 2, the preprocessing and feature extraction (1056) of the pattern recognition part is functionally identical to the preprocessing and feature extraction (1052) of the speech library enrollment part: the audio signal captured by the sound collection device (104) is processed by sampling (1052A), quantization (1052B), framing (1052C), windowing (1052D), pre-emphasis (1052E), endpoint detection (1052F), feature extraction (1052G) and cepstral mean subtraction (1052H), and is then sent to template matching (1057) to be matched against the reference templates (1054) of the speech library enrollment part; after matching, speech recognition (1058) selects the template closest to the input as the recognition result.

[0028] The recognition algorithm usually uses the classical Viterbi algorithm, which not only finds an "optimal" state path but also gives the output probability corresponding to that path. In speech recognition, the Viterbi algorithm solves the problem of determining the best state sequence given an observation sequence (the extracted speech features) and the model parameters (the trained speech keyword models). After the output probability of the best sequence has been obtained for each trained model, the model with the largest probability is selected as the recognition result. Because any speech feature data passed through a model will produce a best sequence together with its output probability, every kind of speech in the scene will yield some recognition result when matched against the speech library; therefore, after the speech recognition engine, a rejection algorithm must be used to ignore the influence of this unrelated speech data, and the warning decision module (106) solves this problem.
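A compact log-domain Viterbi decoder of the kind described here might look as follows (a generic sketch, not the patent's code); it returns the best state path and its score for one model, and the keyword whose model yields the highest score is taken as the recognition result.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: (S,) initial log-probabilities; log_A: (S, S) transition
    log-probabilities; log_B: (T, S) per-frame observation log-likelihoods.
    Returns the best state path and its log-probability."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    backpointer = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: arrive at j from i
        backpointer[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = backpointer[t + 1, path[t + 1]]
    return path, delta.max()
```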

[0029] After the series of processing in the speech processing and recognition module (105), the recognition result of the speech signal captured from the monitored scene by the sound collection device (104) is sent to the warning decision module (106). If the speech data is one of the signal warning words, such as "help" or "they're fighting", the warning decision module (106) drives the warning indication device (107) and generates a warning control signal that triggers the picture switching device (501). The picture switching device (501) then switches the monitored picture on the monitoring picture display (103) to the main monitoring screen (401). If the speech data is not a warning word, the warning decision module (106) discards the recognition result, all monitoring screens continue to operate as before, and the speech processing and recognition module (105) goes on to process the next captured segment of speech data.

[0030] The decision algorithm of this design uses a rejection method based on anti-word models. In a practical system a corresponding anti-word model is trained for each sensitive-word model; the anti-word model is trained mainly from speech data that is easily confused with the keyword, and it has the same HMM structure as the sensitive-word model, i.e. the same number of states and the same number of mixtures per state. According to the Neyman-Pearson criterion, for the two hypotheses H0 and H1, given the two probability distributions P(O | H0) and P(O | H1), the confidence of the recognition result is evaluated by the test

log P(O | H0) - log P(O | H1) > T_m,

where T_m is the threshold. In this system, the likelihood of the observed data given the recognized keyword model is used as the estimate of P(O | H0), and the likelihood of the observed data given a general "opposite" model (the anti-word model) is used as the estimate of P(O | H1). The value of the test statistic is computed; if it exceeds the predetermined threshold, the recognition result is considered correct, the warning device drives the alarm, and a warning control signal is generated to trigger the switching of the monitored picture; otherwise the result is ignored.
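The anti-word rejection test then reduces to comparing a log-likelihood ratio with the threshold T_m; in the sketch below, a score() method returning a log-likelihood (as in hmmlearn) is an assumption.

```python
def keyword_accepted(obs, keyword_model, anti_model, threshold):
    """obs: (n_frames, n_features) features of one detected utterance.
    Accept the keyword only if its model beats the matching anti-word
    model by more than the threshold (Neyman-Pearson style test)."""
    llr = keyword_model.score(obs) - anti_model.score(obs)
    return llr > threshold
```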

[0031] To describe the speech-channel workflow more clearly, its flowchart is given in Fig. 4, and the specific steps are as follows:

Step S101: the speech template library has been established and the system starts working;

Step S102: the recording device is initialized;

Step S103: speech from the monitored scene is captured, then sampled, quantized and framed;

Step S104: endpoint detection determines whether a complete speech segment has been detected and gives its starting and ending frames; if a complete segment has been detected, the next step is performed, otherwise step S103 continues;

Step S105: once a complete utterance has been detected, the device stops recording;

Step S106: features are extracted from the captured speech data and CMS is applied;

Step S107: the feature vectors are matched against the templates in the speech template library to obtain the speech recognition result;

Step S108: the rejection algorithm judges the recognition result; if the result is a word contained in the speech template library, the next step is performed, otherwise the flow returns to step S102 for the next detection;

Step S109: the warning module raises an alarm, realizing the active early warning function, and generates a warning control signal;

Step S110: the warning control signal triggers the picture switching device and the main monitoring screen switches automatically; after the switching is completed, the flow returns to step S102 and continues until the system is stopped.
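Purely for illustration, the S101-S110 loop for one monitored channel could be organized as below; every helper function is a hypothetical placeholder rather than an API defined by the patent.

```python
def speech_channel_loop(channel_id, models, anti_models, threshold, stop_event):
    """Sketch of the S101-S110 speech-channel workflow for one channel."""
    while not stop_event.is_set():                  # S101: library built, running
        audio = capture_utterance(channel_id)       # S102-S105: record until a
                                                    # complete utterance is found
        feats = extract_features(audio)             # S106: MFCC + CMS
        word = best_matching_word(feats, models)    # S107: Viterbi matching
        if keyword_accepted(feats, models[word],    # S108: anti-word rejection
                            anti_models[word], threshold):
            raise_alarm(channel_id, word)           # S109: active early warning
            switch_main_screen_to(channel_id)       # S110: picture switching
```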

[0032] The monitoring method provided by the invention is further explained below with an example. As shown in Fig. 5 and Fig. 6, the invention is applied to the monitoring of vertical elevators. Two of the many monitored scenes are selected here to explain the monitoring method of the invention. Scene 1 shown in Fig. 5 and Scene 2 shown in Fig. 6 both reflect the actual operation of vertical elevators. In Scene 1 a camera device (101) and a sound collection device (104) are installed; in Scene 2 a camera device (201) and a sound collection device (204) are installed. Before the system is put into operation, a speech template library is built from a class of speech words related to calling for help, including "help", "save me", "somebody come" and so on.

[0033] In Scene 1 the elevator is working normally and the passengers are chatting. The camera device (101) films Scene 1 in real time and transmits the captured video information over the corresponding channel to the video signal encoding module (102), which encodes it and converts the format of the video stream. After format conversion the video data is transmitted over the LAN or private network (301) to the monitoring room; on the one hand the video picture of Scene 1 is displayed on the monitoring picture display (103), and on the other hand the format-converted video data is stored on a hard disk by a microcomputer as a backup. The sound collection device (104) captures the speech of that moment into the system while the camera device (101) is working; the speech signal is transmitted to the speech processing and recognition module (105) over a line separate from the one used by the camera. After the series of processing in the speech processing and recognition module (105), the recognition result of the speech signal of Scene 1 captured by the sound collection device (104) is sent to the warning decision module (106). After applying the rejection test, the module judges that this speech data does not contain any of the signal warning words in the speech library, so the warning decision module (106) suppresses the warning and neither drives the warning indication device (107) nor sends a warning control signal. This indicates that everything is normal in the monitored elevator of Scene 1, and the operators can work as usual.

[0034] In the elevator of Scene 2, the elevator breaks down and a passenger calls for help. The camera device (201) films the scene and transmits the captured video information to the video signal encoding module (202), which likewise encodes it and converts the data format. The video data is transmitted over the LAN or private network (301) to the monitoring room, displayed on the monitoring picture display (203) and stored on a hard disk; without an active prompt, the operators would have to maintain a high level of attention to spot this incident among the many monitoring screens and take measures. Meanwhile, the sound collection device (204) captures the call for help into the system while the camera device (201) is working and transmits it to the speech processing and recognition module (205). After the data processing in the speech processing and recognition module (205), the recognition result of the call-for-help speech signal from the elevator of Scene 2, captured by the sound collection device (204), is sent to the warning decision module (206). After applying the rejection algorithm, the warning decision module (206) judges that the call-for-help speech is contained in the speech library built in advance and belongs to the warning words, so it raises a warning for the call for help, sends a warning control signal to drive the warning indication device (207), and triggers the picture switching device (501). Through the prompt of the warning indication device (207), the operators learn that an accident has occurred in the elevator of Scene 2. At the same time, the picture switching device (501) switches the picture of Scene 2 from the monitoring picture display (203) to the main monitoring screen (401) for further analysis and handling by the operators.

[0035] Besides the vertical elevator mentioned in the example, the monitoring method provided by the invention can also be used in scenes such as prisons, parking lots and corridors, and can serve at night as an auxiliary means of monitoring a scene or area.

[0036] In the monitoring method provided by the invention, the active early warning based on audio information compensates for the limited viewing angle of the monitoring equipment and for the influence of lighting, weather and other natural conditions on video monitoring. The intelligent picture switching driven by the warning control signal solves the problems of human-eye inspection, prevents the attention loss caused by operators concentrating on multiple screens for long periods, and greatly improves monitoring efficiency. Its sensible architecture makes it easy to update and upgrade traditional video monitoring systems; its independent speech template library allows the system to be applied in multiple settings; and its user-friendly monitoring approach greatly improves the efficiency of video monitoring while reducing the manpower, material and financial resources needed to employ large numbers of operators.

Claims (7)

  1. An intelligent video monitoring method based on speech recognition technology, characterized in that speech recognition technology is introduced into video monitoring as an auxiliary monitoring means; the video signal and the audio signal are processed independently; speech recognition technology is used to process the speech data of the monitored scene and to raise an alert for the sensitive words it contains, so that the monitoring system achieves active early warning, and the warning signal triggers automatic switching of the monitored picture, giving the monitoring system active early warning and intelligent picture switching functions; the method comprises the following six steps, executed in a loop: (1) a speech template library is established in advance for the sensitive words that may occur in the monitored scene; several dozen speakers are required for recording, and speech samples from multiple male and female speakers are collected as training data, whose content can be chosen according to the monitored scene; the training data is recorded in stages, and the training corpus consists essentially of isolated words and short sentences; (2) after the library has been built, the system starts working, and the video and speech information of the monitored scene is collected by the sound collection device and the camera device respectively; (3) the video signal is encoded and converted in format, and the processed video data is sent over a private network or LAN to the monitoring room for display and storage; the speech information is sent to the speech processing and recognition module, which performs a chain of processing followed by speech recognition; (4) the speech recognition result is sent to the warning decision module for judgment; normal speech occurring in the scene that is not included in the speech library is rejected, while for sensitive words that occur in the monitored scene and are contained in the speech library the warning module generates a warning control signal and raises an alarm, realizing the active early warning function; (5) the warning control signal triggers the picture switching device, and the main monitoring screen switches to the monitored picture that matches the warning control signal for the operators to analyze, realizing intelligent picture switching; (6) after the above detection is completed, steps (2) to (5) are repeated for the next detection.
  2. The monitoring method according to claim 1, characterized in that the speech template library is open: the sensitive words it contains can be specified according to the monitoring needs; recording in stages addresses the large dynamic range of speech and ensures that the library contains speech information as diverse as possible, maintaining a high speech recognition rate; the speech template library is first trained separately from each speaker's samples, yielding multiple reference templates, each of which is a set of hidden Markov models (HMMs) for the words; finally the reference templates are merged into one by the model merging and re-estimation method, completing the library, which both reduces the training computation for new data and achieves speaker-independent recognition.
  3. The monitoring method according to claim 1, characterized in that each HMM stores not only the three parameters initial state probabilities, state transition probability matrix and observation probability matrix, but also the state transition counts, the state output vector counts and the number of states, six parameters in all, the latter three being kept for applying the model merging and re-estimation method.
  4. The monitoring method according to claim 1, characterized in that in the speech processing and recognition module the speech information of the monitored scene is processed in the order sampling, quantization, framing, windowing, pre-emphasis, endpoint detection, speech feature extraction and cepstral mean subtraction (CMS) before speech recognition; processing the speech features with cepstral mean subtraction overcomes the influence on speech recognition of the channel distortion caused by different transmission lines and devices, and resolves the mismatch between the channel of the training environment and that of the actual operating environment that would otherwise lower the recognition rate.
  5. The monitoring method according to claim 1, characterized in that the warning decision module judges the speech recognition result with a rejection method based on anti-word models; in a practical system a corresponding anti-word model is trained for each keyword HMM, mainly from speech data easily confused with the keyword, and the anti-word model has the same structure as the keyword HMM, i.e. the same number of states and the same number of mixtures per state.
  6. The monitoring method according to claim 1, characterized in that the active early warning function raises an alert for sensitive words that occur in the monitored scene and are contained in the speech library, while normal speech occurring in the scene that is not included in the speech library is rejected; when an accident happens in the monitored scene, the persons involved can alert the monitoring center staff directly by speaking.
  7. The monitoring method according to claim 1, characterized in that when the warning decision module judges that a speech segment is a sensitive word in the library, it drives the warning indication device to raise an alarm and at the same time issues a warning control signal that triggers the picture switching device, and the main monitoring screen switches to the monitored picture that matches the warning control signal, realizing intelligent picture switching.
CN 201010598197 2010-12-21 2010-12-21 Intelligent video monitoring method based on voice recognition technology CN102014278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010598197 CN102014278A (en) 2010-12-21 2010-12-21 Intelligent video monitoring method based on voice recognition technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010598197 CN102014278A (en) 2010-12-21 2010-12-21 Intelligent video monitoring method based on voice recognition technology

Publications (1)

Publication Number Publication Date
CN102014278A (en) 2011-04-13

Family

ID=43844266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010598197 CN102014278A (en) 2010-12-21 2010-12-21 Intelligent video monitoring method based on voice recognition technology

Country Status (1)

Country Link
CN (1) CN102014278A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2402082Y (en) * 2000-01-08 2000-10-18 张开明 Domestic automatic safety video monitor
US6704707B2 (en) * 2001-03-14 2004-03-09 Intel Corporation Method for automatically and dynamically switching between speech technologies
DE10257473A1 (en) * 2002-12-09 2004-07-08 Infineon Technologies Ag Computer-based method for expansion of an electronic dictionary for use in speech recognition using hidden Markov model distances for determining if a word is an out of vocabulary value or already exists
CN1835583A (en) * 2005-03-18 2006-09-20 北京富星创业科技发展有限公司 Security monitoring management system and its working method
CN201203916Y (en) * 2008-04-23 2009-03-04 上海承兴实业有限公司 Multimedia ganged safety prevention alarm system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Yong, "Application of speech recognition technology in security monitoring systems", China Master's Theses Full-text Database (Information Science and Technology Series), 2010-04-15, pp. 16-19, 25-28, 43-50; relevant to claims 1-7, category 2 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547248A (en) * 2012-02-03 2012-07-04 深圳锐取信息技术股份有限公司 Multi-channel real-time monitoring single-video-file recording method
CN104144328A (en) * 2014-07-31 2014-11-12 中国人民解放军63908部队 Intelligent video monitoring method
CN104144328B (en) * 2014-07-31 2017-06-16 中国人民解放军63908部队 An intelligent video surveillance method
CN105407316A (en) * 2014-08-19 2016-03-16 北京奇虎科技有限公司 Implementation method for intelligent camera system, intelligent camera system, and network camera
WO2016026446A1 (en) * 2014-08-19 2016-02-25 北京奇虎科技有限公司 Implementation method for intelligent image pick-up system, intelligent image pick-up system and network camera
CN104239046B * 2014-09-05 2017-07-18 河海大学 Software self-adapting method based on hidden Markov model and multi-objective evolutionary algorithm
CN104239046A (en) * 2014-09-05 2014-12-24 河海大学 Software self-adapting method based on HMM (hidden Markov model) and MOEA (multi-objective evolutionary algorithm)
CN104184212B (en) * 2014-09-10 2016-08-31 国网冀北电力有限公司廊坊供电公司 Remote monitoring system for substation communications room
CN104184212A (en) * 2014-09-10 2014-12-03 国家电网公司 Remote monitoring system for communication machine room of transformer substation
CN104505090A (en) * 2014-12-15 2015-04-08 北京国双科技有限公司 Method and device for voice recognizing sensitive words
CN104505090B * 2014-12-15 2017-11-14 北京国双科技有限公司 Method and apparatus for speech recognition of sensitive words
CN104795073A (en) * 2015-03-26 2015-07-22 无锡天脉聚源传媒科技有限公司 Method and device for processing audio data
CN105225420A (en) * 2015-09-30 2016-01-06 中国民用航空总局第二研究所 Air traffic control staff fatigue detection method based on principle component analysis, device and system
CN105354830A (en) * 2015-09-30 2016-02-24 中国民用航空总局第二研究所 Method, apparatus and system for controller fatigue detection based on multiple regression model
CN105391708A (en) * 2015-11-02 2016-03-09 北京锐安科技有限公司 Audio data detection method and device
CN107222705A (en) * 2017-05-27 2017-09-29 山东中磁视讯股份有限公司 Conversation recording system

Similar Documents

Publication Publication Date Title
Oliver et al. Layered representations for learning and inferring office activity from multiple sensory channels
Lu et al. Speakersense: Energy efficient unobtrusive speaker identification on mobile phones
US20120303369A1 (en) Energy-Efficient Unobtrusive Identification of a Speaker
Istrate et al. Information extraction from sound for medical telemonitoring
CN101299812A (en) Method, system for analyzing, storing video as well as method, system for searching video
CN103198605A (en) Indoor emergent abnormal event alarm system
CN101404107A (en) Internet bar monitoring and warning system based on human face recognition technology
CN102324232A (en) Voiceprint identification method based on Gauss mixing model and system thereof
Stern et al. Hearing is believing: Biologically-inspired feature extraction for robust automatic speech recognition
CN101188743A (en) An intelligent digital system based on video and its processing method
Lu et al. Real-time unsupervised speaker change detection
Nordqvist et al. An efficient robust sound classification algorithm for hearing aids
US20060195316A1 (en) Voice detecting apparatus, automatic image pickup apparatus, and voice detecting method
CN102280106A (en) Voice network search method and apparatus for a mobile communication terminal
CN101364408A (en) Sound image combined monitoring method and system
Fleury et al. Sound and speech detection and classification in a health smart home
US20100202670A1 (en) Context aware, multiple target image recognition
CN101753992A (en) Multi-mode intelligent monitoring system and method
Das et al. Recognition of isolated words using features based on LPC, MFCC, ZCR and STE, with neural network classifiers
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN102824092A (en) Intelligent gesture and voice control system of curtain and control method thereof
CN101345668A (en) Control method and apparatus for monitoring equipment
US20080215318A1 (en) Event recognition
CN102348101A (en) Examination room intelligence monitoring system and method thereof
Huang et al. Scream detection for home applications

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C02 Deemed withdrawal of patent application after publication (patent law 2001)