CN103700370B - Radio and television speech recognition method and system - Google Patents

Radio and television speech recognition method and system


Publication number
CN103700370B
Authority
CN
China
Prior art keywords
voice
data
mark
identification
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310648375.4A
Other languages
Chinese (zh)
Other versions
CN103700370A (en)
Inventor
陈鑫玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING PATTEK Co Ltd
Original Assignee
BEIJING PATTEK Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING PATTEK Co Ltd filed Critical BEIJING PATTEK Co Ltd
Priority to CN201310648375.4A priority Critical patent/CN103700370B/en
Publication of CN103700370A publication Critical patent/CN103700370A/en
Application granted granted Critical
Publication of CN103700370B publication Critical patent/CN103700370B/en


Abstract

The invention discloses a radio and television speech recognition method and system. The method includes: extracting audio data from broadcast television data; preprocessing the audio data to obtain feature text data; sending the feature text data to a cloud server for recognition processing to obtain gender (male/female voice) identification, speaker identification, and speech recognition results; and merging the preprocessing results, gender identification, speaker identification, and speech recognition results with structured text annotation to generate a structured speech recognition result. The method improves on existing speech recognition approaches by combining broadcast-television-specific preprocessing techniques with a speech recognition method tailored to the data-processing needs of the broadcast television industry: the individual recognition results are merged into a structured speech recognition result that can serve as base data for the subsequent intelligent processing of other broadcast television services, while increasing processing speed and improving accuracy.

Description

Radio and television speech recognition method and system
Technical field
The present invention relates to the field of audio and video processing technology, and in particular to a radio and television speech recognition method and system.
Background art
At present, speech recognition in the broadcast television field mainly relies on conventional speech recognition methods designed for general use across industries. Traditional speech recognition is based chiefly on pattern matching and is divided into a training stage and a recognition stage. In the training stage, the user reads or speaks each word in the vocabulary in turn, and its feature vector is stored in a template library as a template. In the recognition stage, the feature vector of the input speech is compared for similarity against each template in the template library in turn, and the template with the highest similarity is output as the recognition result.
Applying such speech recognition in the broadcast television field, however, raises the following problems:

1) The broadcast television industry often places special demands on speech recognition that differ from those of other industries. Because the traditional speech recognition described above is designed for use across industries, it has no provision specific to broadcast television and cannot filter non-speech content out of broadcast television data according to the characteristics of the industry. In this industry, non-speech content falls outside the scope of speech recognition, so if it is not filtered out it must still be transmitted and processed; this not only wastes transmission and computing resources but also causes additional misrecognition and slows down processing.

2) Because traditional speech recognition technology lacks functions specific to broadcast television, its recognition results are incomplete. For a given segment of broadcast television data it cannot determine important information such as the speaking scene or the identity of the speaker, cannot segment the speech content by speaker, and cannot identify the timestamp of each spoken word; it therefore provides no useful reference information for the subsequent intelligent, automated processing of other broadcast television services.

In summary, applying traditional speech recognition methods in the broadcast television industry wastes resources, is slow, is insufficiently accurate, and provides too little information.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to perform speech recognition tailored to the characteristics of the broadcast television industry, avoiding the shortcomings of conventional speech recognition methods when applied in that industry, and providing sufficient usable base data for the subsequent intelligent, automated processing of other broadcast television services.
(2) technical scheme
To solve the above technical problem, the invention provides a radio and television speech recognition method, including:

S1, extracting audio data from broadcast television data;

S2, preprocessing the audio data to obtain feature text data;

S3, sending the feature text data to a cloud server for recognition processing to obtain gender identification, speaker identification, and speech recognition results;

S4, merging the preprocessing results, gender identification, speaker identification, and speech recognition results and applying structured text annotation to generate a structured speech recognition result.
Further, the preprocessing of the audio data in step S2 specifically includes:

S21, cutting and fragmenting the audio data to generate a number of sentence files;

S22, filtering non-speech content from the sentence files, leaving speech sentence files;

S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals;

S24, performing audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, wherein the feature text data includes the start and end times of each speech sentence, the speech feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband identifier.
Further, the recognition processing performed after the feature text data is sent to the cloud server in step S3 includes gender identification, speaker identification, speech content recognition, and punctuation recognition, generating a speech recognition result containing identifiers.
Further, the merging and structured text annotation of the speech recognition results in step S4 specifically includes:

S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain;

S42, marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps.
Further, the recognition processing of step S3 is performed according to a language model library, and the language model library is continuously updated through network text collection and network text learning.
To solve the above technical problem, the invention also provides a radio and television speech recognition system, including:

an extraction unit, which extracts audio data from broadcast television data;

a preprocessing terminal, which preprocesses the audio data to obtain feature text data and sends it to a cloud server;

a cloud server, which performs recognition processing on the feature text data to obtain speech recognition results, merges the recognition results, and applies structured text annotation to generate a structured speech recognition result.
Further, the preprocessing terminal includes:

a cutting module, which cuts and fragments the audio data to generate a number of sentence files;

a non-speech filtering module, which filters non-speech content from the sentence files, leaving speech sentence files;

a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals;

an audio feature extraction module, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, wherein the feature text data includes the start and end times of each speech sentence, the name of the audio/video file to which it belongs, and the corresponding wideband/narrowband identifier.
Further, the cloud server includes:

a gender identification module, for performing gender (male/female voice) identification on the feature text data;

a speaker identification module, for performing speaker identification on the feature text;

a speech content and punctuation recognition module, for performing speech content recognition and punctuation recognition on the feature text, generating a speech recognition result containing punctuation identifiers;

a recognition result processing module, which merges the speech recognition results and applies structured text annotation to generate a structured speech recognition result.
Further, the recognition result processing module further includes:

a collection and ordering module, for collecting and aligning the individual recognition results and sorting them by the start and end times they contain;

an annotation module, for marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps.
Further, the cloud server also includes a language model intelligent learning module, for periodically crawling network text and regularly updating the language model library by learning from it; during recognition processing, recognition is performed according to the regularly updated language model library.
(3) Beneficial effects
Embodiments of the invention provide a radio and television speech recognition method and system. The method includes: extracting audio data from broadcast television data; preprocessing the audio data to obtain feature text data; sending the feature text data to a cloud server for recognition processing to obtain gender identification, speaker identification, and speech recognition results; and merging the preprocessing results, gender identification, speaker identification, and speech recognition results with structured text annotation to generate a structured speech recognition result. Based on cloud computing, the method improves on existing speech recognition methods by combining broadcast television data preprocessing techniques, gender identification technology, speaker identification technology, and a radio and television speech recognition method. Speech data is first preprocessed and then recognized specifically according to the data-processing needs of the broadcast television industry; the preprocessing results, gender identification results, speaker identification results, and speech recognition results are merged and annotated as structured text to generate a structured speech recognition result. This can supply base data for later intelligent processing functions of broadcast television programs such as speech retrieval, subtitle recognition, and presenter identification, while increasing the processing speed of radio and television speech recognition and improving its accuracy.
The base data provided for the subsequent intelligent, automated processing of other broadcast television services specifically includes the following:

1) the speech recognition results and the timestamp annotations of the spoken words can provide base data for retrieval services over radio and television speech content;

2) the cut-point annotations of the speech sentences and the wideband/narrowband discrimination results can provide boundary time-point references for splitting broadcast television programs;

3) the recognition of speech content and punctuation in radio and television can provide content references for subtitle recognition in broadcast television programs;

4) the speaker identification of the speech sentences and the wideband/narrowband discrimination results can provide a basis for presenter identification, guest identification, and speaking-scene recognition (indoor scene, outdoor scene) in broadcast television programs.
Brief description of the drawings
Fig. 1 is a step flow chart of a radio and television speech recognition method provided by embodiment one of the present invention;
Fig. 2 is a step flow chart of the preprocessing operation provided by embodiment one;
Fig. 3 is a schematic diagram of the technical framework of the audio classification method used for speech/non-speech discrimination in embodiment one;
Fig. 4 is a detailed flow chart of performing speech recognition on broadcast television data in embodiment one;
Fig. 5 is a schematic diagram of the composition of a radio and television speech recognition system provided by embodiment two of the present invention;
Fig. 6 is a schematic diagram of the composition of the preprocessing terminal in embodiment two;
Fig. 7 is a schematic diagram of the composition of the cloud server in embodiment two;
Fig. 8 is a workflow chart of the speech content and punctuation recognition module in embodiment two;
Fig. 9 is an architecture diagram of the cloud service platform in embodiment two.
Detailed description of the invention
The specific embodiments of the present invention are described in further detail below in conjunction with the accompanying drawings and examples. The following examples serve to illustrate the present invention but do not limit its scope.
Embodiment one
Embodiment one of the present invention provides a radio and television speech recognition method whose step flow is shown in Fig. 1 and specifically includes the following steps:

Step S1, extracting audio data from broadcast television data.

Step S2, preprocessing the audio data to obtain feature text data.

Step S3, sending the feature text data to a cloud server for recognition processing to obtain gender identification, speaker identification, and speech recognition results.

Step S4, merging the preprocessing results, gender identification, speaker identification, and speech recognition results and applying structured text annotation to generate a structured speech recognition result.

The method first extracts audio data from the broadcast television data to be recognized (i.e. the audio/video data supplied by the user) and preprocesses it to obtain feature text data; the cloud server then performs recognition processing on the feature text data; finally, the preprocessing results, gender identification, speaker identification, and speech recognition results are merged and annotated as structured text to generate a structured speech recognition result, which is returned to the user in the Extensible Markup Language (XML). Adding word timestamps, sentence timestamps, gender identifiers, speaker identifiers, and the like to the speech recognition result provides a basis for retrieval of radio and television speech content, subtitle recognition, presenter identification, and so on, facilitating the subsequent intelligent, automated processing of other broadcast television services and supplying base data for all kinds of operations and processing.
Preferably, this embodiment also includes, before step S1: receiving the broadcast television data sent by the user, where the broadcast television data comprises audio/video data, which can be understood as audio data plus video data. After the broadcast television data is received, it must first be determined whether it is of an audio/video data type supported by the speech recognition system; if it is unsupported, that is, unrecognizable, audio/video data, processing is refused.
Audio/video decoding in this embodiment uses the G.711 codec standard and the ffmpeg software decoding tool to decode the audio/video; the extracted audio portion is saved in PCM format. The method is compatible with the current mainstream radio and television audio/video formats, such as wmv, wma, wav, mp3, asf, rm, mp4, avi, and flv. If the data is judged to be recognizable audio/video data, it is decoded, the data belonging to the audio portion is extracted from it, and the resulting audio data becomes the pending input of step S2.
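For illustration, this decoding and extraction step can be driven from a script around the ffmpeg command-line tool; a minimal sketch follows, writing a 16-bit mono PCM WAV file (the file names and the 16 kHz sampling rate are assumptions, not values fixed by the embodiment):

```python
import subprocess

def extract_audio(av_path: str, pcm_path: str, sample_rate: int = 16000) -> None:
    """Decode the audio track of a broadcast file and save it as 16-bit PCM.

    ffmpeg handles the container formats listed above (wmv, wma, wav, mp3,
    asf, rm, mp4, avi, flv); -vn drops the video stream entirely.
    """
    subprocess.run(
        ["ffmpeg", "-y", "-i", av_path,
         "-vn",                    # ignore the video stream
         "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
         "-ar", str(sample_rate),  # resample to the target rate
         "-ac", "1",               # mono
         pcm_path],
        check=True,
    )

extract_audio("program.mp4", "program.wav")
```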
Preferably, step S2 of this embodiment preprocesses the audio data. The preprocessing mainly consists of cutting and fragmenting the data according to a standard suited to speech recognition, performing speech/non-speech and wideband/narrowband discrimination and identification on the fragmented sentence files, and finally extracting feature text data containing the speech features. The step flow of the preprocessing operation is shown in Fig. 2 and specifically includes the following steps:
Step S21, cutting and fragmenting the audio data to generate a number of sentence files.

Because the received audio data is a relatively complete data block, it needs to be cut and fragmented to generate a number of small sentence files suited to processing by the speech recognition system. The specific cutting process is as follows:

First, the audio data is parsed and the energy value of each audio sample point is analyzed to find silent positions. In this embodiment the silence threshold is 50 frames of 200 sample points each; when this threshold is exceeded, the position is judged to be silent. After the silent positions are found, the audio data is cut at them, i.e. fragmented into discrete sentence files, each sentence file is given a time annotation, and the resulting sentence files are saved in PCM format.
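A minimal sketch of this energy-based cutting, assuming float samples normalized to [-1, 1]; the 200-sample frames and the 50-frame run follow the embodiment, while the energy threshold itself is an assumed value to be tuned on real material:

```python
import numpy as np

FRAME_LEN = 200          # samples per frame, as in this embodiment
SILENT_RUN = 50          # frames; a run this long marks a cut point
ENERGY_THRESHOLD = 1e-4  # assumed value, tuned against real material

def cut_on_silence(samples: np.ndarray, sr: int):
    """Split normalized float samples at long silent runs.

    Returns a list of (start_sec, end_sec, segment) tuples, one per sentence.
    """
    n_frames = len(samples) // FRAME_LEN
    frames = samples[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    energy = (frames ** 2).mean(axis=1)      # per-frame energy
    silent = energy < ENERGY_THRESHOLD

    segments, seg_start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run == SILENT_RUN:                # threshold exceeded: close segment
            end = (i - SILENT_RUN + 1) * FRAME_LEN
            if end > seg_start:
                segments.append((seg_start / sr, end / sr,
                                 samples[seg_start:end]))
        if run >= SILENT_RUN:                # keep skipping the silence
            seg_start = (i + 1) * FRAME_LEN
    if seg_start < len(samples):
        segments.append((seg_start / sr, len(samples) / sr,
                         samples[seg_start:]))
    return segments
```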
Step S22, filtering non-speech content from the sentence files, leaving speech sentence files.

Because step S21 merely cuts the audio data at silent positions, the result still contains a large amount of non-speech content. This content is of no help to the subsequent audio recognition and has no positive effect; on the contrary, its presence increases the transmission and computing load the speech recognition system must bear and causes misrecognition. The generated sentence files therefore need non-speech filtering, i.e. speech/non-speech discrimination of the fragmented sentence files, retaining only the speech sentence files. This step is as follows:

First, each fragmented sentence file is parsed and a classifier performs speech/non-speech discrimination on it according to a speech/non-speech classification model;

Second, according to the discrimination results, the non-speech sentence files are marked for deletion and the time positions of the sentences are recorded.
This embodiment employs an audio classification method based on the support vector machine (SVM). Short segments are first divided into silence and non-silence on the basis of an energy threshold; then, by selecting effective and robust audio features, the non-silent signal is divided into four classes: speech (pure speech, non-pure speech) and non-speech (music, environmental sound). The method has very high classification accuracy and processing speed; its technical framework is shown in Fig. 3.
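The classifier itself is not disclosed in detail; a much-simplified sketch of an SVM audio classifier of this kind, assuming per-segment feature vectors and labelled training data are already available, might look as follows:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Labels follow the four non-silent classes named in the embodiment.
CLASSES = ["pure_speech", "non_pure_speech", "music", "environment"]

def train_audio_classifier(X: np.ndarray, y: np.ndarray):
    """X: (n_segments, n_features) feature matrix; y: class indices 0..3."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(X, y)
    return clf

def is_speech(clf, feats: np.ndarray) -> bool:
    """A segment is kept when the SVM assigns either speech class."""
    label = CLASSES[int(clf.predict(feats.reshape(1, -1))[0])]
    return label in ("pure_speech", "non_pure_speech")
```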
Step S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals.

Wideband/narrowband discrimination is performed on each speech sentence so that the discrimination result can guide the choice of speech recognition model during subsequent recognition. This step is as follows:

First, the speech sentence segments remaining after filtering, which are suited to processing by the speech recognition system, are analyzed one by one to determine whether each speech sentence is wideband (high sampling rate) or narrowband (low sampling rate), providing a reference for choosing the speech recognition model during subsequent speech recognition;

Second, each speech sentence is given a wideband/narrowband identifier, i.e. a wideband identifier is added to the speech sentence files of wideband signals and a narrowband identifier to the speech sentence files of narrowband signals.
Specifically, in this embodiment wideband/narrowband discrimination is made by analyzing the spectral energy of the audio signal: when the spectral energy above 8 kHz is greater than 0.1, the audio signal is wideband; when it is less than or equal to 0.1, the audio signal is a narrowband signal.
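This spectral test can be sketched as follows, reading the 0.1 figure as the fraction of total spectral energy lying above 8 kHz (an interpretation, since the patent does not state the normalization):

```python
import numpy as np

def is_wideband(samples: np.ndarray, sr: int, cutoff_hz: float = 8000.0,
                ratio_threshold: float = 0.1) -> bool:
    """Return True when the energy fraction above 8 kHz exceeds 0.1.

    A narrowband recording sampled at 8 kHz has no content above 4 kHz,
    so its high-band fraction is zero and it is classified as narrowband.
    """
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sr)
    total = spectrum.sum()
    if total == 0.0:
        return False
    high = spectrum[freqs > cutoff_hz].sum()
    return high / total > ratio_threshold
```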
Step S24, performing audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, where the feature text data includes the start and end times of each speech sentence, the speech feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband identifier.

To save network bandwidth, after the wideband/narrowband identifiers are added to the speech sentence files, audio features are extracted and the audio data is converted into text feature data, reducing the volume of data transmitted over the network, as follows:

First, the speech sentence files carrying wideband or narrowband identifiers are analyzed one by one to extract MFCC (Mel-frequency cepstral coefficient) and PLP (perceptual linear prediction) speech features, the two kinds of speech features commonly used in the speech recognition field;

Second, each extracted speech feature is given a time annotation, so that the resulting feature text data contains the start and end times of the speech sentence, the name of the audio/video file it belongs to, and the corresponding wideband/narrowband identifier.
It should be noted that this step not only converts the input speech signal into speech features that are relatively robust and discriminative, for distinguishing different speakers, but also applies certain normalizations on top of the feature extraction, including:

1) cepstral mean normalization (CMN), which mainly reduces channel effects;

2) cepstral variance normalization (CVN), which mainly reduces the effects of additive noise;

3) vocal tract length normalization (VTLN), which mainly reduces the influence of vocal tract differences;

4) Gaussianization, an extension of CMN + CVN;

5) noise-robust algorithms that reduce the effect of background noise on system performance, using the AWF and VTS algorithms.
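A sketch of the extraction with cepstral mean and variance normalization (CMN + CVN), using librosa MFCCs; PLP features and the remaining normalizations (VTLN, Gaussianization, AWF/VTS) would need further tooling and are omitted here:

```python
import librosa
import numpy as np

def extract_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a (n_frames, n_mfcc) matrix of mean/variance-normalized MFCCs."""
    y, sr = librosa.load(wav_path, sr=None)        # keep the native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    mfcc -= mfcc.mean(axis=0)                      # CMN: remove channel bias
    mfcc /= mfcc.std(axis=0) + 1e-8                # CVN: damp additive noise
    return mfcc

def to_feature_text(mfcc: np.ndarray) -> str:
    """Serialize frames as text, one frame per line, for network transfer."""
    return "\n".join(" ".join(f"{v:.4f}" for v in frame) for frame in mfcc)
```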
Preferably, in step S3 of this embodiment the feature text data is sent to the cloud server and the speech recognition flow begins. The cloud server invocation module of this embodiment uses the Web Service interface protocol to send the radio and television task information to be recognized to the server end as an XML message. The XML message of a recognition task contains the following:

1) the name of the radio and television file to be recognized;

2) the fragmented sentence file list;

3) the speech/non-speech identifier of each sentence file;

4) the wideband/narrowband identifier of each sentence file;

5) the speech feature text of each sentence file identified as speech;

6) the start/end time annotation of each sentence file.
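A sketch of assembling such a task message with the Python standard library; the element and attribute names here are hypothetical, since the patent does not publish its XML schema:

```python
import xml.etree.ElementTree as ET

def build_task_message(av_name: str, sentences: list) -> str:
    """sentences: dicts with keys file, is_speech, band, start, end, features."""
    task = ET.Element("RecognitionTask")
    ET.SubElement(task, "SourceFile").text = av_name
    for s in sentences:
        e = ET.SubElement(task, "Sentence", {
            "file": s["file"],
            "speech": "1" if s["is_speech"] else "0",
            "band": s["band"],          # "wide" or "narrow"
            "start": str(s["start"]),
            "end": str(s["end"]),
        })
        if s["is_speech"]:              # features only for speech sentences
            ET.SubElement(e, "Features").text = s["features"]
    return ET.tostring(task, encoding="unicode")
```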
After the cloud server receives the recognition task, it performs recognition processing that includes gender identification, speaker identification, speech content recognition, and punctuation recognition, generating a speech recognition result containing identifiers. This step is as follows:

(1) The speech feature text corresponding to each speech sentence file to be recognized is sent, one file at a time, as XML (Extensible Markup Language) messages to the remote server for radio and television speech recognition processing. Besides the speech feature text data, each XML message also contains the following information: the start and end times of the speech sentence file, the name of the radio and television audio/video file to which it belongs, and its wideband/narrowband identifier;

(2) the speech recognition system in the cloud server is built on a cloud computing architecture; when the feature text of a speech sentence arrives at the radio and television speech recognition cloud, a controller allocates computing resources for the recognition of the speech sentence file in a reasonable way, according to how the computing resources in the cloud servers are currently occupied;

(3) the speech recognition system calls the allocated computing resources to perform gender identification, speaker identification, and speech content and punctuation recognition on the speech features. Gender identification uses a classifier to discriminate and identify the male/female voice class of each sentence according to a gender classification model; speaker identification identifies the speaker of each sentence according to a speaker model library; speech content and punctuation recognition recognizes the speech content of each sentence, marks the punctuation, and time-labels each recognized word.
Preferably, the merging and structured text annotation of the speech recognition results in step S4 of this embodiment specifically includes:

Step S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain. Specifically: the recognition results of the individual speech sentences are merged and collated; for each sentence, the different recognition results (gender identification, speaker identification, speech content and punctuation recognition) are aligned by time point according to the radio and television audio/video file to which the sentence belongs and then sorted in time order.

Step S42, marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps. Specifically: the sorted recognition results are annotated as text in a specific structured format, the annotations including the speaker gender of each sentence file, the speaker, the speech content of the sentence, the timestamp of each word in the sentence, and the punctuation at the sentence breaks.
A structured speech recognition result is ultimately produced and fed back to the user as an XML message, which contains the following:

1) the name of the recognized radio and television file;

2) the fragmented sentence file list;

3) the speech/non-speech identifier of each sentence file;

4) the wideband/narrowband identifier of each sentence file;

5) the speech recognition result of each sentence file;

6) the speaker identifier of each sentence file;

7) the gender identifier of each sentence file;

8) the start/end time annotation of each sentence file.
Preferably, to guarantee the accuracy of speech recognition, the recognition processing of step S3 in this embodiment is performed according to an acoustic model library and a language model library, where the language model library is continuously updated by collecting network text and learning from it. Network text is collected from the Internet periodically, and the language model library is optimized regularly through learning from the network text, as follows:

1) Network text is crawled from the Internet periodically: a web crawler regularly captures page links from the major search engines (e.g. Baidu, Google, Soso, Sogou, Souku) and the major broadcast-television-related portal websites (e.g. CNTV, the websites of local stations, Sina, Sohu), collecting popular vocabulary and web documents.

2) The web documents in the collected network text are word-segmented, and word frequencies and word counts are computed; the segmentation results, network hot-word collection results, and statistics are entered into the language model library of the speech recognition system for the individual recognition modules to consult, achieving regular updates of the language model library and ensuring the accuracy of radio and television speech recognition.
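A sketch of this periodic update, assuming the jieba segmenter for Chinese word segmentation and a plain HTTP fetch standing in for a full crawler (the URL is a placeholder, and real use would strip HTML before segmenting):

```python
import collections
import jieba
import requests

def collect_word_counts(urls: list) -> collections.Counter:
    """Fetch pages, segment the text, and count word frequencies."""
    counts = collections.Counter()
    for url in urls:
        text = requests.get(url, timeout=10).text   # raw page text
        counts.update(w for w in jieba.lcut(text) if w.strip())
    return counts

# Word frequencies feeding the language-model update, hot words first.
counts = collect_word_counts(["https://news.example.com/hot"])
for word, freq in counts.most_common(20):
    print(word, freq)
```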
Based on the above, the detailed flow by which this embodiment performs speech recognition on broadcast television data is shown in Fig. 4 and specifically includes:

First, the broadcast television data is received and sent to the preprocessing terminal for audio/video decoding; audio data is extracted from it and then cut and fragmented. Speech/non-speech discrimination is performed on the fragmented sentence files: speech continues to the next step, while non-speech is marked as such and not processed further. The speech sentence files then undergo wideband/narrowband discrimination and speech feature extraction, and the resulting feature text data is passed to the speech recognition "cloud", i.e. sent as an XML-message recognition task to the cloud server for speech recognition processing. The cloud service platform at the server end performs gender identification, speaker identification, and speech content and punctuation recognition on it, merges the recognition results, and feeds them back to the service platform; at the same time, new network words, popular vocabulary, and the like learned from the web are used to update the cloud service platform's language model library regularly, ensuring recognition accuracy. Finally, the cloud server feeds the recognition result, i.e. the structured speech recognition result, back to the user in XML form for reference, retrieval, and further intelligent processing.
The recognition method provided by this embodiment improves on existing speech recognition methods on the basis of cloud computing, combining broadcast television data preprocessing techniques, gender identification technology, speaker identification technology, and a radio and television speech recognition method. Speech data is preprocessed and then recognized specifically according to the data-processing needs of the broadcast television industry; the preprocessing results, gender identification results, speaker identification results, and speech recognition results are merged and annotated as structured text to generate a structured speech recognition result, which can supply base data for the subsequent intelligent, automated processing of other broadcast television services, specifically including the following:

1) the speech recognition results and the timestamp annotations of the spoken words can provide base data for retrieval services over radio and television speech content;

2) the cut-point annotations of the speech sentences and the wideband/narrowband discrimination results can provide boundary time-point references for splitting broadcast television programs;

3) the recognition of speech content and punctuation in radio and television can provide content references for subtitle recognition in broadcast television programs;

4) the speaker identification of the speech sentences and the wideband/narrowband discrimination results can provide a basis for presenter identification, guest identification, and speaking-scene recognition (indoor scene, outdoor scene) in broadcast television programs.

In addition, the processing speed is increased, making it possible to cope with speech recognition over massive data, and because the language model library is periodically learned and updated, the accuracy of speech recognition can also be improved.
Embodiment two
Embodiment two of the present invention further provides a radio and television speech recognition system whose composition is shown schematically in Fig. 5. The system includes:

an extraction unit 10, which extracts audio data from broadcast television data;

a preprocessing terminal 20, which preprocesses the audio data to obtain feature text data and sends it to a cloud server 30;

a cloud server 30, which performs recognition processing on the feature text data to obtain speech recognition results, merges the recognition results, and applies structured text annotation to generate a structured speech recognition result.
Preferably, the composition of the preprocessing terminal 20 in this embodiment is shown schematically in Fig. 6 and specifically includes:

a cutting module 21, which cuts and fragments the audio data to generate a number of sentence files;

a non-speech filtering module 22, which filters non-speech content from the sentence files, leaving speech sentence files;

a wideband/narrowband discrimination module 23, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals;

an audio feature extraction module 24, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, where the feature text data includes the start and end times of each speech sentence, the speech feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband identifier.
Preferably, the composition of the cloud server 30 in this embodiment is shown schematically in Fig. 7 and specifically includes:

a gender identification module 31, for performing gender (male/female voice) identification on the feature text data.

Physiologically and psychologically, male and female speech differ markedly, for example in the fundamental frequency produced by the vocal folds, the formant frequencies produced by the structures of the vocal tract (pharynx, tongue, palate, lips, teeth, etc.), and the volume and power of the exhaled airflow, so the speech signal carries the speaker's gender characteristics. In this embodiment, gender identification (i.e. speaker sex identification) is built within a GMM-SVM (Gaussian Mixture Models - Support Vector Machines) framework using total variability modeling. When training the space matrix, total variability modeling does not partition speaker space and channel space separately; both are represented by a single total space, which simplifies the mathematical representation of the space and greatly reduces the dependence on training data. The final gender decision is produced by fusing multiple systems.
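The full GMM-SVM / total-variability front end is beyond a short example; as a much-simplified stand-in, the sketch below trains one Gaussian mixture per sex on MFCC frames and decides by average log-likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gender_models(male_frames: np.ndarray, female_frames: np.ndarray):
    """Each argument is a (n_frames, n_features) MFCC matrix."""
    male = GaussianMixture(n_components=32, covariance_type="diag").fit(male_frames)
    female = GaussianMixture(n_components=32, covariance_type="diag").fit(female_frames)
    return male, female

def classify_gender(male, female, frames: np.ndarray) -> str:
    """score() returns the average per-frame log-likelihood under each model."""
    return "male" if male.score(frames) > female.score(frames) else "female"
```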
a speaker identification module 32, for performing speaker identification on the feature text.

In this embodiment, speaker identification is based on two classes of differences between speakers: first, their vocal tract spectral characteristics differ, which shows up as different distributions of the acoustic features of their pronunciation; second, their high-level features differ, formed over time by different living environments and backgrounds, such as habitual expressions, prosody, and linguistic structure. The mainstream speaker identification systems in the world today are essentially all based on these features and solve the speaker identification problem by statistical modeling. Specifically, the speaker identification system includes the following two modules:

a) a speaker modeling tool module: speakers are modeled by discriminatively trained methods such as the support vector machine (SVM) or by statistical modeling methods such as the Gaussian mixture model (GMM), characterizing the distribution of each speaker's feature space so as to distinguish different speakers;

b) a speaker discrimination module: the features of the input speech are matched against the corresponding speaker models, and the speaker identity of the input speech is decided according to the degree of match.
a speech content and punctuation recognition module 33, for performing speech content recognition and punctuation recognition on the feature text, generating a speech recognition result containing identifiers.

The module comprises four components: an acoustic model library, a language model library, search and decoding, and punctuation generation; its workflow is shown in Fig. 8. After the speech features are input, the search and decoding component, according to whether the features belong to a wideband or a narrowband signal, selects and calls the intelligently learned acoustic model library together with the language model library to recognize the speech content; the text (sentences) generated by recognition is fed into the punctuation generation module for punctuation recognition, finally producing a speech recognition result carrying punctuation identifiers.
The recognition techniques adopted by the four components are as follows:

a) acoustic model library: this embodiment uses an acoustic model library based on CD-DNN-HMM (context-dependent deep neural network hidden Markov models), whose recognition accuracy is higher than that of traditional acoustic model libraries based on GMM-HMM (Gaussian mixture model hidden Markov models).
b) language model library: this embodiment uses an N-gram language model. The model rests on the assumption that the occurrence of the Nth word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a whole sentence is simply the product of the occurrence probabilities of its words. These probabilities can be obtained directly from a corpus by counting how often each group of N words occurs together. The N-gram language model is simple and effective and is widely used in the speech recognition industry.
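For illustration, a maximum-likelihood bigram model (N = 2) can be estimated exactly as described, by counting how often word pairs occur together in a corpus:

```python
import collections
import math

def train_bigram(sentences):
    """sentences: lists of words. Returns conditional probabilities P(w2 | w1)."""
    unigrams, bigrams = collections.Counter(), collections.Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

def sentence_log_prob(model, words, floor=1e-8):
    """log P(sentence) = sum of log P(w_i | w_{i-1}); unseen pairs are floored."""
    return sum(math.log(model.get(pair, floor)) for pair in zip(words, words[1:]))
```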
c) search and decoding: this embodiment uses dynamic programming methods such as the Viterbi search algorithm to find the optimal result under the given models. The dynamic-programming-based Viterbi algorithm computes, at every state of every time point, the posterior probability of the decoded state sequence given the observation sequence, retains the path with maximum probability, and records the corresponding state information at each node so that the word decoding sequence can be recovered in reverse at the end. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the nonlinear time alignment between the HMM state sequence and the acoustic observation sequence, word boundary detection, and word recognition in continuous speech recognition, and it is the basic search strategy of conventional speech recognition.
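A compact sketch of Viterbi decoding over an HMM in log space, retaining the maximum-probability path with backpointers as described (the model matrices are assumed given):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,); log_trans: (S, S); log_emit: (T, S) per-frame scores.

    Returns the most probable state sequence of length T.
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers per time step
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]             # recover the path in reverse
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```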
d) punctuation generation: this embodiment adopts a method that uses plain-text information to add punctuation at the ends of Chinese spoken sentences. Starting from different sentence granularities, the method models the relationship between global lexical information and punctuation, and uses a multilayer perceptron to fuse the punctuation models obtained at the different granularities, thereby generating punctuation (periods, question marks, and exclamation marks).
a recognition result processing module 34, which merges the speech recognition results and applies structured text annotation to generate a structured speech recognition result. In this embodiment, the recognition result processing module 34 first collects and merges the speech recognition results of the individual speech sentence files in the broadcast television data (with punctuation, and with a timestamp on each word).
Preferably, the recognition result processing module 34 in this embodiment further includes:

a collection and ordering module, for collecting and aligning the individual recognition results and sorting them by the start and end times they contain;

an annotation module, for marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps.
Preferably, the cloud server 30 in this embodiment also includes a language model intelligent learning module 35, for periodically crawling network text and regularly updating the language model library by learning from it; during recognition processing, recognition is performed according to the regularly updated language model library so as to guarantee the accuracy of speech recognition.
The cloud server 30 in this embodiment is realized on a speech recognition cloud service platform 36. Specifically, the cloud service platform framework is built by combining ICE with SOA: distributed computation is completed within the ICE framework, cloud services are provided externally through the SOA framework, and the Web Service-based delivery of recognition tasks and communication of recognition results are completed.

In the service platform of this embodiment, the various recognition modules (i.e. the gender identification module 31, the speaker identification module 32, the speech content and punctuation recognition module 33, and the recognition result processing module 34) are encapsulated as plug-ins that form standard cloud services configured within the framework, becoming part of the cloud service platform. The recognition modules can be added to and removed from the platform easily without affecting normal system operation; when the volume of data to be recognized grows, the cloud service platform adds recognition modules adaptively so as to complete massive radio and television speech recognition tasks.
The cloud service platform framework is shown in Fig. 9. After the broadcast television data has been preprocessed, the speech recognition task is passed as an XML task message to a control unit through the data access interface. According to the state of the current computing resources (collected by a monitoring unit), chiefly CPU, memory, and network state, combined with the task execution states of the recognition nodes, the task priorities, and prior knowledge of execution efficiency, the control unit dynamically decides on and allocates the optimal computing resources to execute the recognition task.
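A toy sketch of such a dispatch decision, scoring nodes by monitored CPU, memory, and queue state; the weights and the node fields are assumptions, not values from the platform:

```python
def pick_node(nodes: list, task_priority: int) -> dict:
    """nodes: dicts like {"name": "n1", "cpu": 0.3, "mem": 0.5, "queue": 2}.

    Lower load wins; higher-priority tasks tolerate long queues less.
    """
    def cost(n):
        return (0.5 * n["cpu"] + 0.3 * n["mem"]
                + 0.2 * n["queue"] * (1 + task_priority))
    return min(nodes, key=cost)

best = pick_node(
    [{"name": "n1", "cpu": 0.3, "mem": 0.5, "queue": 2},
     {"name": "n2", "cpu": 0.7, "mem": 0.2, "queue": 0}],
    task_priority=1,
)
print(best["name"])
```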
In summary, the recognition system provided by this embodiment combines broadcast television data preprocessing techniques, gender identification technology, speaker identification technology, and a radio and television speech recognition method. Speech data is preprocessed and then recognized specifically according to the data-processing needs of the broadcast television industry; the preprocessing results, gender identification results, speaker identification results, and speech recognition results are merged and annotated as structured text to generate a structured speech recognition result, which can supply base data for the subsequent intelligent, automated processing of other broadcast television services. Moreover, because the fragmented speech data is processed in parallel, the processing speed is increased and the system can cope with speech recognition over massive data; at the same time, because the language model library undergoes periodic intelligent learning and updating, the accuracy of speech recognition can be improved.
The above embodiments serve only to illustrate the present invention and do not limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical schemes therefore fall within the scope of the invention, whose patent protection shall be defined by the claims.

Claims (2)

1. A radio and television speech recognition method, characterized by comprising:

S1, extracting audio data from broadcast television data;

S2, preprocessing the audio data to obtain feature text data; wherein the preprocessing of the audio data in step S2 specifically includes:

S21, cutting and fragmenting the audio data to generate a number of sentence files;

S22, filtering non-speech content from the sentence files, leaving speech sentence files;

S23, performing wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals;

S24, performing audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, wherein the feature text data includes the start and end times of each speech sentence, the speech feature information, the name of the audio/video file to which the sentence belongs, and the corresponding wideband/narrowband identifier;

S3, sending the feature text data to a cloud server for recognition processing to obtain gender identification, speaker identification, and speech recognition results; the recognition processing performed after the feature text data is sent to the cloud server in step S3 includes gender identification, speaker identification, speech content recognition, and punctuation recognition, generating a speech recognition result containing identifiers; the recognition processing of step S3 is performed according to a language model library, and the language model library is continuously updated through network text collection and network text learning; the updating of the language model library includes:

S31, periodically crawling network text from the Internet;

S32, word-segmenting the web documents in the collected network text, computing word frequencies and word counts, and entering the segmentation results, network hot-word collection results, and statistics into the language model library of the speech recognition system for the individual recognition modules to consult, thereby achieving regular updates of the language model library and ensuring the accuracy of radio and television speech recognition;

S4, merging the preprocessing results, gender identification, speaker identification, and speech recognition results and applying structured text annotation to generate a structured speech recognition result; the merging and structured text annotation of the speech recognition results in step S4 specifically includes:

S41, collecting and aligning the individual recognition results and sorting them by the start and end times they contain;

S42, marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps.
2. A radio and television speech recognition system, characterized in that the system includes:

an extraction unit, which extracts audio data from broadcast television data;

a preprocessing terminal, which preprocesses the audio data to obtain feature text data and sends it to a cloud server; the preprocessing terminal includes:

a cutting module, which cuts and fragments the audio data to generate a number of sentence files;

a non-speech filtering module, which filters non-speech content from the sentence files, leaving speech sentence files;

a wideband/narrowband discrimination module, which performs wideband/narrowband discrimination on each speech sentence file, adding a wideband identifier to speech sentence files judged to be wideband signals and a narrowband identifier to speech sentence files judged to be narrowband signals;

an audio feature extraction module, which performs audio feature extraction on the speech sentence files carrying wideband or narrowband identifiers to obtain feature text data, wherein the feature text data includes the start and end times of each speech sentence, the name of the audio/video file to which it belongs, and the corresponding wideband/narrowband identifier;

a cloud server, which performs recognition processing on the feature text data to obtain speech recognition results, merges the recognition results, and applies structured text annotation to generate a structured speech recognition result; the cloud server includes:

a gender identification module, for performing gender (male/female voice) identification on the feature text data;

a speaker identification module, for performing speaker identification on the feature text;

a speech content and punctuation recognition module, for performing speech content recognition and punctuation recognition on the feature text, generating a speech recognition result containing punctuation identifiers;

a recognition result processing module, which merges the speech recognition results and applies structured text annotation to generate a structured speech recognition result; the recognition result processing module further includes:

a collection and ordering module, for collecting and aligning the individual recognition results and sorting them by the start and end times they contain;

an annotation module, for marking the sorted recognition results in a structured format, including speaker gender identifiers, speaker identifiers, speech content, punctuation marks, and timestamps;

the cloud server also includes a language model intelligent learning module, for periodically crawling network text and regularly updating the language model library by learning from it, recognition during recognition processing being performed according to the regularly updated language model library; the language model intelligent learning module is configured to perform the following steps:

S31, periodically crawling network text from the Internet;

S32, word-segmenting the web documents in the collected network text, computing word frequencies and word counts, and entering the segmentation results, network hot-word collection results, and statistics into the language model library of the speech recognition system for the individual recognition modules to consult, thereby achieving regular updates of the language model library and ensuring the accuracy of radio and television speech recognition.
CN201310648375.4A 2013-12-04 2013-12-04 Radio and television speech recognition method and system Active CN103700370B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310648375.4A CN103700370B (en) 2013-12-04 2013-12-04 Radio and television speech recognition method and system


Publications (2)

Publication Number Publication Date
CN103700370A CN103700370A (en) 2014-04-02
CN103700370B 2016-08-17

Family

ID=50361876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310648375.4A Active CN103700370B (en) Radio and television speech recognition method and system

Country Status (1)

Country Link
CN (1) CN103700370B (en)

CN113825009A (en) * 2021-10-29 2021-12-21 平安国际智慧城市科技股份有限公司 Audio and video playing method and device, electronic equipment and storage medium
CN115456150B (en) * 2022-10-18 2023-05-16 北京鼎成智造科技有限公司 Reinforced learning model construction method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0952737A2 (en) * 1998-04-21 1999-10-27 International Business Machines Corporation System and method for identifying and selecting portions of information streams for a television system
CN101539929A (en) * 2009-04-17 2009-09-23 无锡天脉聚源传媒科技有限公司 Method for indexing TV news by utilizing computer system
CN101924863A (en) * 2010-05-21 2010-12-22 中山大学 Digital television equipment
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103413557A (en) * 2013-07-08 2013-11-27 深圳Tcl新技术有限公司 Voice signal bandwidth expansion method and device thereof

Also Published As

Publication number Publication date
CN103700370A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
CN103700370B (en) A kind of radio and television speech recognition system method and system
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
US11823678B2 (en) Proactive command framework
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
Theodorou et al. An overview of automatic audio segmentation
CN108428446A (en) Audio recognition method and device
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN103500579B (en) Audio recognition method, Apparatus and system
US11276403B2 (en) Natural language speech processing application selection
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
CN112735383A (en) Voice signal processing method, device, equipment and storage medium
CN109976702A (en) A kind of audio recognition method, device and terminal
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN111489765A (en) Telephone traffic service quality inspection method based on intelligent voice technology
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN111489743A (en) Operation management analysis system based on intelligent voice technology
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
Kaushik et al. Automatic audio sentiment extraction using keyword spotting.
CN105957517A (en) Voice data structural transformation method based on open source API and system thereof
WO2023272616A1 (en) Text understanding method and system, terminal device, and storage medium
CN112231440A (en) Voice search method based on artificial intelligence
US10929601B1 (en) Question answering for a multi-modal system
CN103247316A (en) Method and system for constructing index in voice frequency retrieval
CN112201225B (en) Corpus acquisition method and device, readable storage medium and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Chen Xinwei

Inventor before: Chen Xinwei

Inventor before: Xu Bo

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: CHEN XINWEI XU BO TO: CHEN XINWEI

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant