CN105957531A

CN105957531A - Speech content extracting method and speech content extracting device based on cloud platform

Info

Publication number: CN105957531A
Application number: CN201610260647.7A
Authority: CN
Inventors: 俞凯; 谢其哲; 吴学阳; 李文博; 郭运奇
Original assignee: Shanghai Jiaotong University
Current assignee: Sipic Technology Co Ltd
Priority date: 2016-04-25
Filing date: 2016-04-25
Publication date: 2016-09-21
Anticipated expiration: 2036-04-25
Also published as: CN105957531B

Abstract

The invention discloses a speech content extracting method and a speech content extracting device based on a cloud platform. The speech content extracting method is characterized in that audio materials and video materials of a speech are acquired, and are cached in a PC for a pretreatment; the pretreated audio materials, the pretreated video materials, related data including speech slides, and related reading materials are transmitted to a server; the server is used for voice segmentation of the received audio materials, and is used for segmenting the audio materials according to speakers; automatic voice identification is used for converting the segmented audio materials into words, and adopts acoustic self-adaption and language model self-adaption; key words are extracted from texts after the voice identification, and content notes are generated. By adopting the speech content extracting method, the audio materials are converted into the text form, which can be read repeatedly, by the audio identification, and the language model self-adaption and the acoustic model self-adaption are used to improve the identification accuracy. Because of knowledge integration, time for reading redundant information is saved. The invention also discloses the speech content extracting device based on the cloud platform. The s speech content extracting device comprises a speech recording module, a material transmitting module, a voice segmenting module, a voice identifying module, and a key word and a content note extracting module.

Description

Speech content extracting method based on cloud platform and device

Technical field

The present invention relates to the technology of a kind of word processing field, a kind of speech content extracting method based on cloud platform And device.

Background technology

In the information age, the development and progress of technology makes us can obtain every day from information all over the world, through the ages, The quantity of these information exceeds well over people can listen to the scope of digestion.In order to help the more efficient acquisition information of people, voice signal Process and natural language processing technique can effectively automatically process magnanimity information, and extract key message therein and content for people Quickly reading.

In life, everyone listens to substantial amounts of information by the channel such as media, classroom every day, and these information retrievals are become The textual form that can repeatedly read becomes most important, and it makes the people can quickly reading learning, language model adaptation and acoustics Model self-adapting method improves the accuracy rate of speech recognition.And carry out Knowledge Integration, it is to avoid time flower is being read in redundancy.

Find that Chinese patent literature CN102292766B discloses one " for speech processes through the retrieval of existing technology Method and apparatus ", the method and device relate to provide for the method for framework of the adaptive composite model of speech recognition, dress Putting and computer program, phonetic feature preference pattern based on specific enunciator improves recognition accuracy.But the method does not relates to And the accuracy rate for specialized vocabulary is improved for language model adaptation.

Retrieve discovery, Chinese patent literature CN102122506A further, disclose one " method of speech recognition ", should System utilizes the text train language model that search engine retrieving is relevant, it is possible to increase phonetic recognization rate, reduces the work of artificial check and correction Amount.But the method need to utilize external search engine, the longest, it is unfavorable for processing a large amount of voice.

Summary of the invention

The present invention is directed to deficiencies of the prior art, propose a kind of speech content extracting method based on cloud platform and device, By speech recognition, audio identification is become the textual form that can repeatedly read, use language model adaptation and acoustics model adaptation to carry High recognition accuracy.And carry out Knowledge Integration, it is to avoid time flower is being read in redundancy.

The present invention is achieved by the following technical solutions:

The present invention relates to a kind of speech content extracting method based on cloud platform, including:

Step 1) gather the audio frequency and video of speech, by the audio and video buffer that collects to PC, and carry out pretreatment；

Step 2) send pretreated audio frequency and video and related data include giving a lecture lantern slide, relevant reading material waits until server；

Step 3) server to receive audio frequency carry out phonetic segmentation, audio frequency by speaker split；

Step 4) carry out automatic speech recognition the audio conversion after segmentation is changed to word, speech recognition uses acoustics self adaptation and language Model adaptation；

Step 5) from the text of speech recognition, extract keyword and generate content notes.

Described collection, is preferably used mike, images the audio frequency and video that first-class equipment collection is given a lecture, utilize wired or wireless network It is cached in PC simultaneously；

Use PC that audio frequency carries out speech enhan-cement and remove noise, and be compressed audio frequency and video processing.

The mode of described phonetic segmentation is that server carries out voice activity detection to the audio frequency received, and carries out according to the pause of voice Cutting；The described speaker that mode is every section of voice of identification splitting voice by speaker, splits audio frequency by speaker.

Described acoustics self adaptation includes the adaptation to playback environ-ment, noise types, speaker's type etc.；

Described language model adaptation includes the adaptation of specialized vocabulary in courseware and relevant reading material.

Described extraction includes: extract keyword relevant with speech content in the text of speech recognition, and according in text every The notes relevant with speech are extracted to the degree of association of speech content.

The present invention relates to a kind of speech content extraction element realizing said method, including: it is used for gathering speech audio frequency and video, will adopt Collect to audio and video buffer in the PC in classroom, and the speech carrying out pretreatment records module, for sending pretreated sound Video and related data include that speech lantern slide, relevant reading material wait until the material sending module of server, for the sound received Frequency carries out phonetic segmentation, audio frequency is split module by the voice that speaker is split, for carrying out automatic speech recognition the sound after segmentation Frequency is converted to word, speech recognition use acoustics self adaptation and the sound identification module of language model adaptation and for server from Word extracts keyword and generates keyword and the content notes extraction module of content notes.

Described speech is recorded module and is used for using mike, imaging the audio frequency and video that first-class equipment collection is given a lecture, and utilizes wired or nothing Gauze network is cached in PC simultaneously, uses PC that audio frequency carries out speech enhan-cement and removes noise, and is compressed audio frequency and video processing.

Described phonetic segmentation, for the audio frequency received is carried out voice activity detection, carries out cutting according to the pause of voice；Described Split voice for identifying the speaker of every section of voice by speaker, split audio frequency by speaker.

Described sound identification module is for using automatic speech recognition to obtain every text corresponding to audio frequency, and described acoustics is adaptive It is applied to the adaptation to playback environ-ment, noise types, speaker's type etc.；Described language model adaptation is for speech magic lantern The adaptation of specialized vocabulary in sheet and relevant reading material.

Described keyword and content notes extraction module in the text extracting speech recognition with key that speech content is relevant Word, and extract and relevant notes of giving a lecture to the degree of association of speech content according in text every.

Technique effect

Compared with prior art, the present invention becomes audio identification by speech recognition the textual form that can repeatedly read, and uses language Model adaptation and acoustics model adaptation improve recognition accuracy.And carry out Knowledge Integration, it is to avoid time flower is being read redundancy letter On breath.

Accompanying drawing explanation

Fig. 1 is the inventive method flow chart；

Fig. 2 is apparatus of the present invention structural representations.

Detailed description of the invention

Embodiment 1

The present embodiment comprises the following steps:

101, gather the audio frequency and video of speech, by the audio and video buffer that collects to PC, and carry out pretreatment；

In present example, gather the audio frequency and video of speech, by the audio and video buffer that collects to PC, and carry out pretreatment Including using mike, imaging the audio frequency and video that first-class equipment collection is given a lecture, utilize wired or wireless network to be cached in PC simultaneously； Use PC that audio frequency carries out speech enhan-cement and remove noise, and be compressed audio frequency and video processing.

102, send pretreated audio frequency and video and related data includes that speech lantern slide, relevant reading material wait until server；

103, server carries out phonetic segmentation, audio frequency is split by speaker the audio frequency received；

In present example, the mode of described phonetic segmentation is that server carries out voice activity detection to the audio frequency received, and presses Pause according to voice carries out cutting；The described speaker that mode is every section of voice of identification splitting voice by speaker, by speaker Segmentation audio frequency.

104, carrying out automatic speech recognition and the audio conversion after segmentation is changed to word, speech recognition uses acoustics self adaptation and language Model adaptation；

In present example, described acoustics self adaptation includes the adaptation to playback environ-ment, noise types, speaker's type etc.； Described language model adaptation includes the adaptation of specialized vocabulary in speech lantern slide and relevant reading material.

105, from the text of speech recognition, extract keyword and generate content notes.

In present example, from the text of speech recognition, from the text of speech recognition, extract keyword and generate content notes Including: extract keyword relevant with speech content in the text of speech recognition, and relevant to speech content according in text every Degree extracts the notes relevant to speech.

Embodiment 2

As in figure 2 it is shown, the data serching device structural representation based on interactive mode input provided for the embodiment of the present invention, this dress Put and include: module 21 material sending module 22 voice segmentation module 23 sound identification module 24 and keyword and content are recorded in speech Notes extraction module 25.

Module 21 is recorded in described speech, is used for gathering speech audio frequency and video, by the PC of the audio and video buffer that collects to classroom In, and carry out pretreatment；

Described speech is recorded module 21 and is used for using mike, imaging the audio frequency and video that first-class equipment collection is given a lecture, and utilizes wired Or wireless network is cached in PC simultaneously, uses PC that audio frequency carries out speech enhan-cement and remove noise, and audio frequency and video are compressed place Reason.

Such as using video camera to record a degree of depth learned lesson, teacher wears clip-on microphone, answer a question The raw wireless microphone that uses, video that caching is recorded and audio frequency, in the PC in classroom, use filter method such as adaptive cancellation method to remove and carry on the back Scape sound such as air conditioner noises, construction noise etc., compression Video & Audio makes file size be suitable for network transmission.

Described material sending module 22, is used for sending pretreated audio frequency and video and related data includes give a lecture lantern slide, phase Close reading material and wait until server.

Specifically, audio frequency and video, degree of depth study lantern slide and degree of depth study reading material after transmission speech enhan-cement, compression wait until Http server.

Described voice segmentation module 23, for carrying out phonetic segmentation, audio frequency being split by speaker to the audio frequency received.

In described voice segmentation module 23, phonetic segmentation is for carrying out voice activity detection, according to voice to the audio frequency received Pause carries out cutting；Split voice for identifying the speaker of every section of voice by speaker, split audio frequency by speaker.

Specifically, detect the part being syncopated as voice according to short-time energy and zero-crossing rate, and extract the i-vector of every section of voice Identify speak artificial teacher and different students.

Described sound identification module 24, is changed to word for carrying out automatic speech recognition the audio conversion after segmentation, and voice is known Shi Yong acoustics self adaptation and language model adaptation.

Described sound identification module 24 is for using automatic speech recognition to obtain every text corresponding to audio frequency, described acoustics Self adaptation is for the adaptation to playback environ-ment, noise types, speaker's type etc.；Described language model adaptation is for speech The adaptation of specialized vocabulary in lantern slide and relevant reading material.

Specifically, during training acoustic model, audio frequency is clustered by i-vector, the audio frequency of each cluster is trained one based on The acoustic model of deep neural network, finds cluster nearest for its i-vector when identifying audio frequency, and uses this acoustic model clustered.

Use mass text to extract the reverse document-frequency of each word, use TF-IDF statistics degree of depth study courseware and extension to read In key word.As for extension read " gradient decline (GD) be a kind of common method minimizing risk function, loss function, It is two kinds of iterative thinkings that stochastic gradient descent and batch gradient decline.Gradient declines in batches---minimize the damage of all training samples Losing function so that finally solve is the optimal solution of the overall situation, the parameter i.e. solved is so that risk function is minimum.Stochastic gradient descent ---minimize the loss function of every sample, although be not the loss function that obtains of each iteration all towards global optimum direction, but Big overall direction to globally optimal solution, final result is often near globally optimal solution.", expansion can be extracted out Open and read the keyword " gradient decline " in reading, " stochastic gradient descent ", " gradient decline in batches ", " loss function " etc., and some are conventional Word is such as " common method ", and " a kind of ", " minimizing " etc. then can be not listed as key word because TF-IDF weights are the lowest.

When using language model based on recurrent neural network to calculate complexity (perplexity) in short, it is assumed that model parameter is θ, then the former computing formula of complexity perplexity is:Wherein: N is this sentence The length of son, for the keyword in this field, then complexity perplexity can be written as:

Work as w_iFor the keyword in this field, then q (w_i) it is 1, it is otherwise 0.λ is hyper parameter.The method is used to improve right Discrimination in specialized vocabulary.

Keyword and content notes extraction module 25, extract keyword from word for server and generate content notes.

Described keyword and content notes extraction module 25 in the text extracting speech recognition with pass that speech content is relevant Key word, and extract and relevant notes of giving a lecture to the degree of association of speech content according in text every.

In this instance, the text later such as through speech recognition be " for a lot of machine learning algorithms, including linear regression, Logistic regression, neutral net etc., the realization of algorithm is all by showing that certain cost function or certain optimized target are come real Existing, then use the such method of gradient decline to be used as optimized algorithm and try to achieve the minima of cost function.Training set when us Time bigger, gradient descent algorithm then seems that amount of calculation is the biggest in batches.Assume that you have the picture of 10,000,000 cats, carry out once batch Gradient descent algorithm is equivalent to read through this ten million photo, and we need to look for some the most shorter methods to find most cats Characteristic.In this course, I wants to introduce and a kind of declines different methods with batch gradient: stochastic gradient descent.”

Being similar to, analyzed by TF-IDF, we can draw occur seldom tying this section of speech recognition in daily text Occurring more word " gradient decline " in Guo, " stochastic gradient descent ", " neutral net " is as key word, and obtains their TF-IDF Weights.

The weights calculating sentence afterwards are the meansigma methods of each word TF-IDF weights in sentence, and export the sentence work that weights are the highest Take down notes for content, " for a lot of machine learning algorithms, including linear regression, logistic regression, neutral net etc., the realization of algorithm All by showing that certain cost function or certain optimized target realize, then use gradient to decline such method and come The minima of cost function is tried to achieve as optimized algorithm.When our training set is bigger, gradient descent algorithm then seems calculating in batches Measure the biggest.In this course, I wants to introduce and a kind of declines different methods with batch gradient: stochastic gradient descent.”

The device that the embodiment of the present invention provides, is become audio identification by speech recognition the textual form that can repeatedly read, makes term Speech model adaptation and acoustics model adaptation improve recognition accuracy.And carry out Knowledge Integration, it is to avoid time flower is being read redundancy In information.

One of ordinary skill in the art will appreciate that all or part of step realizing above-described embodiment can be completed by hardware, Can also instruct relevant hardware by program to complete, this program can be stored in a kind of computer-readable recording medium, above-mentioned The storage medium mentioned can be read only memory, disk or CD etc..

Above-mentioned be embodied as can by those skilled in the art on the premise of without departing substantially from the principle of the invention and objective in a different manner It is carried out local directed complete set, and protection scope of the present invention is as the criterion with claims and is not limited by above-mentioned being embodied as, in its scope Each interior implementation is all by the constraint of the present invention.

Claims

1. a speech content extracting method based on cloud platform, it is characterised in that including:

Method the most according to claim 1, is characterized in that, described collection includes: uses mike, image first-class setting The standby audio frequency and video gathering speech, utilize wired or wireless network to be cached in PC simultaneously；Use PC that audio frequency is carried out speech enhan-cement Except noise, and it is compressed audio frequency and video processing.

Method the most according to claim 1, is characterized in that, the mode of described phonetic segmentation is the server sound to receiving Frequency carries out voice activity detection, carries out cutting according to the pause of voice；Described splits the mode of voice for identifying every section by speaker The speaker of voice, splits audio frequency by speaker.

Method the most according to claim 1, is characterized in that, described acoustics self adaptation includes playback environ-ment, noise-like The adaptation of type, speaker's type etc.；Described language model adaptation includes specialty word in speech lantern slide and relevant reading material The adaptation converged.

Method the most according to claim 1, is characterized in that, described extraction includes: extract speech recognition text in The keyword that speech content is relevant, and extract and relevant notes of giving a lecture to the degree of association of speech content according in text every.

6. the speech content extraction element realizing method described in any of the above-described claim, it is characterised in that including:

Module is recorded in speech, is used for gathering speech audio frequency and video, by the audio and video buffer that collects to the PC in classroom, and carries out Pretreatment,

Material sending module, be used for sending pretreated audio frequency and video and speech lantern slide, relevant reading material to server,

Voice segmentation module, for receive audio frequency carry out phonetic segmentation, audio frequency by speaker segmentation,

Sound identification module, is changed to word for carrying out automatic speech recognition the audio conversion after segmentation, and speech recognition uses acoustics Self adaptation and language model adaptation,

Keyword and content notes extraction module, extract keyword from word for server and generate content notes.

Device the most according to claim 6, is characterized in that, described speech is recorded module and adopted by mike, photographic head The audio frequency and video of collection speech, utilize wired or wireless network to be cached in PC simultaneously, use PC that audio frequency carries out speech enhan-cement removal and make an uproar Sound, and be compressed audio frequency and video processing.

Device the most according to claim 6, is characterized in that, described phonetic segmentation carries out speech activity to the audio frequency received Detection, carries out cutting according to the pause of voice；Described splits voice for identifying the speaker of every section of voice, normally by speaker Words people splits audio frequency.

Device the most according to claim 6, is characterized in that, described sound identification module is used for using automatic speech recognition Obtaining every text corresponding to audio frequency, described acoustics self adaptation is for playback environ-ment, noise types, the adaptation of speaker's type； Described language model adaptation is for the adaptation of specialized vocabulary in speech lantern slide and relevant reading material.

Device the most according to claim 6, is characterized in that, described keyword and content notes extraction module are used for carrying Take keyword relevant with speech content in the text of speech recognition, and according to the degree of association of in text every with speech content extract with The notes that speech is relevant.