CN105957531B - Speech content extraction method and device based on cloud platform - Google Patents


Info

Publication number
CN105957531B
CN105957531B (application CN201610260647.7A)
Authority
CN
China
Prior art keywords
audio
voice
speech
video
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610260647.7A
Other languages
Chinese (zh)
Other versions
CN105957531A (en)
Inventor
俞凯
谢其哲
吴学阳
李文博
郭运奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610260647.7A priority Critical patent/CN105957531B/en
Publication of CN105957531A publication Critical patent/CN105957531A/en
Application granted granted Critical
Publication of CN105957531B publication Critical patent/CN105957531B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems
    • G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A speech content extraction method and device based on a cloud platform comprise the following steps: collecting the audio and video of a speech, caching the collected audio and video to a PC (personal computer), and preprocessing them; sending the preprocessed audio and video, together with related data including lecture slides and related reading materials, to a server; the server performing voice segmentation on the received audio and segmenting the audio by speaker; performing automatic speech recognition to convert the segmented audio into text, the recognition using acoustic adaptation and language model adaptation; and extracting keywords from the recognized text and generating content notes. The method converts the audio into a repeatedly readable text form through speech recognition and improves recognition accuracy using language model adaptation and acoustic model adaptation. Knowledge integration is also performed, so that time is not spent reading redundant information. The invention also discloses a speech content extraction device based on the cloud platform, which comprises a speech recording module, a material sending module, a voice segmentation module, a voice recognition module, and a keyword and content note extraction module.

Description

Speech content extraction method and device based on cloud platform
Technical Field
The invention relates to a technology in the field of text processing, in particular to a method and a device for extracting speech content based on a cloud platform.
Background
In the information age, technological progress has made it possible to obtain information from all over the world, ancient and modern, in quantities far exceeding what any person can take in. To help people acquire information more efficiently, speech signal processing and natural language processing technologies can automatically process this massive amount of information and extract its key content for quick reading.
In daily life, everyone takes in a large amount of information through channels such as the media and classes, so extracting this information into a text form that can be read repeatedly becomes important, allowing people to read and learn quickly. Language model adaptation and acoustic model adaptation improve the accuracy of speech recognition, and knowledge integration prevents time from being spent reading redundant information.
A search of the prior art found that Chinese patent document CN102292766B discloses a "method and apparatus for speech processing", relating to a method, apparatus, and computer program product that provide an architecture of composite models for speech recognition adaptation, selecting a model based on the speech characteristics of a specific speaker to improve recognition accuracy. However, this approach does not involve adapting the language model to improve accuracy on professional vocabulary.
Further search found that Chinese patent document CN102122506A discloses a speech recognition method in which the system uses a search engine to retrieve relevant text for training language models, thereby improving the speech recognition rate and reducing the workload of manual proofreading. However, the method requires an external search engine, is time-consuming, and is ill-suited to processing large amounts of speech.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a speech content extraction method and device based on a cloud platform, which recognize audio into a repeatedly readable text form through speech recognition and improve recognition accuracy using language model adaptation and acoustic model adaptation. Knowledge integration is also performed, so that time is not spent reading redundant information.
The invention is realized by the following technical scheme:
the invention relates to a speech content extraction method based on a cloud platform, which comprises the following steps:
step 1) collecting audio and video of a speech, caching the collected audio and video into a PC (personal computer), and preprocessing the audio and video;
step 2) sending the preprocessed audio and video, together with related data including lecture slides and related reading materials, to a server;
step 3) the server performs voice segmentation on the received audio and segments the audio according to speakers;
step 4) performing automatic speech recognition to convert the segmented audio into text, the recognition using acoustic adaptation and language model adaptation;
and 5) extracting keywords from the recognized text and generating content notes.
Preferably, the collection uses devices such as a microphone and a camera to record the audio and video of the speech, which are simultaneously cached to the PC over a wired or wireless network;
the PC performs speech enhancement on the audio to remove noise, and compresses the audio and video.
The voice segmentation means that the server performs voice activity detection on the received audio and segments it at speech pauses; segmenting by speaker means identifying the speaker of each speech segment and dividing the audio accordingly.
The acoustic adaptation includes adaptation to the recording environment, the noise type, the speaker type, and the like;
the language model adaptation includes adaptation to professional vocabulary in courseware and related reading materials.
The extraction comprises: extracting keywords related to the speech content from the speech-recognized text, and extracting notes related to the speech according to the relevance of each sentence in the text to the speech content.
The invention also relates to a speech content extraction device implementing the above method, comprising: a speech recording module, a material sending module, a voice segmentation module, a voice recognition module, and a keyword and content note extraction module. The speech recording module collects the speech audio and video, caches them to a personal computer (PC) in the classroom, and preprocesses them; the material sending module sends the preprocessed audio and video, together with related data including lecture slides and related reading materials, to a server; the voice segmentation module performs voice segmentation on the received audio and segments it by speaker; the voice recognition module performs automatic speech recognition, using acoustic adaptation and language model adaptation, to convert the segmented audio into text; and the keyword and content note extraction module extracts keywords from the recognized text and generates content notes.
The speech recording module collects the audio and video of the speech with devices such as a microphone and a camera, simultaneously caches them to a PC (personal computer) over a wired or wireless network, performs speech enhancement on the audio with the PC to remove noise, and compresses the audio and video.
The voice segmentation performs voice activity detection on the received audio and segments it at speech pauses; the speaker-based segmentation identifies the speaker of each speech segment and divides the audio accordingly.
The voice recognition module obtains the text corresponding to each sentence of audio using automatic speech recognition; the acoustic adaptation adapts to the recording environment, the noise type, the speaker type, and the like, while the language model adaptation adapts to professional vocabulary in lecture slides and related reading materials.
The keyword and content note extraction module extracts keywords related to the speech content from the speech-recognized text, and extracts notes related to the speech according to the relevance of each sentence in the text to the speech content.
Technical effects
Compared with the prior art, the invention recognizes audio into a repeatedly readable text form through speech recognition and improves recognition accuracy using language model adaptation and acoustic model adaptation. Knowledge integration is also performed, so that time is not spent reading redundant information.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the structure of the device of the present invention.
Detailed Description
Example 1
The embodiment comprises the following steps:
101. collecting audio and video of a speech, caching the collected audio and video into a PC (personal computer), and preprocessing the audio and video;
In the embodiment of the invention, collecting the audio and video of the speech, caching them to the PC, and preprocessing comprise: recording the audio and video of the speech with devices such as a microphone and a camera, and simultaneously caching them to the PC over a wired or wireless network; the PC then performs speech enhancement on the audio to remove noise, and compresses the audio and video.
102. Sending the preprocessed audio and video, together with related data including lecture slides and related reading materials, to a server;
103. the server performs voice segmentation on the received audio and segments the audio according to speakers;
In the embodiment of the invention, the voice segmentation means that the server performs voice activity detection on the received audio and segments it at speech pauses; segmenting by speaker means identifying the speaker of each speech segment and dividing the audio accordingly.
104. Performing automatic speech recognition to convert the segmented audio into text, the recognition using acoustic adaptation and language model adaptation;
In the embodiment of the invention, the acoustic adaptation includes adaptation to the recording environment, the noise type, the speaker type, and the like; the language model adaptation includes adaptation to professional vocabulary in lecture slides and related reading materials.
105. Extracting keywords from the recognized text and generating content notes.
In this embodiment, extracting keywords and generating content notes from the recognized text comprises: extracting keywords related to the speech content from the speech-recognized text, and extracting notes related to the speech according to the relevance of each sentence in the text to the speech content.
Example 2
As shown in fig. 2, a schematic structural diagram of a speech content extraction device based on a cloud platform according to an embodiment of the present invention is provided, where the device includes: a lecture recording module 21, a material sending module 22, a voice segmentation module 23, a voice recognition module 24, and a keyword and content note extraction module 25.
The speech recording module 21 is used for acquiring speech audio and video, caching the acquired audio and video into a PC (personal computer) in a classroom, and preprocessing the audio and video;
the speech recording module 21 is used for acquiring audio and video of the speech by using a microphone, a camera and other devices, simultaneously caching the audio and video into a PC (personal computer) by using a wired or wireless network, performing voice enhancement on the audio by using the PC to remove noise, and compressing the audio and video.
For example, a video camera records a deep learning course; the teacher wears a bodypack microphone, and students answering questions use a wireless microphone. The recorded video and audio are cached to a personal computer (PC) in the classroom, background sounds such as air-conditioner noise and construction noise are removed with filtering methods such as adaptive cancellation, and the audio and video are compressed so that the file size is suitable for network transmission.
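By way of illustration, a minimal sketch of the adaptive-cancellation idea in Python (the patent names the technique but not a concrete algorithm; the LMS filter, filter length, and step size below are assumptions):

```python
import numpy as np

def lms_denoise(primary, reference, filter_len=64, mu=0.01):
    """Adaptive noise cancellation with an LMS filter (illustrative sketch).

    primary   -- microphone signal containing speech plus noise
    reference -- noise-only reference signal (e.g. a second microphone)
    Returns an estimate of the enhanced speech.
    """
    w = np.zeros(filter_len)                     # adaptive filter weights
    out = np.zeros(len(primary))
    for n in range(filter_len, len(primary)):
        x = reference[n - filter_len:n][::-1]    # most recent reference samples
        noise_est = w @ x                        # estimated noise component
        e = primary[n] - noise_est               # error signal = enhanced speech
        w += 2 * mu * e * x                      # LMS weight update
        out[n] = e
    return out
```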
The material sending module 22 is configured to send the preprocessed audio/video and the relevant data including the lecture slides and the relevant reading materials to the server.
Specifically, the enhanced and compressed audio and video, the deep learning slides, the deep learning reading materials, and the like are transmitted to an HTTP server.
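A minimal sketch of this transmission step, assuming a hypothetical endpoint URL and form-field names (the patent states only that the materials are sent to an HTTP server):

```python
import requests

# Hypothetical endpoint and field names, for illustration only.
SERVER_URL = "http://lecture-server.example/upload"

paths = {
    "audio": "lecture_audio.mp3",
    "video": "lecture_video.mp4",
    "slides": "deep_learning_slides.pdf",
    "reading": "deep_learning_reading.pdf",
}
files = {field: open(path, "rb") for field, path in paths.items()}
try:
    response = requests.post(SERVER_URL, files=files)  # multipart upload
    response.raise_for_status()
finally:
    for f in files.values():
        f.close()
```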
The voice segmentation module 23 is configured to perform voice segmentation on the received audio and segment the audio according to the speaker.
The voice segmentation in the voice segmentation module 23 performs voice activity detection on the received audio and segments it at speech pauses; the speaker-based segmentation identifies the speaker of each speech segment and divides the audio accordingly.
Specifically, the parts containing speech are cut out according to short-time energy and zero-crossing rate detection, and an i-vector is extracted for each speech segment to identify whether the speaker is the teacher or one of the students.
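A simplified sketch of this detection step (frame length and thresholds are illustrative assumptions; the rule below keeps high-energy, low-zero-crossing frames, i.e. it targets voiced speech, whereas a production detector would also handle unvoiced sounds):

```python
import numpy as np

def vad_segments(signal, sr, frame_ms=25, energy_thresh=0.01, zcr_thresh=0.15):
    """Return (start, end) sample ranges that contain speech.

    Assumes a float waveform normalized to [-1, 1].
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                         # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    segments, start = [], None                # merge consecutive speech frames
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i * frame_len
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```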
The speech recognition module 24 is configured to perform automatic speech recognition to convert the segmented audio into text, where the speech recognition uses acoustic adaptation and language model adaptation.
The voice recognition module 24 is configured to obtain the text corresponding to each sentence of audio using automatic speech recognition; the acoustic adaptation adapts to the recording environment, the noise type, the speaker type, and the like, while the language model adaptation adapts to professional vocabulary in lecture slides and related reading materials.
Specifically, during acoustic model training the audio is clustered by i-vector and a deep-neural-network-based acoustic model is trained on the audio of each cluster; during recognition, the cluster closest to the utterance's i-vector is found and that cluster's acoustic model is used.
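A minimal sketch of the cluster-selection step at recognition time, assuming cosine similarity between i-vectors (the patent does not fix the distance measure):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pick_acoustic_model(utt_ivector, centroids, models):
    """Return the DNN acoustic model of the i-vector cluster closest
    to the utterance; `centroids` and `models` are parallel lists."""
    best = max(range(len(centroids)),
               key=lambda k: cosine(utt_ivector, centroids[k]))
    return models[best]
```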
The inverse document frequency of each word is estimated from massive amounts of text, and TF-IDF is used to find the keywords in the deep learning courseware and extended reading. For example, given the extended reading "Gradient descent (GD) is a common method to minimize the risk function and the loss function; stochastic gradient descent and batch gradient descent are two iterative solution ideas. Batch gradient descent minimizes the loss function over all training samples, so the final solution is a global optimum, i.e., the solved parameters minimize the risk function. Stochastic gradient descent minimizes the loss function of each sample; although the loss function of a single iteration does not head straight toward the global optimum, the overall direction does, and the final result usually lies near the global optimum.", the keywords "gradient descent", "stochastic gradient descent", "batch gradient descent", "loss function", and so on can be extracted, while common words such as "common method", "one", and "minimize" are not listed as keywords because their TF-IDF weights are too low.
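A small sketch of this keyword extraction under stated assumptions (tokenization, smoothing, and the background corpus used to estimate inverse document frequency are illustrative choices not fixed by the patent):

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, background_docs, top_k=10):
    """Rank the words of doc_tokens by TF-IDF; IDF is estimated from a
    large background corpus so that everyday words score low."""
    n_docs = len(background_docs)
    df = Counter()                       # document frequency of each word
    for doc in background_docs:
        df.update(set(doc))
    tf = Counter(doc_tokens)
    scores = {
        w: (c / len(doc_tokens)) * math.log((n_docs + 1) / (df[w] + 1))
        for w, c in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```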
When a recurrent-neural-network-based language model is used to calculate the perplexity of a sentence, assuming that the model parameter is θ, the original calculation formula of the perplexity is: $\mathrm{PPL}(W)=\exp\big(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i\mid w_1,\dots,w_{i-1};\theta)\big)$, wherein: N is the length of the sentence. For the keywords in this field, the perplexity can be rewritten as: $\mathrm{PPL}'(W)=\exp\big(-\frac{1}{N}\sum_{i=1}^{N}\ln\,[\,P(w_i\mid w_1,\dots,w_{i-1};\theta)+\lambda\,q(w_i)\,]\big)$
When w_i is a keyword of this field, q(w_i) is 1; otherwise it is 0. λ is a hyperparameter. This method improves the recognition rate of professional vocabulary.
The keyword and content note extraction module 25 is used by the server to extract keywords from the text and generate content notes. Specifically, it extracts keywords related to the speech content from the speech-recognized text, and extracts notes related to the speech according to the relevance of each sentence in the text to the speech content.
For example, suppose the speech-recognized text is: "For many machine learning algorithms, including linear regression, logistic regression, neural networks and so on, the algorithm is implemented by defining some cost function or optimization target and then using gradient descent as the optimization algorithm to find the minimum of the cost function. When our training set is large, the batch gradient descent algorithm becomes very computationally expensive. Suppose you have ten million pictures of cats: running batch gradient descent once is equivalent to looking through all ten million pictures, so we need a less time-consuming way to find the characteristics of most cats. In this course we introduce an approach different from batch gradient descent: stochastic gradient descent."
Similarly, through TF-IDF analysis, the words "gradient descent", "stochastic gradient descent", and "neural network", which rarely appear in everyday text but appear often in the recognition result, are taken as keywords, and their TF-IDF weights are obtained.
Then the weight of each sentence is computed as the average TF-IDF weight of the words in the sentence, and the highest-weighted sentences are output as the content note: "For many machine learning algorithms, including linear regression, logistic regression, neural networks and so on, the algorithm is implemented by defining some cost function or optimization target and then using gradient descent as the optimization algorithm to find the minimum of the cost function. When our training set is large, the batch gradient descent algorithm becomes very computationally expensive. In this course we introduce an approach different from batch gradient descent: stochastic gradient descent."
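A minimal sketch of this note-generation step (whitespace tokenization and the number of sentences kept are illustrative assumptions):

```python
def extract_notes(sentences, tfidf_weight, top_n=3):
    """Score each sentence by the mean TF-IDF weight of its words and
    return the top_n sentences, kept in their original order."""
    def score(sentence):
        words = sentence.split()
        return sum(tfidf_weight.get(w, 0.0) for w in words) / max(len(words), 1)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```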
The device provided by the embodiment of the invention recognizes audio into a repeatedly readable text form through speech recognition and improves recognition accuracy using language model adaptation and acoustic model adaptation. Knowledge integration is also performed, so that time is not spent reading redundant information.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (1)

1. A speech content extraction device based on a cloud platform, characterized by comprising: a speech recording module, a material sending module, a voice segmentation module, a voice recognition module, and a keyword and content note extraction module, wherein the speech recording module is used for collecting the speech audio and video, caching the collected audio and video to a personal computer (PC) in a classroom, and preprocessing them;
the acquisition comprises the following steps: collecting audio and video of a speech by using a microphone and a camera, and simultaneously caching the audio and video into a PC (personal computer) by using a wired or wireless network; carrying out voice enhancement on the audio by using a PC (personal computer) to remove noise, and compressing the audio and the video;
the voice segmentation mode is that the server detects voice activity of the received audio and segments the audio according to voice pause; the voice is divided according to the speaker, namely, the speaker of each section of voice is identified, and the audio is divided according to the speaker;
the acoustic self-adaptation comprises adaptation to a recording environment, a noise type and a speaker type; the language model self-adaptation comprises the adaptation to professional vocabularies in lecture slides and related reading materials;
the extraction comprises the following steps: extracting keywords related to the speech content in the text subjected to voice recognition, and extracting notes related to the speech according to the relevance of each sentence in the text to the speech content;
the speech recording module collects the audio and video of the speech through a microphone and a camera, the audio and video are simultaneously cached in a PC (personal computer) by utilizing a wired or wireless network, the PC is used for carrying out voice enhancement on the audio to remove noise, and the audio and video are compressed;
the voice segmentation detects voice activity of the received audio and performs segmentation according to voice pause; the voice is divided according to the speaker and is used for identifying the speaker of each section of voice, and the audio is divided according to the speaker, and the method specifically comprises the following steps: detecting and cutting out a part with voice according to the short-time energy and the zero crossing rate, and extracting an i-vector of each section of voice to identify a speaker as a teacher and different students;
the voice recognition module is used for obtaining a text corresponding to each sentence of audio by using automatic voice recognition, and the acoustic self-adaption is used for adapting to a recording environment, a noise type and a speaker type; the language model self-adaptation is used for adapting to professional vocabularies in lecture slides and related reading materials, and specifically comprises the following steps: clustering the audio according to i-vector during model training, training an acoustic model based on a deep neural network for the audio of each cluster, finding the closest cluster of the i-vector during audio identification, and using the acoustic model of the cluster;
the keyword and content note extraction module is used for extracting keywords related to the speech content in the speech recognition text, and extracting notes related to the speech according to the relevance of each sentence in the text to the speech content, and specifically comprises the following steps: the TF-IDF is used for counting keywords in deep learning courseware and extended reading, and the recursive neural network-based language model is used for calculating the complexity of the keywords in the field:wherein: the model parameter is theta, N is the length of the sentence, when wiThe key word for this field, then q (w)i) Is 1, otherwise 0, λ is a hyperparameter.
CN201610260647.7A 2016-04-25 2016-04-25 Speech content extraction method and device based on cloud platform Active CN105957531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610260647.7A CN105957531B (en) 2016-04-25 2016-04-25 Speech content extraction method and device based on cloud platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610260647.7A CN105957531B (en) 2016-04-25 2016-04-25 Speech content extraction method and device based on cloud platform

Publications (2)

Publication Number Publication Date
CN105957531A CN105957531A (en) 2016-09-21
CN105957531B true CN105957531B (en) 2019-12-31

Family

ID=56915289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610260647.7A Active CN105957531B (en) 2016-04-25 2016-04-25 Speech content extraction method and device based on cloud platform

Country Status (1)

Country Link
CN (1) CN105957531B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108335693B (en) * 2017-01-17 2022-02-25 腾讯科技(深圳)有限公司 Language identification method and language identification equipment
CN108022583A (en) * 2017-11-17 2018-05-11 平安科技(深圳)有限公司 Meeting summary generation method, application server and computer-readable recording medium
CN107818797B (en) * 2017-12-07 2021-07-06 苏州科达科技股份有限公司 Voice quality evaluation method, device and system
CN108256512A (en) * 2018-03-22 2018-07-06 长春大学 Listen the raw inclusive education classroom auxiliary system of barrier and device
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN108921284B (en) * 2018-06-15 2020-11-17 山东大学 Interpersonal interaction limb language automatic generation method and system based on deep learning
CN109582823A (en) * 2018-11-21 2019-04-05 平安科技(深圳)有限公司 Video information chain type storage method, device, computer equipment and storage medium
CN109934188B (en) * 2019-03-19 2020-10-30 上海大学 Slide switching detection method, system, terminal and storage medium
CN111723816B (en) * 2020-06-28 2023-10-27 北京联想软件有限公司 Acquisition method of teaching notes and electronic equipment
CN111897918A (en) * 2020-07-28 2020-11-06 扬州大学 Online teaching classroom note generation method
CN111932964A (en) * 2020-08-21 2020-11-13 扬州大学 Online live broadcast teaching method
CN112767753B (en) * 2021-01-08 2022-07-22 中国石油大学胜利学院 Supervision type intelligent online teaching system and action method thereof
CN113409632A (en) * 2021-07-20 2021-09-17 国网安徽省电力有限公司培训中心 Classroom is teaching machine for speech recognition
CN114501112B (en) * 2022-01-24 2024-03-22 北京百度网讯科技有限公司 Method, apparatus, device, medium, and article for generating video notes

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4127668B2 (en) * 2003-08-15 2008-07-30 株式会社東芝 Information processing apparatus, information processing method, and program
CN102280106A (en) * 2010-06-12 2011-12-14 三星电子株式会社 VWS method and apparatus used for mobile communication terminal
CN101923854B (en) * 2010-08-31 2012-03-28 中国科学院计算技术研究所 Interactive speech recognition system and method
CN103165130B (en) * 2013-02-06 2015-07-29 程戈 Speech text coupling cloud system
CN105159870B (en) * 2015-06-26 2018-06-29 徐信 A kind of accurate processing system and method for completing continuous natural-sounding textual
CN105427858B (en) * 2015-11-06 2019-09-03 科大讯飞股份有限公司 Realize the method and system that voice is classified automatically

Also Published As

Publication number Publication date
CN105957531A (en) 2016-09-21

Similar Documents

Publication Publication Date Title
CN105957531B (en) Speech content extraction method and device based on cloud platform
CN110674339B (en) Chinese song emotion classification method based on multi-mode fusion
CN106328147B (en) Speech recognition method and device
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
CN107154264A (en) The method that online teaching wonderful is extracted
CN111785275A (en) Voice recognition method and device
CN107943786B (en) Chinese named entity recognition method and system
Boishakhi et al. Multi-modal hate speech detection using machine learning
Ding et al. Audio-visual keyword spotting based on multidimensional convolutional neural network
CN111145903A (en) Method and device for acquiring vertigo inquiry text, electronic equipment and inquiry system
CN111180025A (en) Method and device for representing medical record text vector and inquiry system
CN111510765A (en) Audio label intelligent labeling method and device based on teaching video
Chen et al. Towards unsupervised automatic speech recognition trained by unaligned speech and text only
Zhu et al. Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection.
CN112951237B (en) Automatic voice recognition method and system based on artificial intelligence
CN113555133A (en) Medical inquiry data processing method and device
KR20170086233A (en) Method for incremental training of acoustic and language model using life speech and image logs
CN117116251A (en) Repayment probability assessment method and device based on collection-accelerating record
Singh et al. Speaker Recognition Assessment in a Continuous System for Speaker Identification
Dua et al. Gujarati language automatic speech recognition using integrated feature extraction and hybrid acoustic model
Yu Research on music emotion classification based on CNN-LSTM network
Jeyasheeli et al. Deep learning methods for suicide prediction using audio classification
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
Xiao A comparative study on speaker gender identification using mfcc and statistical learning methods

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200617

Address after: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Address before: No. 800 Dongchuan Road, Shanghai, 200240

Patentee before: SHANGHAI JIAO TONG University

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20201028

Address after: Building 14, Tengfei Innovation Park, no.388 Xinping street, Suzhou Industrial Park, Jiangsu Province, 215000

Patentee after: AI SPEECH Ltd.

Address before: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 215000 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215000 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee before: AI SPEECH Ltd.

CP01 Change in the name or title of a patent holder