CN104090955A

CN104090955A - Automatic audio/video label labeling method and system

Info

Publication number: CN104090955A
Application number: CN201410320555.4A
Authority: CN
Inventors: 徐玉林; 王政; 钟锟; 胡国亮; 梁昭; 张建华; 王丽红; 郭强
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2014-07-07
Filing date: 2014-07-07
Publication date: 2014-10-08

Abstract

The invention relates to the technical field of information labeling and discloses a method and system for automatically labeling audio/video. The method comprises: crawling in advance the knowledge points and vocabulary of each subject and building a corresponding subject knowledge graph; transcribing the audio extracted from the audio/video to be labeled into text, using the subject vocabulary as a hotword resource; extracting keywords from the text and determining the subject and knowledge point to which the audio/video belongs according to the association between the keywords and the knowledge graph; and creating labels for the audio/video, the labels comprising the keywords and the subject and knowledge point to which the audio/video belongs. The method and system mine the content of audio/video resources thoroughly, label them automatically, reduce manual intervention, and provide a sound basis for downstream services such as resource push.

Description

Method and system for automatic labeling of audio and video

Technical field

The present invention relates to the technical field of information labeling, and in particular to a method and system for automatic labeling of audio and video.

Background art

With the rapid growth of the internet and education cloud platforms, educational resources have become abundant but uneven in quality. Teachers and students can judge whether a resource meets their needs from a small amount of metadata such as its title, but this approach depends on the metadata, and even a typo in the title can mislead the user. Alternatively, the user may have to play through the entire audio or video to determine whether its content is what they need, which is time-consuming. Clearly, this traditional way of obtaining audio/video cannot meet the current demand for quickly finding suitable resources among the massive volume available online.

Automatic labeling of educational resources, and of audio/video in particular, helps in two ways. On the one hand, the labels expose the actual content of a resource and compensate for insufficient metadata, so users can grasp the substance without playing through the whole audio or video. On the other hand, the labels greatly benefit resource push. Automatic labeling is therefore significant for the transformation of modern teaching.

Summary of the invention

Embodiments of the present invention provide a method and system for automatic labeling of audio and video that let users accurately grasp the content of an audio/video resource without playing through it, reduce the amount of manual work, and provide a more accurate basis for downstream services such as resource recommendation.

To this end, the invention provides the following technical solution:

A method for automatic labeling of audio and video, comprising:

Crawling in advance the knowledge points and vocabulary of each subject, and building a corresponding subject knowledge graph;

Transcribing the audio extracted from the audio or video to be labeled into text, using the subject vocabulary as a hotword resource;

Extracting keywords from the text, and determining the subject and knowledge point to which the audio or video belongs according to the association between the keywords and the knowledge graph;

Creating labels for the audio or video, the labels comprising the keywords and the subject and knowledge point to which the audio or video belongs.

Preferably, there are one or more keywords.

Preferably, extracting keywords from the text comprises:

Segmenting the text into words;

Calculating the TF-IDF value of each word;

Taking the words whose TF-IDF value exceeds a set threshold as keywords, or taking the set number of top-ranked words, in descending order of TF-IDF value, as keywords.

Preferably, the association between the keywords and the knowledge graph comprises the positions at which, and the number of times, the keywords occur in the knowledge graph.

Preferably, the method further comprises:

Counting the number of times users select each label, and adding, deleting, or replacing labels according to the selection counts.

A system for automatic labeling of audio and video, comprising:

A crawling module, configured to crawl in advance the knowledge points and vocabulary of each subject;

A graph construction module, configured to build the subject knowledge graph corresponding to the subject knowledge points and subject vocabulary;

A transcription module, configured to transcribe the audio extracted from the audio or video to be labeled into text, using the subject vocabulary as a hotword resource during transcription;

A keyword extraction module, configured to extract keywords from the text;

An information determination module, configured to determine the subject and knowledge point to which the audio or video belongs according to the association between the keywords and the knowledge graph;

A label creation module, configured to create labels for the audio or video, the labels comprising the keywords and the subject and knowledge point to which the audio or video belongs.

Preferably, there are one or more keywords.

Preferably, the keyword extraction module comprises:

A word segmentation unit, configured to segment the text into words;

A computing unit, configured to calculate the TF-IDF value of each word;

An extraction unit, configured to extract as keywords the words whose TF-IDF value exceeds a set threshold, or the set number of top-ranked words in descending order of TF-IDF value.

Preferably, the system further comprises:

An optimization module, configured to count the number of times users select each label, and to add, delete, or replace labels according to the selection counts.

The method and system provided by the embodiments of the present invention use speech transcription technology and abundant internet data to transcribe audio/video resources, extract keywords, and determine the subject and knowledge point to which the audio or video belongs from the keywords and the knowledge graph. Labels are thus assigned automatically and manual work is reduced. The labels provide a sound basis for downstream services such as resource push, and help teachers and students find high-quality teaching resources promptly.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are merely some of the embodiments recorded in the present invention; those of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flowchart of the method for automatic labeling of audio and video according to an embodiment of the present invention;

Fig. 2 is a simple example of a subject knowledge graph built in an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the system for automatic labeling of audio and video according to an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

With the rapid growth of the internet and education cloud platforms, audio/video resources of all kinds have become abundant but uneven in quality. In the prior art, users judge whether a resource meets their needs from a small amount of metadata such as its title; this approach depends on the metadata, and even a typo in the title can mislead the user. Alternatively, the user may have to play through the entire audio or video to determine whether its content is what they need, which is time-consuming. For this reason, embodiments of the present invention provide a method and system for automatic labeling of audio and video, so that users can grasp the substance of a resource without playing through it. First, the knowledge points and vocabulary of each subject are crawled with tools such as a web crawler, and a corresponding subject knowledge graph is built. Next, the audio extracted from the audio or video to be labeled is transcribed into text, using the subject vocabulary as a hotword resource. Then, keywords are extracted from the transcribed text, and the subject and knowledge point to which the audio or video belongs are determined according to the association between the keywords and the knowledge graph. Finally, labels are created for the audio or video, comprising the keywords and the subject and knowledge point to which it belongs.

Fig. 1 is a flowchart of the method for automatic labeling of audio and video according to an embodiment of the present invention. The method comprises the following steps:

Step 101: crawl in advance the knowledge points and vocabulary of each subject, and build a corresponding subject knowledge graph.

A knowledge graph, also called a knowledge map, is a family of diagrams that take scientific knowledge as their object of quantitative study and display the development and structure of knowledge. It uses visualization techniques to describe human knowledge resources and their carriers, and to map, mine, analyze, and display scientific knowledge and the interrelationships within it. The hyperlinks found everywhere on the internet are a simple form of knowledge graph. In embodiments of the present invention, a subject knowledge graph comprises the knowledge points specific to a subject and the relationships among them. Its role is to show how the vocabulary of the subject is related, which is crucial for predicting the subject of a resource and for resource push.

In practice, subject knowledge points and subject vocabulary can first be crawled with tools such as a web crawler, for example the knowledge point "buoyancy" in physics. Then, taking a subject knowledge point as the starting point, a vertical search engine is used to obtain a list of words associated with that knowledge point. A vertical search engine is a specialized search engine for a particular industry, a refinement and extension of general search engines: it consolidates a particular class of information from the web page library, extracts the required data field by field, processes it, and returns it to the user in a defined form. For each word, its encyclopedia content and encyclopedia entry tags are obtained to judge whether the word belongs to the subject's vocabulary. Traversing the vocabulary depth-first in this way yields the knowledge graph of the subject. Fig. 2 shows a simple example of a subject knowledge graph built in an embodiment of the present invention.
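The depth-first construction described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `related_terms` and `is_subject_term` are hypothetical callables standing in for the vertical search lookup and the encyclopedia-based subject check, and `max_depth` is an assumed cutoff that the source does not mention.

```python
from collections import defaultdict


def build_subject_graph(seed, related_terms, is_subject_term, max_depth=3):
    """Build an undirected term graph by depth-first traversal from a seed.

    related_terms(term) -> iterable of candidate terms (hypothetical stand-in
    for the vertical search lookup).
    is_subject_term(term) -> bool (hypothetical stand-in for the check against
    encyclopedia content and entry tags).
    """
    graph = defaultdict(set)
    visited = set()

    def visit(term, depth):
        visited.add(term)
        if depth >= max_depth:  # assumed cutoff so the traversal terminates
            return
        for candidate in related_terms(term):
            if not is_subject_term(candidate):
                continue  # discard words that do not belong to the subject
            graph[term].add(candidate)
            graph[candidate].add(term)
            if candidate not in visited:
                visit(candidate, depth + 1)

    visit(seed, 0)
    return dict(graph)


# Toy data in place of real crawl results.
TOY_RELATED = {
    "buoyancy": ["density", "Archimedes' principle", "swimming pool"],
    "density": ["mass", "volume"],
}
TOY_PHYSICS = {"buoyancy", "density", "Archimedes' principle", "mass", "volume"}

physics_graph = build_subject_graph(
    "buoyancy",
    lambda term: TOY_RELATED.get(term, []),
    lambda term: term in TOY_PHYSICS,
)
```

In this toy run, "swimming pool" is returned by the lookup but rejected by the subject check, so it never enters the graph.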

Step 102: transcribe the audio extracted from the audio or video to be labeled into text, using the subject vocabulary as a hotword resource.

Specifically, existing speech transcription technology can be used to transcribe the audio extracted from the audio or video to be labeled into text. However, owing to the complexity of Chinese, the accuracy of conventional speech transcription is generally low and must improve substantially before it meets practical needs. For educational audio/video containing a large amount of specialized vocabulary, transcription accuracy may be lower still.

For this reason, in embodiments of the present invention, the crawled specialized vocabulary is used as a hotword resource during speech transcription. Speech decoding can use conventional acoustic and language models, and transcription accuracy can be greatly improved without modifying the current models.
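The source does not say how the hotwords enter the decoder. One possibility, shown here purely as an illustration and not as the patented mechanism, is to rescore candidate transcripts so that hypotheses containing subject vocabulary are favored, leaving the acoustic and language models untouched. The function name, the `(tokens, score)` hypothesis format, and the `bonus` value are all assumptions.

```python
def rescore_with_hotwords(hypotheses, hotwords, bonus=1.0):
    """Pick the best transcript after adding a bonus per hotword occurrence.

    hypotheses: list of (tokens, score) pairs, where a higher score is better
    (for example a log-probability from an unmodified decoder).
    """
    hot = set(hotwords)

    def boosted(item):
        tokens, score = item
        return score + bonus * sum(1 for token in tokens if token in hot)

    return max(hypotheses, key=boosted)[0]


candidates = [
    (["the", "boy", "and", "sea", "force"], -10.0),  # acoustically likelier
    (["the", "buoyancy", "force"], -10.4),           # contains a hotword
]
best = rescore_with_hotwords(candidates, {"buoyancy", "density"})
```

With these toy scores the second hypothesis wins after the boost, which mirrors the intended effect: domain terms are recovered without retraining either model.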

Step 103: extract keywords from the text, and determine the subject and knowledge point to which the audio or video belongs according to the association between the keywords and the knowledge graph.

Specifically, to extract keywords, the text is first segmented into words. The TF-IDF (term frequency-inverse document frequency) value of each word is then calculated from the frequency with which the word occurs in the current document and the frequency with which it occurs across a large collection of documents. Finally, the TF-IDF value of each word is used to judge whether the word can serve as a keyword of the text.

In this embodiment, the keywords of the text can be determined from the TF-IDF values in several ways, for example:

(1) Threshold method: a TF-IDF threshold (for example 0.202) is set first, and the words in the text whose TF-IDF value exceeds the threshold are taken as keywords. Under the same threshold, different texts may yield different numbers of keywords.

(2) Fixed-number method: the number of keywords to extract (for example 5) is set first, and that many words are then taken as keywords in descending order of TF-IDF value.
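The two selection methods above can be sketched as follows. This is a minimal sketch under stated assumptions: the text is taken to be already segmented into tokens (a real system for Chinese would need a word segmenter), the reference corpus is a toy list, and the particular formula, tf = count / document length and idf = log(N / (1 + df)), is one common variant chosen for illustration; the source does not specify which variant is used.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {token: TF-IDF value} for one tokenized document.

    corpus is a list of tokenized reference documents used for the IDF term.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log(n_docs / (1 + df))  # assumed smoothing variant
        scores[token] = tf * idf
    return scores


def keywords_by_threshold(scores, threshold):
    """Method (1): keep every token whose value exceeds the threshold."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [token for token, value in ranked if value > threshold]


def keywords_top_n(scores, n):
    """Method (2): keep the n highest-valued tokens."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [token for token, _ in ranked[:n]]


corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "flew"],
    ["the", "buoyancy", "force"],
]
doc = ["the", "buoyancy", "force", "the", "buoyancy", "the"]
scores = tf_idf(doc, corpus)
print(keywords_by_threshold(scores, 0.2))  # ['buoyancy']
print(keywords_top_n(scores, 2))           # ['buoyancy', 'force']
```

Note how "the" is the most frequent token in the document yet scores lowest, because it occurs in every reference document. The threshold method returns a variable number of keywords, while the fixed-number method always returns at most `n`.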

The accuracy of keywords extracted with TF-IDF depends only weakly on the accuracy of speech transcription: even when transcription accuracy is below 50%, TF-IDF can still extract accurate keyword information.

Note that one or more words in the text can be extracted as keywords; the number of keywords (typically 3 to 5) can be set according to user needs.

Once the keywords have been determined, the subject and knowledge point to which the audio or video belongs are determined according to the association between the keywords and the subject knowledge graphs. For example, if the extracted keywords occur most often at the knowledge point "solving equations" in the mathematics knowledge graph, the audio or video is determined to belong to the subject mathematics and the knowledge point solving equations.
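A rough sketch of this lookup is given below. It assumes, for illustration only, that each subject's knowledge graph has been flattened to a mapping from knowledge point to its associated terms, and it scores a knowledge point simply by how many keywords occur there. The source also mentions the position of the keywords in the graph, which this sketch ignores.

```python
def determine_subject(keywords, graphs):
    """Return the (subject, knowledge_point) pair with the most keyword hits.

    graphs: {subject: {knowledge_point: iterable of associated terms}}
    (an assumed, simplified representation of the subject knowledge graphs).
    Returns None when no keyword occurs anywhere.
    """
    best, best_hits = None, 0
    for subject, points in graphs.items():
        for point, terms in points.items():
            term_set = set(terms) | {point}
            hits = sum(1 for keyword in keywords if keyword in term_set)
            if hits > best_hits:
                best, best_hits = (subject, point), hits
    return best


graphs = {
    "mathematics": {
        "solving equations": ["equation", "unknown", "root", "coefficient"],
        "geometry": ["triangle", "angle"],
    },
    "physics": {
        "buoyancy": ["density", "fluid", "force"],
    },
}
keywords = ["equation", "root", "unknown", "force", "angle"]
print(determine_subject(keywords, graphs))  # ('mathematics', 'solving equations')
```

Three of the five keywords fall under "solving equations", against one each for "geometry" and "buoyancy", so mathematics and solving equations are selected.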

Step 104: create labels for the audio or video, the labels comprising the keywords and the subject and knowledge point to which the audio or video belongs.

After the keywords have been extracted and the subject and knowledge point determined, the keywords, subject, and knowledge point are automatically attached to the audio or video as its labels. For example, if 5 keywords are extracted, the labels comprise the 5 keywords, the subject, and the knowledge point, 7 labels in total.
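The label count in this example follows directly from how the label set is assembled. A minimal sketch, with a hypothetical function name and toy values:

```python
def build_labels(keywords, subject, knowledge_point):
    """Return the label list: every keyword, then the subject and knowledge point."""
    return list(keywords) + [subject, knowledge_point]


labels = build_labels(
    ["equation", "unknown", "root", "coefficient", "solution"],
    "mathematics",
    "solving equations",
)
print(len(labels))  # 7
```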

The method provided by the embodiments of the present invention uses speech transcription technology and abundant internet data to transcribe audio/video resources, extract keywords, and determine the subject and knowledge point from the knowledge graph. It mines the content of audio/video resources thoroughly, helps surface high-quality resources promptly, and provides a sound basis for downstream services such as resource push.

To further optimize the automatically assigned labels so that they better reflect the substance of the audio or video, another embodiment of the method further comprises: counting the number of times users select each label, and adding, deleting, or replacing labels according to the selection counts. For example, labels whose selection count is above a set count threshold are retained, and labels whose selection count is below the threshold are deleted or replaced. The hotword resource and the knowledge graph can then be refined according to the labels deleted or added, and improved labels can be created again.

For example, for a transcribed text, the system might produce the labels "gravity, universal gravitation, mass, Newton, experiment, physics, Newton's law". If user feedback shows that more than 90% of users reject the label "experiment" but accept the others, then when the knowledge graph and hotword resource are optimized, the weight of "experiment" in the corpus is first lowered; the knowledge graph is then searched to associate the remaining words, linking "universal gravitation, mass, Newton, physics, Newton's law" with "gravity" and recording the degree of association. As user feedback accumulates, the knowledge graph becomes richer and more accurate.
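The selection-count step of this feedback loop can be sketched as below. The sketch covers only the retain/delete decision; it does not model the reweighting of the corpus or the updating of the knowledge graph. The function name is hypothetical, and treating a count exactly equal to the threshold as retained is an assumption, since the source only addresses counts above and below it.

```python
from collections import Counter


def prune_labels(labels, selections, min_count):
    """Split labels into (kept, dropped) by how often users selected them.

    selections: one entry per user selection event, naming the chosen label.
    Labels selected at least min_count times are kept (assumed tie rule).
    """
    counts = Counter(selections)
    kept = [label for label in labels if counts[label] >= min_count]
    dropped = [label for label in labels if counts[label] < min_count]
    return kept, dropped


labels = ["gravity", "universal gravitation", "mass", "Newton",
          "experiment", "physics", "Newton's law"]
# Toy feedback: every label except "experiment" is selected ten times.
selections = [label for label in labels if label != "experiment"] * 10
selections.append("experiment")

kept, dropped = prune_labels(labels, selections, min_count=5)
print(dropped)  # ['experiment']
```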

Correspondingly, an embodiment of the present invention also provides a system for automatic labeling of audio and video. Fig. 3 is a schematic structural diagram of the system.

In this embodiment, the system comprises a crawling module 201, a graph construction module 202, a transcription module 203, a keyword extraction module 204, an information determination module 205, and a label creation module 206, where:

The crawling module 201 is configured to crawl in advance the knowledge points and vocabulary of each subject.

In practice, the crawling module 201 can first crawl subject knowledge points and subject vocabulary with tools such as a web crawler and then, taking a subject knowledge point as the starting point, obtain a list of words associated with that knowledge point through Baidu vertical search.

The graph construction module 202 is configured to build the subject knowledge graph corresponding to the subject knowledge points and subject vocabulary.

Specifically, for each crawled word, the graph construction module 202 obtains its Baidu Baike content and Baidu Baike entry tags to judge whether the word belongs to the subject's vocabulary. It traverses the vocabulary depth-first to obtain the associations among the subject knowledge points and subject vocabulary, and builds the knowledge graph from the subject knowledge points, the subject vocabulary, and their associations.

The transcription module 203 is configured to transcribe the audio extracted from the audio or video to be labeled into text, using the subject vocabulary as a hotword resource during transcription.

Specifically, the transcription module 203 can use existing speech transcription technology to transcribe the audio extracted from the audio or video to be labeled into text. However, owing to the complexity of Chinese, the accuracy of conventional speech transcription is generally low and must improve substantially before it meets practical needs. For educational audio/video containing a large amount of specialized vocabulary, transcription accuracy may be lower still.

For this reason, in embodiments of the present invention, the transcription module 203 uses the crawled specialized vocabulary as a hotword resource during speech transcription. Speech decoding uses conventional acoustic and language models, and transcription accuracy can be greatly improved without modifying the current models.

The keyword extraction module 204 is configured to extract keywords from the text.

Specifically, the keyword extraction module 204 comprises a word segmentation unit, a computing unit, and an extraction unit. To extract keywords, the word segmentation unit segments the text into words; the computing unit calculates the TF-IDF value of each word; and the extraction unit extracts the keywords of the text according to those values. For example, the words whose TF-IDF value exceeds a set threshold can be extracted as keywords, or the set number of top-ranked words in descending order of TF-IDF value can be extracted as keywords. Under the same threshold, different texts may yield different numbers of keywords.

The accuracy of the keywords that the keyword extraction module 204 extracts with TF-IDF depends only weakly on the accuracy of speech transcription: even when transcription accuracy is below 50%, accurate keyword information can still be extracted.

Note that the keyword extraction module 204 can extract one or more words in the text as keywords; the number of keywords (typically 3 to 5) can be set according to user needs.

The information determination module 205 is configured to determine the subject and knowledge point to which the audio or video belongs according to the association between the keywords and the knowledge graph.

Specifically, once the keywords have been determined, the information determination module 205 determines the subject and knowledge point to which the audio or video belongs according to the association between the keywords and the subject knowledge graphs. For example, if the keywords extracted by the keyword extraction module 204 occur most often at the knowledge point "solving equations" in the mathematics knowledge graph, the audio or video is determined to belong to the subject mathematics and the knowledge point solving equations.

The label creation module 206 is configured to create labels for the audio or video, the labels comprising the keywords and the subject and knowledge point to which the audio or video belongs. For example, if 5 keywords are obtained, the label creation module 206 creates 7 labels for the audio or video: the 5 keywords, the subject, and the knowledge point.
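One way to picture how the runtime modules fit together is the following sketch, in which each module is an injected callable. The class name, field names, and signatures are illustrative assumptions; the crawling module 201 and graph construction module 202 are assumed to have run beforehand and to be captured inside the `determine` callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LabelingSystem:
    transcribe: Callable[[bytes], str]                # transcription module 203
    extract_keywords: Callable[[str], Sequence[str]]  # keyword extraction module 204
    determine: Callable[[Sequence[str]], tuple]       # information determination module 205

    def label(self, audio: bytes) -> list:
        """Run the runtime pipeline and return the labels (module 206)."""
        text = self.transcribe(audio)
        keywords = list(self.extract_keywords(text))
        subject, knowledge_point = self.determine(keywords)
        return keywords + [subject, knowledge_point]


# Stub modules for illustration; a real system would plug in a speech
# recognizer, a TF-IDF extractor, and a knowledge graph lookup.
system = LabelingSystem(
    transcribe=lambda audio: "solve the equation for the unknown",
    extract_keywords=lambda text: [w for w in text.split()
                                   if w in {"equation", "unknown"}],
    determine=lambda keywords: ("mathematics", "solving equations"),
)
print(system.label(b""))
```

Keeping the modules as separate callables reflects the later remark that the modules need not be physically separate units and can be distributed or combined as needed.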

The system provided by the embodiments of the present invention uses current speech transcription technology and abundant internet data to transcribe audio/video resources, extract keywords, and determine the subject and knowledge point from the knowledge graph. It mines the content of audio/video resources thoroughly, helps surface high-quality resources promptly, and provides a sound basis for downstream services such as resource push.

To further optimize the automatically assigned labels so that they better reflect the substance of the audio or video, another embodiment of the system further comprises an optimization module (not shown), configured to count the number of times users select each label and to add, delete, or replace labels according to the selection counts. For example, labels whose selection count is above a set threshold are retained, and labels whose selection count is below the threshold are deleted or replaced. The hotword resource and the knowledge graph are then refined according to the labels deleted or added, and improved labels are created again.

Identical or similar parts of the embodiments in this specification can be cross-referenced. Because the system embodiment is substantially similar to the method embodiment, it is described briefly; for related details, see the description of the method embodiment. The system embodiment described above is merely illustrative: modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network elements. Some or all of the modules can be selected as needed to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The component embodiments of the present invention can be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for carrying out part or all of the method described herein.

The embodiments of the present invention have been described in detail above, and specific examples have been used herein to set forth the invention; the description of the above embodiments is intended only to help in understanding the method and apparatus of the invention. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, this description should not be construed as limiting the invention.

Claims

1. A method for automatic labeling of audio and video, characterized by comprising:

2. The method according to claim 1, characterized in that there are one or more keywords.

3. The method according to claim 1, characterized in that extracting keywords from the text comprises:

Segmenting the text into words;

Calculating the TF-IDF value of each word;

4. The method according to claim 1, characterized in that the association between the keywords and the knowledge graph comprises the positions at which, and the number of times, the keywords occur in the knowledge graph.

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:

6. A system for automatic labeling of audio and video, characterized by comprising:

A keyword extraction module, configured to extract keywords from the text;

7. The system according to claim 6, characterized in that there are one or more keywords.

8. The system according to claim 6, characterized in that the keyword extraction module comprises:

A computing unit, configured to calculate the TF-IDF value of each word;

9. The system according to claim 6, characterized in that the association between the keywords and the knowledge graph comprises the positions at which, and the number of times, the keywords occur in the knowledge graph.

10. The system according to any one of claims 6 to 9, characterized in that the system further comprises: