CN102568478B

CN102568478B - Video play control method and system based on voice recognition

Info

Publication number: CN102568478B
Application number: CN201210025924.8A
Authority: CN
Inventors: 吴昊宇; 邓龙; 姚键; 邱丹; 潘柏宇; 卢述奇; 刘睿姝
Original assignee: 1Verge Internet Technology Beijing Co Ltd
Current assignee: Beijing Alibaba Music Technology Co Ltd
Priority date: 2012-02-07
Filing date: 2012-02-07
Publication date: 2015-01-07
Anticipated expiration: 2032-02-07
Also published as: CN102568478A

Abstract

The invention discloses a video control method based on speech recognition. The method comprises: training on a user's voice to extract voice features and storing them in a voice feature library; receiving a voice control command from the user and comparing it with the stored voice features of that user; and, when the voice features of the command match the user's voice features stored on the server, extracting the voice control command and controlling video playback based on it. This scheme overcomes the defect of the prior art that speech recognition either runs on a stand-alone machine or requires dedicated software to be downloaded. In addition, because the voice features are stored in the voice feature library per specific person, speaker-dependent recognition can be achieved, and recognition and control with this method are highly accurate. The invention also discloses a video control system based on speech recognition.

Description

Video playback control method and system based on speech recognition

Technical field

The present invention relates to a video control method, in particular to a video playback control method based on speech recognition, and belongs to the field of speech recognition.

Background technology

At present, the task of computer speech recognition is to enable a computer to understand the sentences or commands that humans speak and to take the corresponding action.

Since the 1970s, research on computer speech recognition has made breakthrough progress. Speech recognition is now widely used in many fields, for example voice dialing, voice search and voice control. Existing computer speech recognition systems nevertheless have some problems. Because speech recognition requires a large amount of computation, existing systems mostly run on a stand-alone machine, or require specific software to be downloaded and installed before recognition can be performed, and are not well integrated with Internet technology. The speech recognition system bundled with an operating system can only complete specific simple tasks and is not connected with other programs or with Internet applications, so it cannot meet the demands of the fast-developing Internet.

Because human languages are varied, and different people pronounce the same word differently, computer speech recognition systems can be divided, according to how much they depend on the speaker's voice and how the acoustic model is built, into speaker-dependent and speaker-independent recognition systems.

Summary of the invention

In view of the shortcomings of the prior art, the present invention provides a video playback control method based on speech recognition that allows more flexible video control. The invention also discloses a video playback control system based on speech recognition.

According to a first object of the present invention, a video playback control method based on speech recognition is provided, comprising:

training on the user's voice to extract voice features and storing them in a voice feature library;

receiving a voice control command from the user and comparing it with the stored voice features of the user;

wherein, when the voice features of the user match the user's voice features on the server, the voice control command is extracted and video playback is controlled based on it.

Further, preferably, training on the user's voice to extract voice features and storing them in the voice feature library specifically comprises:

calculating the acoustic parameters of the user's voice, extracting the key feature parameters that reflect the characteristics of the speech signal, and reducing their dimensionality;

obtaining training utterances of the control commands input by the user several times;

after preprocessing and voice feature extraction, obtaining the voice feature vector parameters of the specific user and storing them in the voice feature library on the network server.

Further, preferably, the key feature parameters are MFCC parameters.

Further, preferably, receiving the voice control command from the user and comparing it with the stored voice features of the user specifically comprises:

measuring the similarity between the voice control command subsequently input by the user and each command voice feature stored on the server, and judging whether the user's voice control command matches a feature in the voice feature library.

Further, preferably, the video control method is based on a FLASH player and further comprises:

completing recognition of the user's voice control command within 10 seconds and, once success is returned, performing the corresponding video control action.

With the above technical scheme, the invention overcomes the drawback of the prior art that speech recognition either runs on a stand-alone machine or requires dedicated software to be downloaded. Further, because the voice features of this application are stored in the voice feature library, speaker-dependent speech recognition can be achieved, and speech recognition and control with this method are more accurate.

According to another object of the present invention, a video playback control system based on speech recognition is provided, comprising:

a voice feature training unit, for training on the user's voice to extract voice features and storing them in a voice feature library;

a voice feature recognition unit, for receiving a voice control command from the user and comparing it with the stored voice features of the user;

a video control unit, for extracting the voice control command and controlling video playback based on it once the voice features of the user match the user's voice features on the server.

Further, preferably, the voice feature training unit specifically comprises:

a feature parameter extraction subunit, for calculating the acoustic parameters of the user's voice, extracting the key feature parameters that reflect the characteristics of the speech signal, and reducing their dimensionality;

a feature parameter training subunit, for obtaining training utterances of the control commands input by the user several times and, after preprocessing and voice feature extraction, obtaining the voice feature vector parameters of the specific user;

a sending subunit, for storing the above voice feature vector parameters in the voice feature library on the network server.

Further, preferably, the key feature parameters are MFCC parameters.

Further, preferably, the voice feature recognition unit specifically comprises:

a comparison subunit, for measuring the similarity between the voice control command subsequently input by the user and each command voice feature stored on the server, and judging whether the user's voice control command matches a feature in the voice feature library.

Further, preferably, the video control unit further comprises:

a FLASH player subunit;

a player control subunit, for completing recognition of the user's voice control command within 10 seconds and, once success is returned, performing the corresponding video control action.

With the above technical scheme, the system has all the advantages of the foregoing method: it overcomes the drawback of the prior art that speech recognition either runs on a stand-alone machine or requires dedicated software to be downloaded. Further, because the voice features of this application are stored in the voice feature library, speaker-dependent speech recognition can be achieved, and speech recognition and control with this method are more accurate.

Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.

Brief description of the drawings

The present invention is described in detail below with reference to the accompanying drawings, so that its advantages become clearer.

Fig. 1 is a flowchart of the video playback control method based on speech recognition according to the present invention;

Fig. 2 is a schematic diagram of voice-based video control in one embodiment of the present invention;

Fig. 3 is a schematic diagram of voice training in one embodiment of the present invention;

Fig. 4 is a flowchart of speech recognition video control in one embodiment of the present invention;

Fig. 5 is a flowchart of speech recognition video control in another embodiment of the present invention;

Fig. 6 is a structural diagram of the video playback control system based on speech recognition according to the present invention;

Fig. 7 is a schematic diagram of the voice feature training unit in one embodiment of the present invention;

Fig. 8 is an architecture diagram of the voice feature training unit in one embodiment of the present invention;

Fig. 9 is a schematic diagram of the voice feature recognition unit in one embodiment of the present invention;

Fig. 10 is a schematic diagram of the video control unit in one embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and specific embodiments.

Method embodiment one

The present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of the video playback control method based on speech recognition according to the present invention, and Fig. 2 is a schematic diagram of voice-based video control in one embodiment of the present invention.

According to this embodiment, the video playback control method based on speech recognition comprises:

S101: training on the voice of a specific user to extract voice features;

S102: storing the voice features of that specific user in the voice feature library;

S103: receiving a voice control command from the user;

S104: comparing the received voice control command with the stored voice features of the user;

S105: when the voice features of the user match the user's voice features on the server, extracting the voice control command and controlling video playback based on it.

In step S102, the user name and account can be stored in the voice feature library together with the specific voice features. In a preferred embodiment, this voice feature library is a database on an Internet server.
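A minimal sketch of such a per-account library follows, assuming an in-memory dictionary stands in for the server database. The class name `VoiceFeatureLibrary`, its methods and the account strings are hypothetical and are not taken from the patent.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the server-side voice feature library.
# A real deployment would use a database on the Internet server.
FeatureVector = List[float]


class VoiceFeatureLibrary:
    def __init__(self) -> None:
        # account -> command word -> list of feature frames
        self._store: Dict[str, Dict[str, List[FeatureVector]]] = {}

    def save(self, account: str, command: str, frames: List[FeatureVector]) -> None:
        """Store the trained features of one command word for one account."""
        self._store.setdefault(account, {})[command] = frames

    def has_training(self, account: str) -> bool:
        """Return True if the account has stored any trained features."""
        return bool(self._store.get(account))

    def templates(self, account: str) -> Dict[str, List[FeatureVector]]:
        """Return all stored command templates for the account (empty if none)."""
        return dict(self._store.get(account, {}))


library = VoiceFeatureLibrary()
library.save("user_a", "play", [[0.1, 0.2], [0.3, 0.4]])
print(library.has_training("user_a"))  # True
print(library.has_training("user_b"))  # False
```

Keying the store by account is what would let a check such as `has_training` decide whether to prompt a user for training before recognition starts.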

Further, step S103 comprises:

The video control method described in this application is based on a FLASH player and further comprises:

Method embodiment two:

This embodiment describes the invention further. The application mainly comprises three steps: a voice feature training step, a voice feature recognition step and a video control step, each of which is described in detail below.

As shown in Fig. 3, the method mainly comprises the following steps:

A specific registered user opens a web page, in which a speech recognition FLASH application is displayed. FLASH technology is well known in the prior art and is not described in detail here.

When the system finds that this user has not yet carried out voice feature training, it prompts the user to do so; otherwise it proceeds directly to the next step.

The system provides some basic words, for example: start, pause, play, volume up, fast forward, and the user carries out voice feature training following these prompts.

The voice feature training step comprises:

Voice feature extraction stage: the acoustic parameters of the speech are calculated, the voice features are computed, and the key feature parameters that reflect the characteristics of the speech signal are extracted, achieving dimensionality reduction.

The speech recognition here uses MFCC and DTW. MFCC (Mel Frequency Cepstral Coefficients) is the most commonly used feature in frequency-domain analysis of audio and is widely applied. It takes the nonlinear characteristics of the human auditory system into account by using a linear scale at low frequencies and a logarithmic scale at high frequencies, so MFCC can segment the audio signal in a more reasonable way. For a piece of audio, n groups of MFCC parameters are obtained, where n is the number of audio frames, and the subsequent recognition process operates on these n groups of parameters.
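The scale behaviour described above can be illustrated with a commonly used Hz-to-mel conversion, m = 2595 log10(1 + f/700). The patent does not state which formula it uses, so this choice and the function name `hz_to_mel` are assumptions made for illustration.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Equal 500 Hz steps cover fewer mels as frequency rises,
# which is the compression of high frequencies described above.
low_step = hz_to_mel(500.0) - hz_to_mel(0.0)
high_step = hz_to_mel(4500.0) - hz_to_mel(4000.0)
print(low_step > high_step)  # True
```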

In isolated word recognition systems, DTW (Dynamic Time Warping) is the most commonly used algorithm. It uses dynamic programming to solve the template matching problem caused by utterances of different lengths, and is a classic algorithm in speech recognition. DTW first requires training a template for each isolated word to be recognized. The training samples also differ in length, so how to select the template is a problem that must be considered.
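The dynamic programming idea can be sketched as follows. This is a textbook form of DTW with a Euclidean frame distance and no path constraints, given as an assumption about the general technique rather than as the patent's exact procedure; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def frame_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: List[Sequence[float]], seq_b: List[Sequence[float]]) -> float:
    """Accumulated cost of the best monotonic alignment of two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in seq_a only
                                 cost[i][j - 1],      # advance in seq_b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


template = [[0.0], [1.0], [2.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]  # same shape, spoken more slowly
other = [[2.0], [1.0], [0.0]]               # a different pattern
print(dtw_distance(template, slow))   # 0.0
print(dtw_distance(template, other) > 0.0)  # True
```

The slower utterance aligns with the template at zero cost even though it has more frames, which is the length mismatch that DTW is meant to absorb.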

A common approach is to first calculate the average length of the audio samples, take the sample closest to the average length as the template, and use the remaining samples as training samples to train and adjust the template values. Finally, for a sample of the same length as the template, similarity and distance can be calculated to perform recognition.
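The template selection step can be sketched as below, assuming each utterance is a list of feature frames and that length means frame count. The function name `split_template` is hypothetical, and the later step of adjusting the template values is not shown.

```python
from typing import List, Sequence, Tuple

Sample = List[Sequence[float]]  # one utterance as a list of feature frames


def split_template(samples: List[Sample]) -> Tuple[Sample, List[Sample]]:
    """Pick the sample whose frame count is closest to the mean as the template."""
    if not samples:
        raise ValueError("at least one training sample is required")
    mean_len = sum(len(s) for s in samples) / len(samples)
    # min() returns the first sample with the smallest deviation on ties.
    template_idx = min(range(len(samples)),
                       key=lambda i: abs(len(samples[i]) - mean_len))
    template = samples[template_idx]
    rest = [s for i, s in enumerate(samples) if i != template_idx]
    return template, rest


recordings = [[[0.0]] * 8, [[0.0]] * 11, [[0.0]] * 14]  # 8, 11 and 14 frames
template, rest = split_template(recordings)
print(len(template), [len(s) for s in rest])  # 11 [8, 14]
```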

This application mainly uses MFCC parameters, which give the overall voice features good noise immunity and robustness.

Training stage: the user inputs training utterances several times, and the system, through the two stages of preprocessing and voice feature extraction, obtains the feature vector of the specific user.

Finally, the web page asks the user whether to upload these voice features, and the user chooses whether to upload his or her voice features to the speaker-specific voice feature library or to keep them on the local computer.

Once the user's voice features have been trained, the user can proceed to the subsequent steps such as speech recognition and video control.

Method embodiment three:

The speech recognition step comprises:

receiving the voice input by the user;

measuring the similarity between the voice control command subsequently input by the user and each command voice feature stored in the voice feature library;

judging, according to the degree of similarity, whether the user's voice control command matches a feature in the voice feature library.
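The three steps above can be sketched as a nearest-template search with a rejection threshold. The distance function is passed in as a parameter; the mean absolute difference used in the demo, the threshold value and the name `match_command` are illustrative assumptions, not values from the patent.

```python
from typing import Callable, Dict, List, Optional

Features = List[float]


def mean_abs_diff(a: Features, b: Features) -> float:
    """Toy distance for equal-length vectors; a DTW distance could replace it."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def match_command(
    spoken: Features,
    templates: Dict[str, Features],
    threshold: float,
    distance: Callable[[Features, Features], float] = mean_abs_diff,
) -> Optional[str]:
    """Return the best-matching command, or None if nothing is close enough."""
    best_cmd, best_dist = None, float("inf")
    for command, template in templates.items():
        d = distance(spoken, template)
        if d < best_dist:
            best_cmd, best_dist = command, d
    return best_cmd if best_dist <= threshold else None


user_templates = {"play": [1.0, 2.0, 3.0], "pause": [3.0, 2.0, 1.0]}
print(match_command([1.1, 2.0, 2.9], user_templates, threshold=0.5))  # play
print(match_command([9.0, 9.0, 9.0], user_templates, threshold=0.5))  # None
```

Returning `None` when no template is within the threshold corresponds to the "does not match" outcome, for example when a different speaker issues the command.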

In one embodiment, the user needs to click a specific voice operation button while watching; Fig. 4 is a flowchart of speech recognition video control in one embodiment of the present invention.

After the operation button is clicked, a voice control command spoken within a specific time, for example within 10 seconds, is considered valid; it is recognized and matched to the corresponding operation command, and the player responds.

In another embodiment, the user needs to first say a certain trigger word, for example "start", into the microphone while watching; Fig. 5 is a flowchart of speech recognition video control in another embodiment of the present invention.

After the speech recognition program recognizes the trigger word, a voice control command spoken within a specific time, for example within 10 seconds, is considered valid; it is recognized and matched to the corresponding operation command, and the player responds.

Further, if no voice control command is recognized within 10 seconds after the speech recognition program recognizes the trigger word, the program returns to the waiting phase; the trigger word must then be spoken into the microphone again before voice control can be performed.
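The trigger-word mode can be sketched as a small state machine. Timestamps are passed in explicitly so the behaviour is easy to follow; the class name `TriggerGate`, the constant names and the use of plain strings for recognized words are assumptions made for illustration.

```python
from typing import Optional

COMMAND_WINDOW_SECONDS = 10.0  # the example duration given in the text
TRIGGER_WORD = "start"         # the example trigger word given in the text


class TriggerGate:
    """Accept a command only within a fixed window after the trigger word."""

    def __init__(self, window: float = COMMAND_WINDOW_SECONDS) -> None:
        self.window = window
        self.armed_at: Optional[float] = None  # None means waiting for the trigger

    def on_word(self, word: str, now: float) -> Optional[str]:
        """Feed one recognized word with its timestamp; return a command or None."""
        if self.armed_at is not None and now - self.armed_at > self.window:
            self.armed_at = None  # window expired: back to the waiting phase
        if self.armed_at is None:
            if word == TRIGGER_WORD:
                self.armed_at = now
            return None  # any other word is ignored while waiting
        self.armed_at = None  # one command per trigger, then wait again
        return word


gate = TriggerGate()
print(gate.on_word("play", now=0.0))    # None (no trigger word yet)
print(gate.on_word("start", now=1.0))   # None (gate is now armed)
print(gate.on_word("play", now=4.0))    # play
print(gate.on_word("start", now=20.0))  # None (armed again)
print(gate.on_word("pause", now=35.0))  # None (window expired)
```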

This scheme addresses the problem that, if the speech recognition program monitored the user's microphone at all times, unintended operations could be triggered while the user is watching a video and spoil the viewing experience; it therefore has a good technical effect.

In addition, because the server stores the user's voice features, the next time the user opens the speech recognition program on another computer or on a mobile device, no retraining is needed; the stored voice features are used to perform speech recognition and control the video player. The application thus performs voice control based on a specific person and overcomes the drawback that recognition could not be used across multiple clients.

For example, after a user completes voice training and uploads the resulting voice features to the server, he can later use this speech recognition flash program on the same machine, on another machine or on a mobile device without retraining: he directly chooses one of the two recognition modes to start speech recognition, and recognition then leads to voice control.

This application uses flash technology, which is widely used on the Internet and offers wide coverage, easy distribution, ease of use and multi-terminal support. HTML5 technology can of course also be used; these are all known to those skilled in the art and are not described in detail here.

Method embodiment four:

An application example of the present invention is described below:

1. User A has UID=1 and downloads for the first time the speech recognition flash program prompted by the web page. The speaker-specific voice feature library holds no voice features for the user with UID=1, so the system prompts that the speech recognition function can be used only after voice training and provides voice training instructions. After training, user A can use speech recognition to control the video by voice.

2. User A, with UID=1, has completed voice training. From then on, whether on the same machine, another machine or a mobile device, he only needs to download or open the flash speech recognition extension to use speech recognition, and can enable it directly without repeating voice training. If the user adopts recognition mode 1, he clicks the START button and gives the command "play" within 10 seconds; the system completes speech recognition and then plays the video. If the user has further commands, he must click the START button again and give the control command within 10 seconds. In mode 2, the user says the trigger word "start", and the system waits 10 seconds for a subsequent command; if the user gives the command "play" within 10 seconds, the system responds and then returns to the state of waiting for the trigger word. If the user has further commands, he must say the trigger word again and then give the subsequent command.

3. User B attempts to use the ID of user A for speech recognition and gives a command after clicking to start playback. The server looks up the voice features of UID=1 and finds that the voice features of this command do not match the voice features of UID=1 in the speaker-specific voice feature library. It then displays a message prompting the user to register or to log in to his own account before performing speech recognition.

In view of the foregoing description, the technical advantages of the present invention are as follows:

1. High coverage: 99% of browsers are equipped with the flash plug-in, and many current mobile devices also support it, so the scheme can be widely deployed without special support.

2. Easy distribution: this speech recognition scheme does not require a specific program to be installed; the speech recognition program is downloaded automatically and runs on flash.

3. Ease of use: voice control of online video uses simple recognition commands, so specific playback control functions can be achieved with a small number of spoken words.

4. Multi-terminal support: the server records the user's voice features, so after changing computer or mobile device the user can perform voice control without retraining.

System embodiment one:

The system of the present invention is described in detail below with reference to the accompanying drawings. Fig. 6 is a structural diagram of the video playback control system based on speech recognition according to the present invention.

As shown in Fig. 6, the video control system based on speech recognition comprises:

Fig. 7 is a schematic diagram of the voice feature training unit in one embodiment of the present invention; Fig. 8 is an architecture diagram of the voice feature training unit in one embodiment of the present invention.

The voice feature training unit specifically comprises:

The key feature parameters are MFCC parameters.

The voice feature recognition unit specifically comprises:

a comparison subunit, for measuring the similarity between the voice control command subsequently input by the user and each command voice feature stored in the voice feature library, and judging whether the user's voice control command matches a feature in the voice feature library.

As shown in Fig. 10, the video control unit further comprises:

a FLASH player subunit;

This application overcomes the drawback of the prior art that speech recognition either runs on a stand-alone machine or requires dedicated software to be downloaded. Further, because the voice features of this application are stored in the voice feature library, speaker-dependent speech recognition can be achieved, and speech recognition and control with this method are more accurate.

Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as read-only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disks, terminal phone software, or optical discs.

Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical schemes described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A video control method based on speech recognition, comprising:

when a user has not carried out voice feature training, training on the user's voice to extract voice features, and storing the user's UID together with the specific voice features in a voice feature library on a server, keyed by the UID, wherein the voice feature library is a database on an Internet server;

after the voice feature library has stored the UID and the user's voice features, receiving a voice control command input by the user with the UID on the same machine, another machine or a mobile device, and comparing it with the user voice features stored in the voice feature library;

wherein, when the voice features of the user match the user's voice features on the server, the voice control command is extracted and video playback is controlled based on it;

wherein training on the user's voice to extract voice features and storing the UID together with the specific voice features in the voice feature library, keyed by the user's UID, specifically comprises:

after preprocessing and voice feature extraction, obtaining the voice feature vector parameters of the specific user and storing them, together with the user's UID, in the voice feature library on the network server;

wherein comparing the voice control command input by the user with the UID on the same machine, another machine or a mobile device with the stored user voice features specifically comprises:

measuring the similarity between the voice control command subsequently input by the user with the UID and each command voice feature stored in the voice feature library for that user's UID, and judging whether the user's voice control command matches a feature in the voice feature library;

wherein the user, while watching, needs to first say a certain trigger word into the microphone;

after the speech recognition program recognizes the trigger word, an operation command given within a specific time is considered valid, is recognized and matched to the corresponding operation command, and a response is made;

wherein, if no voice control command is recognized within the specific time after the speech recognition program recognizes the trigger word, the program returns to the waiting phase, and the trigger word must be spoken into the microphone again before voice control can be performed;

wherein, when user B attempts to use the UID of user A for speech recognition and gives a command after clicking to start playback, the server looks up the voice features for the UID of user A, finds that the voice features of the command do not match the voice features of user A stored for that UID in the speaker-specific voice feature library, and then displays a message prompting user B to register or to log in to his own account before performing speech recognition.

2. The video control method based on speech recognition according to claim 1, characterized in that the key feature parameters are MFCC parameters.

3. The video control method based on speech recognition according to claim 1, characterized in that the video control method is based on a FLASH player and further comprises:

4. A video control system based on speech recognition, comprising:

a voice feature training unit, for training on the user's voice to extract voice features when the user has not carried out voice feature training, and storing the user's UID together with the specific voice features in a voice feature library on a server, keyed by the UID, wherein the voice feature library is a database on an Internet server;

a voice feature recognition unit, for receiving, after the voice feature library has stored the UID and the user's voice features, a voice control command input by the user with the UID on the same machine, another machine or a mobile device, and comparing it with the user voice features stored in the voice feature library;

the voice feature training unit specifically comprising:

a feature parameter training subunit, for obtaining training utterances of the control commands input by the user several times and, after preprocessing and voice feature extraction, obtaining the voice feature vector parameters of the specific user;

the voice feature recognition unit specifically comprising:

a comparison subunit, for measuring the similarity between the voice control command subsequently input by the user with the UID and each command voice feature stored in the voice feature library for that user's UID, and judging whether the user's voice control command matches a feature in the voice feature library;

a sending subunit, for storing the above voice feature vector parameters, together with the user's UID, in the voice feature library on the network server, wherein the user, while watching, needs to first say a certain trigger word into the microphone;

wherein, if no voice control command is recognized within the specific time after the speech recognition program recognizes the trigger word, the program returns to the waiting phase, and the trigger word must be spoken into the microphone again before voice control can be performed;

5. The video control system based on speech recognition according to claim 4, characterized in that the key feature parameters are MFCC parameters.

6. The video control system based on speech recognition according to claim 4, wherein the video control unit further comprises:

a FLASH player subunit;

a player control subunit, for completing recognition of the user's voice control command within 10 seconds and, once success is returned, performing the corresponding video control action.