CN104978960A

CN104978960A - Photographing method and device based on speech recognition

Info

Publication number: CN104978960A
Application number: CN201510374888.XA
Authority: CN
Inventors: 陈包容
Original assignee: Individual
Current assignee: Individual
Priority date: 2015-07-01
Filing date: 2015-07-01
Publication date: 2015-10-14

Abstract

The invention discloses a photographing method and device based on speech recognition. The method comprises: obtaining a photographing code word input by a user; performing feature extraction on the photographing code word to obtain its feature vector; calculating the matching values between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in a sample library; and judging whether each matching value is smaller than a set value; if so, establishing a hidden Markov model of the photographing code word and storing it in the sample library; if not, executing the photographing action. The method and device solve the technical problem that each of multiple users must pre-record the same photographing code word: a given photographing code word needs to be pre-recorded only once, which improves the efficiency of speech-recognition-based photographing and the user experience.

Description

Photographing method and device based on speech recognition

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a photographing method and device based on speech recognition.

Background art

Smartphones, cameras and similar devices are used more and more widely in daily life, for example to photograph scenery or people that the user likes. Existing photographing is generally triggered by pressing a physical button or a virtual key on the screen. Both approaches introduce a delay when the photograph is taken; pressing is inconvenient, and the resulting photograph is often poor.

To address this problem, patent No. 201220601960.X proposes a method of photographing by means of speech recognition. In that method, different photographing code words are pre-recorded; during photographing it is judged whether the code word used by the user is one of the pre-recorded code words, and when the two are consistent the photographing device is controlled to perform the photographing action. However, that method requires a separate pre-recording for every photographing code word. When different users adopt the same photographing code word, the system cannot automatically match a code word pre-recorded by another user, so photographing efficiency is low and the user experience is poor. For example, when 100 users each adopt the custom photographing code word "cheese" to control a photographing device to shoot automatically, all 100 users must pre-record the code word "cheese".

Summary of the invention

The invention provides a photographing method and device based on speech recognition, to solve the technical problem that, in existing speech-recognition-based photographing, different users who adopt the same photographing code word must each pre-record that code word.

According to one aspect of the present invention, a photographing method based on speech recognition is provided, comprising:

obtaining a photographing code word input by a user;

performing feature extraction on the photographing code word to obtain a feature vector of the photographing code word;

calculating the matching value between the feature vector of the photographing code word and the hidden Markov model (HMM) of each photographing code word in a sample library;

judging whether each matching value is less than a set value; if so, establishing a hidden Markov model of the photographing code word and storing it in the sample library; if not, performing the photographing action.

Further, calculating the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library comprises:

calculating, by the Viterbi recognition algorithm, the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library.

Further, after obtaining the photographing code word input by the user and before performing feature extraction on it, the method also comprises:

pre-processing the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

Further, establishing the hidden Markov model of the photographing code word and storing it in the sample library comprises:

sending the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, establishing the hidden Markov model of the photographing code word and storing it in the sample library.

Further, the feature vector of the photographing code word is the Mel frequency cepstral coefficients (MFCC) of the photographing code word.

Further, the photographing code word is in any one or more of Mandarin, a dialect and accented speech.

According to another aspect of the present invention, a voice control device is provided, comprising:

an acquisition device, configured to obtain a photographing code word input by a user;

a feature vector extraction device, configured to perform feature extraction on the photographing code word to obtain a feature vector of the photographing code word;

a matching value calculation device, configured to calculate the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in a sample library;

a judgment device, configured to judge whether each matching value is less than a set value; if so, to establish a hidden Markov model of the photographing code word and store it in the sample library; if not, to perform the photographing action.

Further, the matching value calculation device comprises:

a Viterbi recognition algorithm calculation device, configured to calculate, by the Viterbi recognition algorithm, the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library.

Further, the photographing device based on speech recognition also comprises:

a pre-processing device, configured to pre-process the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

Further, the judgment device also comprises:

an instruction sending device, configured to send the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, to establish the hidden Markov model of the photographing code word and store it in the sample library.

The present invention has the following beneficial effects:

The invention provides a photographing method and device based on speech recognition. By obtaining a photographing code word input by a user; performing feature extraction on the code word to obtain its feature vector; calculating the matching value between that feature vector and the hidden Markov model of each photographing code word in a sample library; and judging whether each matching value is less than a set value, then establishing and storing a hidden Markov model of the code word if so and performing the photographing action if not, the invention solves the technical problem that each of multiple users must pre-record the same photographing code word. A given photographing code word needs to be pre-recorded only once, which improves the efficiency of speech-recognition-based photographing and the user experience.

In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages. The present invention is described in further detail below with reference to the drawings.

Brief description of the drawings

The accompanying drawings, which form part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:

Fig. 1 is a flow chart of the speech-recognition-based photographing method of the preferred embodiment of the present invention;

Fig. 2 is a flow chart of the speech-recognition-based photographing method of the preferred embodiment of the present invention for one photographing scenario;

Fig. 3 is a schematic structural diagram of the speech-recognition-based photographing device of the preferred embodiment of the present invention.

Description of reference numerals:

10, acquisition device; 20, feature vector extraction device; 30, matching value calculation device; 40, judgment device.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings, but the present invention can be implemented in many different ways as defined and covered by the claims.

With reference to Fig. 1, the preferred embodiment of the present invention provides a photographing method based on speech recognition, comprising:

Step S101: obtain the photographing code word input by the user;

Step S102: perform feature extraction on the photographing code word to obtain the feature vector of the photographing code word;

Step S103: calculate the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library;

Step S104: judge whether each matching value is less than the set value; if so, establish a hidden Markov model of the photographing code word and store it in the sample library; if not, perform the photographing action.

With the speech-recognition-based photographing method of the present invention, a new hidden Markov model is established and stored only when every matching value against the sample library is less than the set value; otherwise the photographing action is performed directly. This solves the technical problem that each of multiple users must pre-record the same photographing code word: a given photographing code word needs to be pre-recorded only once, which improves the efficiency of speech-recognition-based photographing and the user experience.
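The decision in step S104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the callables `score`, `train` and `take_photo`, the dictionary-based sample library and the key `new_key` are all hypothetical names introduced here, and the matching value is assumed to be a similarity score in which larger means a better match.

```python
from typing import Callable, Dict, List

Features = List[List[float]]  # one feature vector per frame (assumed layout)


def handle_code_word(
    features: Features,
    library: Dict[str, object],
    score: Callable[[Features, object], float],
    train: Callable[[Features], object],
    take_photo: Callable[[], None],
    set_value: float,
    new_key: str,
) -> str:
    """Return "enrolled" if a new model was stored, "photo" if a photo was taken."""
    # Step S103: one matching value per model already in the sample library.
    matches = [score(features, model) for model in library.values()]
    # Step S104: every matching value below the set value means no stored
    # code word fits, so a new model is established and stored.
    # Note that all([]) is True, so an empty library also leads to enrolment.
    if all(m < set_value for m in matches):
        library[new_key] = train(features)
        return "enrolled"
    # At least one stored model matches well enough: take the photograph.
    take_photo()
    return "photo"
```

With an empty library the first utterance is enrolled; a later utterance that scores at or above the set value against the stored model triggers the photograph without any new recording.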

The hidden Markov model (HMM) of this embodiment is a doubly stochastic process. One process describes the statistical features of the short-time stationary segments of a non-stationary signal (the instantaneous features of the signal, which can be observed directly). The other describes how each short-time stationary segment transitions to the next, i.e. the dynamics of the short-time statistical features (which are implicit in the observation sequence). Human speech is also such a doubly stochastic process, so a hidden Markov model describes the production of a speech signal quite accurately.

Optionally, calculating the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library comprises: calculating that matching value by the Viterbi recognition algorithm.

Optionally, after obtaining the photographing code word input by the user and before performing feature extraction on it, the method also comprises: pre-processing the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

Generally, a speech signal must be digitized before it is processed; this is analog-to-digital (A/D) conversion. A/D conversion involves two steps, sampling and quantization, and yields a signal that is discrete in both time and amplitude. According to the Nyquist sampling theorem, the sampling frequency must be more than twice the highest frequency of the original signal so that no information is lost during sampling and the waveform of the original signal can be accurately reconstructed from the samples. In this embodiment, after A/D conversion, the photographing code word further undergoes power amplification, gain control or high-pass filtering. The purpose of the high-pass filtering is to remove low-frequency interference, especially 50 Hz or 60 Hz mains hum, and thereby to boost the high-frequency part that is useful for speech recognition. This flattens the signal spectrum and facilitates spectral analysis or vocal-tract parameter analysis.
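As an illustration of the high-pass filtering step, a first-order filter of the form y[n] = x[n] - a*x[n-1] is one common choice in speech front ends. The sketch below assumes that form; the coefficient `alpha = 0.97` is a conventional value chosen here for illustration and is not taken from the embodiment.

```python
from typing import List, Sequence


def first_order_high_pass(samples: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged."""
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        # Slowly varying (low-frequency) content largely cancels between
        # neighbouring samples, while rapid changes are preserved.
        out.append(samples[n] - alpha * samples[n - 1])
    return out
```

A constant (zero-frequency) input is strongly attenuated after the first sample, whereas a signal that alternates in sign at every sample is amplified, which is the behaviour expected of a high-pass filter.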

A speech signal is non-stationary and time-varying, but within a short time range (generally taken to be 10-30 ms) its characteristics remain essentially unchanged, so it can be regarded as a quasi-stationary process. The speech signal can therefore be divided into frames. The number of frames per second is generally about 33-100, depending on the circumstances. Framing can use contiguous segmentation, but overlapping segmentation is generally used so that the transition between frames is smooth and continuity is preserved. The overlapping part of one frame and the next is called the frame shift, and the ratio of frame shift to frame length is generally taken as 0-0.5. Framing cuts the signal into segments, which is equivalent to applying a rectangular window to the original signal in the time domain. Multiplication by a rectangular window in the time domain is equivalent to convolving the signal spectrum with the Fourier transform of the rectangular window in the frequency domain, which alters the spectrum of the original signal. For this reason a windowing operation is applied to each frame after framing, yielding a windowed speech signal. In this embodiment, the Hamming window function is used to window the photographing code word signal after high-pass filtering.
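Framing with overlap followed by Hamming windowing can be sketched as below. The function names and the `overlap` parameter (the number of samples shared by consecutive frames) are assumptions made for this illustration; the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import math
from typing import List, Sequence


def hamming(n: int) -> List[float]:
    """Hamming window: w[k] = 0.54 - 0.46 * cos(2*pi*k / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def frame_and_window(samples: Sequence[float], frame_len: int, overlap: int) -> List[List[float]]:
    """Split samples into overlapping frames and multiply each by a Hamming window."""
    if not 0 <= overlap < frame_len:
        raise ValueError("overlap must satisfy 0 <= overlap < frame_len")
    step = frame_len - overlap  # distance between the starts of consecutive frames
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

For example, at an assumed sampling rate of 8 kHz, `frame_len=200` and `overlap=100` would give 25 ms frames with a ratio of 0.5 between the overlap and the frame length, within the ranges stated above.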

Optionally, establishing the hidden Markov model of the photographing code word and storing it in the sample library comprises: sending the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, establishing the hidden Markov model of the photographing code word and storing it in the sample library. Asking the user whether to share fully respects the user's wishes regarding the sharing of the code word that the user has input, and improves the user's photographing experience.

Optionally, the feature vector of the photographing code word is the Mel frequency cepstral coefficients of the photographing code word. There are several methods for extracting feature parameters from a speech signal. Linear prediction coefficients (LPC) are based on the mechanism of speech production and describe vocal-tract characteristics; linear prediction cepstral coefficients (LPCC) are parameters derived from LPC. Neither of these parameters, however, makes full use of the auditory characteristics of the human ear. The human auditory system is in fact a special non-linear system whose sensitivity differs across frequencies, following an approximately logarithmic relationship. This embodiment uses Mel frequency cepstral coefficients (MFCC) as the feature parameters of the speech signal; MFCC reflect the human auditory mechanism well.

Optionally, the photographing code word is in any one of Mandarin, a dialect and accented speech. Providing different types of photographing code words enriches the variety of code words in the sample library and makes photographing more playable and more fun.

With reference to Fig. 2, the speech-recognition-based photographing method is described in more detail below for one photographing scenario.

Photographing scenario: user A takes a voice-controlled photograph using a photographing code word whose content is "cheese" (the sample library contains no hidden Markov model corresponding to a code word with the content "cheese"). In this embodiment, the detailed process by which user A takes the voice-controlled photograph is as follows:

Step S201: obtain the photographing code word input by the user, specifically the speech signal of the code word whose content is "cheese". In this embodiment, before the user inputs the code word, the microphone type can be selected in advance (for example an earphone microphone, the system's built-in microphone or another microphone) and the microphone can be configured (for example by adjusting its volume).

Step S202: pre-process the speech signal. Specifically, first convert it to a digital signal by A/D conversion; then apply high-pass filtering to the digital signal using a first-order high-pass filter; finally apply windowing to the filtered digital signal using the Hamming window function.

Step S203: calculate the Mel frequency cepstral coefficients of the speech signal and use them as the feature vector of the speech signal. Specifically, the frequency axis of the spectrum of the speech signal is first transformed to the Mel frequency scale, and the result is then transformed to the cepstral domain to obtain the cepstral coefficients (MFCC). In this embodiment, the conversion between the Mel frequency scale and frequency is given by a formula [formula not preserved in this text] whose two variables are the actual linear frequency and the Mel frequency.
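The conversion formula referred to in step S203 is not present in this text. A widely used form of the mapping is Mel(f) = 2595 * log10(1 + f / 700), with f in Hz; whether the embodiment uses exactly these constants is an assumption. The sketch below implements that mapping, its inverse, and a type-II discrete cosine transform that takes log Mel filter-bank energies to cepstral coefficients. The filter-bank stage itself is omitted, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def hz_to_mel(f_hz: float) -> float:
    """Map linear frequency in Hz to the Mel scale (commonly used form, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def cepstrum_from_log_energies(log_energies: Sequence[float], n_coeffs: int) -> List[float]:
    """Unnormalized DCT-II of log Mel filter-bank energies, giving cepstral coefficients."""
    m = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / m) for j, e in enumerate(log_energies))
        for k in range(n_coeffs)
    ]
```

In a full MFCC front end, the power spectrum of each windowed frame would first be passed through filters spaced evenly on the Mel scale, and the logarithm of each filter output would form the `log_energies` input.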

Step S204: calculate the matching value between the feature vector of the photographing code word whose content is "cheese" and the hidden Markov model of each photographing code word in the sample library. In this embodiment, these matching values are calculated by the Viterbi recognition algorithm.
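One way to obtain such a matching value is the log probability of the best state path found by the Viterbi algorithm, which fits the convention that a value below the set value means no match. The sketch below assumes that per-frame emission log probabilities have already been evaluated for each state; the names `safe_log` and `viterbi_log_score` and the argument layout are introduced here for illustration only.

```python
import math
from typing import List, Sequence


def safe_log(p: float) -> float:
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi_log_score(
    log_pi: Sequence[float],
    log_a: Sequence[Sequence[float]],
    log_b: Sequence[Sequence[float]],
) -> float:
    """Log probability of the best state path.

    log_pi[j]   : log initial probability of state j
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log emission probability of frame t in state j
    """
    n_states = len(log_pi)
    # Initialization with the first frame.
    delta: List[float] = [log_pi[j] + log_b[0][j] for j in range(n_states)]
    # Recursion: keep only the best predecessor for every state.
    for t in range(1, len(log_b)):
        delta = [
            max(delta[i] + log_a[i][j] for i in range(n_states)) + log_b[t][j]
            for j in range(n_states)
        ]
    # Termination: best score over all final states.
    return max(delta)
```

Working in the log domain avoids numerical underflow on long utterances. In practice the score is often divided by the number of frames before it is compared with a fixed set value, although the text does not say whether the embodiment does so.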

Step S205: judge whether each matching value is less than the set value; if so, send the user an instruction asking whether the user agrees to share the photographing code word whose content is "cheese"; if not, perform the photographing action. In this embodiment the sample library contains no hidden Markov model of a code word with the content "cheese", and the matching values between the feature vector of the "cheese" code word and the hidden Markov models of the code words in the sample library are assumed to be all less than the set value. After the judgment, this embodiment therefore sends the user the instruction asking whether the user agrees to share the code word whose content is "cheese".

Step S206: after receiving the agree-to-share instruction sent by the user, establish the hidden Markov model of the photographing code word whose content is "cheese" and store it in the sample library.

Once the hidden Markov model of the "cheese" code word adopted by user A has been established in the sample library, other users who later use a code word with the content "cheese" for voice-controlled photographing do not need to pre-record it; they can complete the photograph directly with that code word. It should be noted that, in this embodiment, the photographing applications of the other users and of user A share the same server.

In this embodiment, the hidden Markov model of the code word whose content is "cheese" is established as follows. First, the parameter set of the hidden Markov model is defined as λ = (π, A, C, μ, U), where π is the initial state distribution, A is the state transition probability matrix, C is the mixture gain (weight) matrix, μ is the mean matrix of the mixture components and U is the covariance matrix of the mixture components. Next, the forward-backward algorithm is used to calculate the probability of producing the observation sequence given the model λ. Then the Viterbi algorithm is used to select the state sequence that is optimal given the observation sequence and the model, i.e. the one that best explains the observation sequence. Finally, the Baum-Welch algorithm is used to adjust the model parameters λ = (π, A, C, μ, U) so as to maximize the probability of producing the observation sequence given the model λ, and the hidden Markov model with the adjusted parameters is stored in the sample library.
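To make the roles of the parameters concrete, the sketch below evaluates a Gaussian-mixture emission density from (C, μ, U) for one state and computes the probability of an observation sequence given the model with the forward pass of the forward-backward algorithm. It is an illustration under simplifying assumptions: the covariance is taken to be diagonal (the text only says "covariance matrix"), no scaling is applied against underflow, and the function names are hypothetical. Baum-Welch re-estimation is not shown.

```python
import math
from typing import List, Sequence


def gaussian_mixture_density(
    o: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """b_j(o) = sum_m C[m] * N(o; mu[m], diag(U[m])) for a single state j."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # Log density of a diagonal-covariance Gaussian (assumed form).
        log_pdf = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
            for x, m, v in zip(o, mu, var)
        )
        total += c * math.exp(log_pdf)
    return total


def forward_probability(
    pi: Sequence[float],
    a: Sequence[Sequence[float]],
    b: Sequence[Sequence[float]],
) -> float:
    """P(O | lambda) by the forward recursion; b[t][j] is the emission density of frame t in state j."""
    n_states = len(pi)
    alpha: List[float] = [pi[j] * b[0][j] for j in range(n_states)]
    for t in range(1, len(b)):
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n_states)) * b[t][j]
            for j in range(n_states)
        ]
    return sum(alpha)
```

The forward pass sums over all state paths, whereas the Viterbi algorithm keeps only the best one; Baum-Welch combines the forward and backward quantities to re-estimate π, A, C, μ and U iteratively.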

Adopt the method for the present embodiment, solve multiple user all needs prerecording technical matters for same code word of taking pictures, achieve and only need carry out a prerecording for same code word of taking pictures, improve the efficiency of taking pictures based on speech recognition, improve Consumer's Experience.

With reference to Fig. 3, according to another aspect of the present invention, a photographing device based on speech recognition is provided, comprising:

an acquisition device 10, configured to obtain a photographing code word input by a user;

a feature vector extraction device 20, configured to perform feature extraction on the photographing code word to obtain a feature vector of the photographing code word;

a matching value calculation device 30, configured to calculate the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in a sample library;

a judgment device 40, configured to judge whether each matching value is less than a set value; if so, to establish a hidden Markov model of the photographing code word and store it in the sample library; if not, to perform the photographing action.

Optionally, the matching value calculation device 30 comprises: a Viterbi recognition algorithm calculation device, configured to calculate, by the Viterbi recognition algorithm, the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library.

Optionally, the photographing device based on speech recognition also comprises: a pre-processing device, configured to pre-process the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

Optionally, the judgment device 40 also comprises: an instruction sending device, configured to send the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, to establish the hidden Markov model of the photographing code word and store it in the sample library.

For the specific working process and working principle of the speech-recognition-based photographing device of this embodiment, reference may be made to the working process and working principle of the speech-recognition-based photographing method of this embodiment.

The foregoing is only the preferred embodiment of the present invention and is not intended to limit the present invention. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A photographing method based on speech recognition, characterized by comprising:

obtaining a photographing code word input by a user;

performing feature extraction on the photographing code word to obtain a feature vector of the photographing code word;

calculating the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in a sample library;

judging whether each said matching value is less than a set value; if so, establishing a hidden Markov model of the photographing code word and storing it in the sample library; if not, performing the photographing action.

2. The photographing method based on speech recognition according to claim 1, characterized in that calculating the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library comprises:

calculating, by the Viterbi recognition algorithm, the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library.

3. The photographing method based on speech recognition according to claim 2, characterized in that, after obtaining the photographing code word input by the user and before performing feature extraction on the photographing code word, the method also comprises:

pre-processing the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

4. The photographing method based on speech recognition according to claim 3, characterized in that establishing the hidden Markov model of the photographing code word and storing it in the sample library comprises:

sending the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, establishing the hidden Markov model of the photographing code word and storing it in the sample library.

5. The photographing method based on speech recognition according to claim 4, characterized in that

the feature vector of the photographing code word is the Mel frequency cepstral coefficients of the photographing code word.

6. The photographing method based on speech recognition according to any one of claims 1-5, characterized in that

the photographing code word is in any one or more of Mandarin, a dialect and accented speech.

7. A photographing device based on speech recognition, characterized by comprising:

an acquisition device (10), configured to obtain a photographing code word input by a user;

a feature vector extraction device (20), configured to perform feature extraction on the photographing code word to obtain a feature vector of the photographing code word;

a matching value calculation device (30), configured to calculate the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in a sample library;

a judgment device (40), configured to judge whether each said matching value is less than a set value; if so, to establish a hidden Markov model of the photographing code word and store it in the sample library; if not, to perform the photographing action.

8. The photographing device based on speech recognition according to claim 7, characterized in that the matching value calculation device (30) comprises:

a Viterbi recognition algorithm calculation device, configured to calculate, by the Viterbi recognition algorithm, the matching value between the feature vector of the photographing code word and the hidden Markov model of each photographing code word in the sample library.

9. The photographing device based on speech recognition according to claim 8, characterized in that the photographing device based on speech recognition also comprises:

a pre-processing device, configured to pre-process the photographing code word, the pre-processing comprising one or more of power amplification, gain control and high-pass filtering.

10. The photographing device based on speech recognition according to claim 9, characterized in that the judgment device (40) also comprises:

an instruction sending device, configured to send the user an instruction asking whether the user agrees to share, and, after receiving an agree-to-share instruction sent by the user, to establish the hidden Markov model of the photographing code word and store it in the sample library.