CN103680497B

CN103680497B - Speech recognition system and method based on video

Info

Publication number: CN103680497B
Application number: CN201210320742.3A
Authority: CN
Inventors: 王玲珑; 曹晨曦
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-08-31
Filing date: 2012-08-31
Publication date: 2017-03-15
Anticipated expiration: 2032-08-31
Also published as: CN103680497A

Abstract

The present invention proposes a speech recognition system based on video, comprising: a terminal unit for recording or receiving a video and collecting the voice signal in the video; a Cloud Server for receiving the voice signal from the terminal unit, extracting the voiceprint information in the voice signal, and matching that voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal; and a social server for receiving the video and the identity information of the speaker, looking up, according to that identity information, the identification number the speaker registered with the social server, and sending the video to the corresponding speaker according to the identification number. The invention also discloses a speech recognition method based on video. The invention learns a user's identity by recognizing the voiceprint and, once the identity has been matched, shares information such as video with the other party conveniently and accurately.

Description

Speech recognition system and method based on video

Technical field

The present invention relates to the technical field of speech recognition, and more particularly to a speech recognition system and method based on video.

Background technology

Speech recognition technology has been widely adopted in daily life, and this has raised a number of problems. One example is how to apply speech recognition in account systems or SNS-related products so that information such as video can be transmitted to or shared with another party efficiently and accurately. Current account systems and SNS-related products require the user to remember many contacts and friends. As contacts accumulate over time, it is easy to forget a friend one has met but does not know well, and when a user wants to share information with the friends who appear in a video but cannot recall who they are, the situation is awkward. At present these problems can only be solved through the user's own memory and manual analysis, which is inefficient and inaccurate.

Content of the invention

The present invention aims to solve at least one of the above technical problems.

To this end, one object of the present invention is to propose a speech recognition system based on video, which can conveniently and accurately identify the users in a video through speech recognition. A further object of the present invention is to propose a control device for a terminal unit.

To achieve the above objects, an embodiment of the first aspect of the present invention provides a mobile terminal control system, comprising: a terminal unit for recording or receiving a video and collecting the voice signal in the video; a Cloud Server for receiving the voice signal from the terminal unit, extracting the voiceprint information in the voice signal, and matching the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of multiple users, and the voiceprint information corresponds one-to-one with the identity information; and a social server for receiving the video and the identity information of the speaker, looking up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sending the video to the speaker of the corresponding voice signal according to the identification number.

The terminal unit control system according to the embodiments of the present invention matches the voice uttered by a user against the voices pre-stored in the voiceprint library; after a successful match, the user confirms the selection and controls the sharing of information such as video with the other party. Selection and control on the terminal unit are thus achieved without any other external device, and the process is accurate and easy to implement, with high accuracy, ease of use, and applicability.

In one embodiment of the present invention, the voiceprint information includes multiple voiceprint features, wherein the voiceprint features include acoustic features, lexical features, prosodic features, language features, and channel features.

In yet another embodiment of the present invention, the language features include one or more of language features, dialect features, and accent features.

Thus, the Cloud Server can match the voice from the terminal unit using voiceprint features of various forms, taking as many language features as possible into account, which makes it easier to identify the speaker of the voice.

In one embodiment of the present invention, the terminal unit is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the Cloud Server.

As a result, the voice signal obtained is clearer, which makes it easier to confirm and control the user's voice information.

In another embodiment of the present invention, the identification number that the speaker registered with the social server is an e-mail address or an instant chat ID.

Thus, through the e-mail address or instant chat ID used at registration, more identity information about the speaker can be obtained easily, so that the video can be sent to the speaker, and the accuracy and security of the system are easier to ensure.

An embodiment of the second aspect of the present invention proposes a speech recognition method based on video, comprising the following steps: a terminal unit records or receives a video, collects the voice signal in the video, and sends the voice signal to a Cloud Server;

the Cloud Server receives the voice signal, extracts the voiceprint information in the voice signal, and matches the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of multiple users, and the voiceprint information corresponds one-to-one with the identity information; and

a social server receives the video and the identity information of the speaker of the voice signal, looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sends the video to the speaker of the corresponding voice signal according to the identification number.

The speech recognition method based on video according to the embodiments of the present invention matches the voice uttered by a user against the voices pre-stored in the voiceprint library; after a successful match, the user confirms the selection and controls the sharing of information such as video with the other party. Selection and control on the terminal unit are achieved without any other external device, and the process is accurate and easy to implement, with high accuracy, ease of use, and applicability.

In another embodiment of the present invention, the language features include one or more of language features, dialect features, and accent features.

In yet another embodiment of the present invention, after the terminal unit collects the voice signal in the video, the method further comprises the following step: performing noise reduction on the voice signal and sending the noise-reduced voice signal to the Cloud Server.

This makes the voice signal obtained clearer, so that the user's voice information is easier to confirm and control.

In one embodiment of the present invention, the identification number that the speaker registered with the social server is an e-mail address or an instant chat ID.

Thus, the identification number is obtained from the speaker's registered e-mail address or instant chat ID, which provides more than one way of obtaining the speaker's identity information, so that the video can be sent to the speaker, and the accuracy and security of the system are easier to ensure.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is the structure chart of the speech recognition system based on video according to one embodiment of the invention；

Fig. 2 is the flow chart of the audio recognition method based on video according to one embodiment of the invention；And

Fig. 3 is a flow chart of a user selecting a friend using the speech recognition method based on video according to one embodiment of the invention.

Specific embodiment

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" should be interpreted broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, unless otherwise stated, "multiple" means two or more.

The speech recognition system based on video according to an embodiment of the present invention is described below with reference to Fig. 1.

As shown in Fig. 1, the speech recognition system 1000 based on video of the embodiment of the present invention includes: a terminal unit 100, a Cloud Server 200, and a social server 300.

The terminal unit 100 can record or receive a video and collects the voice signal in the video.

In one embodiment of the present invention, the terminal unit may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a personal computer (PC), or a camera device with a communication function. The terminal unit can record an audio or video clip itself, or receive an audio or video clip through another channel such as a network.

Because a video recorded or received by the terminal unit often contains considerable noise, which hinders analysis of the voice signal in the video, noise reduction needs to be applied to the voice signal.

In one embodiment of the present invention, after the terminal unit 100 collects the voice signal in the video, the terminal unit 100 further performs noise reduction on the voice signal and sends the noise-reduced voice signal to the Cloud Server 200. This makes the voice signal obtained clearer, so that the user's voice information is easier to confirm and control.
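The patent does not state which noise-reduction technique the terminal unit applies. Purely as an illustration, the sketch below uses basic spectral subtraction with NumPy. The function name `spectral_subtract`, the frame length, and the assumption that the first few frames contain only background noise are hypothetical choices, not part of the patent.

```python
import numpy as np


def spectral_subtract(signal, frame_len=256, noise_frames=4):
    """Reduce stationary noise by subtracting an estimated noise magnitude spectrum.

    Assumption (not from the patent): the first `noise_frames` frames of the
    recording contain background noise only.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    # Split into non-overlapping frames; trailing samples are dropped.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Average magnitude of the noise-only frames, per frequency bin.
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Subtract the noise estimate and clip negative magnitudes to zero.
    cleaned_mag = np.maximum(magnitude - noise_mag, 0.0)

    # Rebuild the time signal using the original phase.
    cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return cleaned.reshape(-1)
```

A production system would more likely use overlapping windowed frames and a voice-activity detector to locate noise-only segments; the non-overlapping framing here only keeps the sketch short.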

The Cloud Server 200 receives the voice signal from the terminal unit 100 and extracts the voiceprint information in the voice signal. The voiceprint information includes multiple voiceprint features: acoustic features, lexical features, prosodic features, language features, and channel features.

The voiceprint features are described in turn below.

(1) Acoustic features, such as the cepstrum. The cepstrum is the new spectral signal obtained by applying a Fourier transform to the logarithm of the signal's spectrum;

(2) Lexical features, such as speaker-dependent word n-grams and phoneme n-grams;

(3) Prosodic features, such as pitch and energy "gestures" described using n-grams;

(4) Language features, which in turn include one or more of language features, dialect features, and accent features. Thus, the Cloud Server can match the voice from the terminal unit using voiceprint features of various forms, taking as many language features as possible into account, which makes it easier to identify the speaker of the voice;

(5) Channel information, for example which channel was used.
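Item (1) above defines the cepstrum in words. A minimal NumPy sketch of the real cepstrum of one frame follows. It uses the inverse FFT for the second transform, which is the common convention; for a real-valued input frame the log-magnitude spectrum is real and symmetric, so the forward and inverse transforms differ only by a scale factor. The function name and the small `eps` guard against `log(0)` are illustrative assumptions.

```python
import numpy as np


def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum: transform of the log-magnitude spectrum of a frame."""
    spectrum = np.fft.fft(np.asarray(frame, dtype=float))
    # eps avoids taking the logarithm of zero for empty frequency bins.
    log_magnitude = np.log(np.abs(spectrum) + eps)
    return np.fft.ifft(log_magnitude).real
```

Practical voiceprint systems typically go a step further and use mel-frequency cepstral coefficients, which the patent does not mention explicitly.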

After extracting the voiceprint information in the voice signal, the Cloud Server 200 matches the voiceprint information against the voiceprint information of multiple users in the pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal.

The voiceprint library stores the identity information and voiceprint information of multiple users, and the voiceprint information corresponds one-to-one with the identity information. Because voiceprints are unique, the Cloud Server 200 can determine, by comparing voiceprints, whether the user currently speaking is a given user.

Specifically, a video may contain multiple channels of voice signals, each with a different speaker. Because each speaker's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the pre-stored voiceprint information of multiple users reveals which user produced that channel of the voice signal, that is, identifies its speaker.
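The patent does not specify how voiceprints are compared. As a hedged sketch, the matching step can be pictured as a nearest-match search over fixed-length feature vectors using cosine similarity. The vector representation, the function names, and the `threshold` below which no identity is returned are all assumptions introduced for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def identify_speaker(features, voiceprint_library, threshold=0.8):
    """Return the identity with the most similar stored voiceprint, or None.

    voiceprint_library maps identity information to one stored feature
    vector, mirroring the one-to-one correspondence described in the text.
    """
    best_identity, best_score = None, threshold
    for identity, stored in voiceprint_library.items():
        score = cosine_similarity(features, stored)
        if score >= best_score:
            best_identity, best_score = identity, score
    return best_identity


def identify_channels(channel_features, voiceprint_library):
    """Identify the speaker of each voice channel found in the video."""
    return [identify_speaker(f, voiceprint_library) for f in channel_features]
```

Returning `None` below the threshold is one way to avoid sending a video to the wrong person when no stored voiceprint is a good match.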

The social server 300 receives the video from the terminal unit 100 and the identity information of the speaker from the Cloud Server 200, and looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server 300.

In one embodiment of the present invention, the identification number that the speaker registered with the social server 300 may be an e-mail address or an instant chat ID.

The social server 300 sends the video to the speaker of the corresponding voice signal according to this identification number. Thus, through the e-mail address or instant chat ID used at registration, more identity information about the speaker can be obtained easily, so that the video can be sent to the speaker, and the accuracy and security of the system are easier to ensure.

The speech recognition system 1000 based on video of the present invention is described below, taking a video that includes three users as an example.

A user U records a video S with the terminal unit 100. The terminal unit 100 collects the voice signal V in the video S, performs noise reduction on the collected voice signal V, and then sends the noise-reduced voice signal V to the Cloud Server 200.

After receiving the voice signal V, the Cloud Server 200 extracts the voiceprint information in the voice signal, which contains three different voiceprint features: A, B, and C. The Cloud Server 200 matches this voiceprint information against the voiceprint information of multiple users in the pre-stored voiceprint library and obtains the following result: voiceprint feature A corresponds to user M, voiceprint feature B corresponds to user N, and voiceprint feature C corresponds to user W. It follows that the voice signal corresponding to voiceprint feature A was produced by user M, the voice signal corresponding to voiceprint feature B by user N, and the voice signal corresponding to voiceprint feature C by user W.

The Cloud Server 200 also stores the identity information of each user. The Cloud Server 200 sends the above matching result and the identity information of the corresponding users M, N, and W to the social server 300.

The social server 300 looks up, according to the received identity information of users M, N, and W, the identification numbers they registered with the social server 300, and then sends the video to users M, N, and W.
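The last stage of the three-user example can be sketched as a lookup followed by a send. The class below is a hypothetical stand-in for the social server 300: the `registered_ids` mapping, the `outbox` list used in place of real delivery, and the decision to skip speakers who have no registered identification number are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SocialServer:
    # Hypothetical registry: identity information -> registered
    # identification number (e-mail address or instant chat ID).
    registered_ids: dict
    # Stand-in for actual delivery: (identification number, video) pairs.
    outbox: list = field(default_factory=list)

    def share_video(self, video, identities):
        """Look up each speaker's identification number and queue the video."""
        for identity in identities:
            id_number = self.registered_ids.get(identity)
            if id_number is None:
                continue  # assumption: unregistered speakers are skipped
            self.outbox.append((id_number, video))
        return self.outbox
```

For the example in the text, `share_video` would be called with video S and the identities of users M, N, and W returned by the Cloud Server 200.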

The terminal unit control system according to the embodiments of the present invention matches the voice uttered by a user against the voices pre-stored in the voiceprint library; after a successful match, the user confirms the selection and controls the sharing of information such as video with the other party. Selection and control on the terminal unit are thus achieved without any other external device, and the process is accurate and easy to implement, with high accuracy, ease of use, and applicability. Moreover, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint library, noise reduction is applied after the terminal unit 100 collects the voice signal, so that the resulting voice signal is clearer.

As shown in Fig. 2, the speech recognition method based on video of the embodiment of the present invention comprises the following steps:

Step S201, the terminal unit records or receives a video, collects the voice signal in the video, and sends the voice signal to the Cloud Server. The terminal unit here may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a personal computer (PC), or a camera device with a communication function. The terminal unit can record an audio or video clip itself, or receive an audio or video clip through another channel such as a network. Because a video recorded or received by the terminal unit often contains considerable noise, which hinders analysis of the voice signal in the video, noise reduction needs to be applied to the voice signal, and the noise-reduced voice signal is sent to the Cloud Server. This makes the voice signal obtained clearer, so that the user's voice information is easier to confirm and control.

Step S202, the Cloud Server receives the voice signal, extracts the voiceprint information in the voice signal, and matches the voiceprint information against the voiceprint information of multiple users in the pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal. Specifically, a video may contain multiple channels of voice signals, each with a different speaker. Because each speaker's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the voiceprint information of multiple users in the pre-stored voiceprint library reveals which user produced that channel of the voice signal, that is, identifies its speaker.

Step S203, the social server receives the video and the identity information of the speaker of the voice signal, looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sends the video to the speaker of the corresponding voice signal according to the identification number. The identification number may be the e-mail address or instant chat ID used at registration. In this way, more identity information about the speaker can be obtained easily, so that the video can be sent to the speaker, and the accuracy and security of the system are easier to ensure.

Because the production of human speech is a complex physiological and physical process involving the language centers of the brain and the vocal organs, the voiceprint spectrograms of any two people differ. Each person's acoustic speech features are relatively stable yet also variable; they are not absolute or unchanging. By making full use of the uniqueness of each person's vocal organs as an identification password, the user can use the system more naturally and conveniently, anytime and anywhere.

From the perspective of what can be modeled mathematically, the features that automatic voiceprint recognition models can currently use include:

(1) Acoustic features (cepstrum);

(2) Lexical features (speaker-dependent word n-grams, phoneme n-grams);

(3) Prosodic features (pitch and energy "gestures" described using n-grams);

(4) Language, dialect, and accent information;

(5) Channel information (which channel is used); and so on.
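Items (2) and (3) in the list above both rely on n-grams. As a small illustration of what an n-gram feature is, the sketch below counts n-grams over any token sequence, which could be words, phonemes, or quantized pitch symbols. The function name and the use of raw counts as the feature are assumptions; the patent does not describe how its n-gram features are computed.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens in the sequence."""
    tokens = list(tokens)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

A speaker-dependent profile could then be built by comparing these counts, suitably normalized, with counts gathered from the same speaker's earlier recordings.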

If the voice uttered by the user matches a voice in the voiceprint library, then the voiceprint and semantics of the user's voice correspond to the voiceprint and voice pre-stored in the voiceprint library.

Voiceprints are unique: by comparing voiceprints, it can be determined whether the user currently speaking is the user in question, which prevents others from impersonating or imitating the owner to control the terminal unit and improves the security of terminal unit control. In addition, by comparing the semantics of the voice, the action the user wants the terminal unit to perform can be determined, so that the terminal unit can be controlled accurately and in line with the user's expectations.

As shown in Fig. 3, the flow in which a friend is selected using the speech recognition method based on video of the embodiment of the present invention comprises the following steps:

Step S301, a sound is produced.

Step S302, the sound information is detected and noise reduction is applied to obtain clearer voice information.

Step S303, features are extracted from a given piece of voice information according to the voiceprint features. The task of feature extraction is to extract and select acoustic or language features of the speaker's voiceprint that are highly separable and highly stable. Unlike speech recognition, voiceprint recognition must use "individualized" features, whereas the features used for speech recognition must be "common" across speakers. Although most current voiceprint recognition systems use features at the acoustic level, the features that characterize an individual are multifaceted.

Step S304, voiceprint registration produces the voiceprint model; this is the basic step for pattern matching of voiceprints.

Step S305, after features have been extracted from the voice information of a given voice, voiceprint verification and voiceprint identification are performed directly, leading to pattern matching. Pattern matching is carried out in conjunction with the voiceprint model. After pattern matching, the destination to which the information is to be sent can be confirmed.

For pattern recognition, there are the following main classes of methods:

Template matching: dynamic time warping (DTW) is used to align the training and test feature sequences; it is mainly used for fixed-phrase applications (usually text-dependent tasks);

Nearest neighbor: all feature vectors are retained during training; during recognition, the K nearest training vectors are found for each vector and recognition is performed on that basis. The model storage and similarity computation required are usually very large;

Neural networks: there are many forms, such as multilayer perceptrons and radial basis function (RBF) networks, which can be explicitly trained to distinguish a speaker from background speakers; the training cost is very high and the models generalize poorly;

Hidden Markov model (HMM): a single-state HMM, or a Gaussian mixture model (GMM), is usually used; it is a popular method and works relatively well;
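Of the methods listed above, dynamic time warping is the simplest to show in a few lines. The sketch below is the textbook dynamic-programming formulation over two sequences of scalar features, using absolute difference as the local cost. Real voiceprint features would be vectors per frame (for example cepstral coefficients) with a vector distance, and nothing here is taken from the patent beyond the name of the technique.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two scalar feature sequences."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost aligning seq_a[:i] with seq_b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because the alignment may stretch either sequence, the same phrase spoken at a different speed still yields a small distance, which is why the method suits fixed-phrase, text-dependent tasks.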

Step S306, once the correct destination for the information has been determined, the video is shared with that friend.

The method of the embodiments of the present invention is described below, taking as an example the use of voice to find a friend and share a video with that friend. First, a voiceprint library is established that stores the voiceprint information corresponding to each friend. Then, through feature extraction, acoustic or language features of the speaker's voiceprint that are highly separable and highly stable are extracted and selected. Unlike speech recognition, voiceprint recognition must use "individualized" features, whereas the features used for speech recognition must be "common" across speakers. Pattern matching is then performed from the perspective of what can be modeled mathematically. After a successful match against the voiceprint model, the person in the video is locked onto; if the matched speaker is a friend in the voiceprint library, the speaker's identity information is read and recognition is complete. Once the friend's identity information has been recognized, the video can be shared with the relevant friend through an SNS social application. This technique solves the problem that the user has to remember many contacts and friends, that as contacts accumulate it is easy to forget a friend one has met but does not know well, and that it is awkward when the user wants to share information with the friends in a video but cannot recall who they are. The technique can be applied in account systems or SNS-related products, and may in future be used in projects such as Baidu Space.

It should be noted that the user stores the voiceprint information of friends' voices in the voiceprint library in advance.
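The patent does not describe how a friend's voiceprint is stored in advance. One simple possibility, shown here only as a sketch, is to average the feature vectors from several of the friend's utterances into a single stored vector per identity. The function name `enroll`, the averaging strategy, and the rejection of duplicate identities (to keep the identity-to-voiceprint mapping one-to-one) are all assumptions.

```python
import numpy as np


def enroll(voiceprint_library, identity, utterance_features):
    """Store one averaged feature vector for a new identity.

    utterance_features: equal-length feature vectors, one per utterance.
    """
    if identity in voiceprint_library:
        # Keep the identity-to-voiceprint mapping one-to-one.
        raise ValueError(f"{identity!r} is already enrolled")
    samples = np.asarray(utterance_features, dtype=float)
    voiceprint_library[identity] = samples.mean(axis=0)
    return voiceprint_library
```

Averaging smooths out variation between utterances; a statistical model such as the GMM mentioned earlier would capture that variation more faithfully.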

The mobile terminal control method according to the embodiments of the present invention matches the voice uttered by the user against the voices pre-stored in the voiceprint library and controls the terminal unit after a successful match. Control of the mobile terminal is achieved without any other external device, and the process is simple and easy to implement, with high ease of use and applicability. Moreover, because the mobile terminal is controlled by voice messages uttered by the user, which are hard for others to imitate or impersonate, security is high. In addition, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint library, noise reduction is applied to the collected voice signal so that the resulting voice signal is clearer, which makes it easier for the sender to identify the correct recipients of the shared video.

Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flow chart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those skilled in the art are appreciated that to realize all or part of step that above-described embodiment method is carried Suddenly the hardware that can be by program to instruct correlation is completed, and described program can be stored in a kind of computer-readable storage medium In matter, the program upon execution, including one or a combination set of the step of embodiment of the method.

Additionally, each functional unit in each embodiment of the invention can be integrated in a processing module, it is also possible to It is that unit is individually physically present, it is also possible to which two or more units are integrated in a module.Above-mentioned integrated mould Block both can be realized in the form of hardware, it would however also be possible to employ the form of software function module is realized.The integrated module is such as Fruit using in the form of software function module realize and as independent production marketing or use when, it is also possible to be stored in a computer In read/write memory medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention without departing from its principles and purpose. The scope of the present invention is defined by the claims and their equivalents.

Claims

1. A video-based speech recognition system, characterized by comprising:

a terminal device, configured to record or receive a video and to collect the voice signal in the video;

a cloud server, configured to receive the voice signal from the terminal device, extract voiceprint information from the voice signal, and match the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the multiple users, and the voiceprint information corresponds one-to-one with the identity information; and

a social server, configured to receive the video and the identity information of the speaker, look up, according to the identity information of the speaker, the identification number that the speaker has registered on the social server, and send the video to the corresponding speaker of the voice signal according to the identification number.

2. The video-based speech recognition system according to claim 1, characterized in that the voiceprint information comprises multiple voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.

3. The video-based speech recognition system according to claim 2, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.

4. The video-based speech recognition system according to claim 1, characterized in that the terminal device is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the cloud server.

5. The video-based speech recognition system according to claim 1, characterized in that the identification number that the speaker has registered on the social server is an e-mail address or an instant-messaging ID.

6. A video-based speech recognition method, characterized by comprising the following steps:

a terminal device records or receives a video, collects the voice signal in the video, and sends the voice signal to a cloud server;

the cloud server receives the voice signal, extracts voiceprint information from the voice signal, and matches the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain the identity information of the speaker of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the multiple users, and the voiceprint information corresponds one-to-one with the identity information; and

a social server receives the video and the identity information of the speaker of the voice signal, looks up, according to the identity information of the speaker, the identification number that the speaker has registered on the social server, and sends the video to the corresponding speaker of the voice signal according to the identification number.

7. The video-based speech recognition method according to claim 6, characterized in that the voiceprint information comprises multiple voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.

8. The video-based speech recognition method according to claim 7, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.

9. The video-based speech recognition method according to claim 6, characterized in that, after collecting the voice signal in the video, the terminal device further performs the following step: performing noise reduction on the voice signal, and sending the noise-reduced voice signal to the cloud server.

10. The video-based speech recognition method according to claim 6, characterized in that the identification number that the speaker has registered on the social server is an e-mail address or an instant-messaging ID.
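
For illustration only, the matching and delivery steps recited in the method claims can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the invention: the claims do not specify how voiceprints are represented or compared, so the fixed-length feature vectors, the cosine-similarity comparison, the 0.8 threshold, and the names `match_voiceprint`, `share_video`, `registered_ids`, and `send` are all hypothetical. Voiceprint extraction and noise reduction are not shown.

```python
import math
from typing import Callable, Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_voiceprint(
    voiceprint: Sequence[float],
    library: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the patent
) -> Optional[str]:
    """Cloud-server step: return the identity whose enrolled voiceprint is most
    similar to the input, or None if no entry reaches the threshold."""
    best_identity: Optional[str] = None
    best_score = threshold
    for identity, enrolled in library.items():
        score = cosine_similarity(voiceprint, enrolled)
        if score >= best_score:
            best_identity, best_score = identity, score
    return best_identity


def share_video(
    video: str,
    voiceprint: Sequence[float],
    library: Dict[str, Sequence[float]],
    registered_ids: Dict[str, str],
    send: Callable[[str, str], None],
) -> Optional[str]:
    """Social-server step: look up the speaker's registered ID (for example an
    e-mail address) and send the video to it. Returns the ID used, or None."""
    identity = match_voiceprint(voiceprint, library)
    if identity is None:
        return None  # speaker not found in the voiceprint library
    recipient = registered_ids.get(identity)
    if recipient is None:
        return None  # speaker has no ID registered on the social server
    send(recipient, video)
    return recipient
```

In this sketch, `library` plays the role of the voiceprint library (one voiceprint per identity), `registered_ids` stands in for the social server's registration records, and `send` is whatever delivery mechanism the social server uses. A caller would pass the voiceprint extracted from the video's voice signal and receive back the ID the video was sent to, or `None` if the speaker could not be identified or has no registered ID.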