CN103680497A

CN103680497A - Voice recognition system and voice recognition method based on video

Info

Publication number: CN103680497A
Application number: CN201210320742.3A
Authority: CN
Inventors: 王玲珑; 曹晨曦
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-08-31
Filing date: 2012-08-31
Publication date: 2014-03-26
Anticipated expiration: 2032-08-31
Also published as: CN103680497B

Abstract

The invention provides a voice recognition system based on video. The system comprises terminal equipment, a cloud server and a social server, wherein the terminal equipment is used for recording or receiving the video and collecting voice signals in the video; the cloud server is used for receiving the voice signals from the terminal equipment, extracting voiceprint information in the voice signals, and matching the voiceprint information with the voiceprint information of multiple users in a prestored voiceprint library so as to acquire the identity information of voice signal senders; the social server is used for receiving the video and the identity information of the senders, searching the identity recognition numbers, which are registered on the social server, of the senders according to the identity information of the senders, and sending the video to the corresponding voice signal senders according to the identity recognition numbers. The invention further discloses a voice recognition method based on the video. The method comprises the following steps: acquiring the identity information of users through the recognition of voiceprint, and accurately sharing the information of the video and the like with each other after the identity information of the users is matched.

Description

Video-based speech recognition system and method

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a video-based speech recognition system and method.

Background

Speech recognition technology is now widely used in daily life, and this has raised a number of problems. For example, in account systems and SNS-related products, how can speech recognition be applied so that information such as video is sent to or shared with the other party efficiently and accurately? In current account systems and SNS-related products, users have to memorize many contacts and friends. Friends who were met only once and are not well known are easily forgotten, and when a user wants to share a video with the friends who appear in it, the user may find, awkwardly, that he or she cannot recall their identity information. At present these problems can only be addressed through the user's own memory and manual analysis, which is inefficient and inaccurate.

Summary of the invention

The present invention aims to solve at least one of the technical problems described above.

To this end, one object of the present invention is to propose a video-based speech recognition system that can conveniently and accurately identify the users in a video by their voices. Another object of the present invention is to propose a control device for a terminal device.

To achieve these objects, an embodiment of the first aspect of the present invention provides a mobile terminal control system comprising: a terminal device, for recording or receiving a video and collecting the voice signal in the video; a cloud server, for receiving the voice signal from the terminal device, extracting the voiceprint information in the voice signal, and matching the voiceprint information against the voiceprint information of a plurality of users in a prestored voiceprint library to obtain the identity information of the sender of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the plurality of users, and the voiceprint information corresponds one-to-one with the identity information; and a social server, for receiving the video and the identity information of the sender, looking up, according to the identity information of the sender, the identification number that the sender registered on the social server, and sending the video to the corresponding sender of the voice signal according to the identification number.

According to the terminal device control system of the embodiment of the present invention, the voice uttered by a user is matched against the voices prestored in the voiceprint library. After a successful match, the user confirms the selection and shares information such as the video with the other party. No other external equipment is needed for selection and control on the terminal device, the process is accurate and easy to implement, and the system offers high accuracy, ease of use and applicability.

In one embodiment of the invention, the voiceprint information comprises a plurality of voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features and channel features.

In yet another embodiment of the present invention, the language features comprise one or more of language-type features, dialect features and accent features.

Thus, the cloud server can match the voice from the terminal device using voiceprint features of various forms, taking as many language features as possible into account, which helps identify the sender of the voice.

In one embodiment of the present invention, the terminal device is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the cloud server.

Thus, the acquired voice signal is clearer, which makes it easier to confirm and control the user's voice information.

In another embodiment of the invention, the identification number that the sender registered on the social server is an e-mail address or an instant-messaging ID.

Thus, with the e-mail address or instant-messaging ID used at registration, more identity information about the sender can easily be obtained and the video can be sent to the sender, which helps safeguard the accuracy and security of the system.

An embodiment of the second aspect of the present invention proposes a video-based speech recognition method comprising the following steps: a terminal device records or receives a video, collects the voice signal in the video, and sends the voice signal to a cloud server;

the cloud server receives the voice signal, extracts the voiceprint information in the voice signal, and matches the voiceprint information against the voiceprint information of a plurality of users in a prestored voiceprint library to obtain the identity information of the sender of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the plurality of users, and the voiceprint information corresponds one-to-one with the identity information; and

a social server receives the video and the identity information of the sender of the voice signal, looks up, according to the identity information of the sender, the identification number that the sender registered on the social server, and sends the video to the corresponding sender of the voice signal according to the identification number.

According to the video-based speech recognition method of the embodiment of the present invention, the voice uttered by a user is matched against the voices prestored in the voiceprint library. After a successful match, the user confirms the selection and shares information such as the video with the other party. No other external equipment is needed for selection and control on the terminal device, the process is accurate and easy to implement, and the method offers high accuracy, ease of use and applicability.

In another embodiment of the present invention, the language features comprise one or more of language-type features, dialect features and accent features.

In another embodiment of the present invention, after the terminal device collects the voice signal in the video, the method further comprises the following step: performing noise reduction on the voice signal and sending the noise-reduced voice signal to the cloud server.

Thus, the acquired voice signal is clearer, which makes it easier to confirm and control the user's voice information.

In one embodiment of the invention, the identification number that the sender registered on the social server is an e-mail address or an instant-messaging ID.

Thus, the identification number is obtained from the e-mail address or instant-messaging ID that the sender registered, so identity information about the sender is available through more than one channel and the video can be sent to the sender, which helps safeguard the accuracy and security of the system.

Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a structural diagram of a video-based speech recognition system according to an embodiment of the invention;

Fig. 2 is a flowchart of a video-based speech recognition method according to an embodiment of the invention; and

Fig. 3 is a flowchart of a user selecting a friend with the video-based speech recognition method according to an embodiment of the invention.

Detailed description

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" should be interpreted broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, unless otherwise stated, "a plurality of" means two or more.

The video-based speech recognition system of the embodiment of the present invention is described below with reference to Fig. 1.

As shown in Fig. 1, the video-based speech recognition system 1000 of the embodiment of the present invention comprises a terminal device 100, a cloud server 200 and a social server 300.

The terminal device 100 can record or receive a video and collect the voice signal in the video.

In one embodiment of the invention, the terminal device may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a PC (personal computer) or a camera device with a communication function. The terminal device can record an audio or video clip itself or receive an audio or video clip through another channel such as a network.

Because a video recorded or received by the terminal device often contains considerable noise, which hampers analysis of the voice signal in the video, noise reduction needs to be applied to the voice signal.

In one embodiment of the invention, after the terminal device 100 collects the voice signal in the video, the terminal device 100 further performs noise reduction on the voice signal and sends the noise-reduced voice signal to the cloud server 200. Thus, the acquired voice signal is clearer, which makes it easier to confirm and control the user's voice information.
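
The description does not say which noise reduction algorithm the terminal device uses. Purely as an illustration, the following minimal sketch smooths a sequence of audio samples with a centered moving average, one of the simplest ways to attenuate high-frequency noise. The function name `moving_average` and the `window` parameter are hypothetical and do not come from the patent; a practical system would more likely use a dedicated speech-enhancement technique.

```python
def moving_average(samples, window=5):
    """Smooth audio samples with a centered moving average.

    Hypothetical stand-in for the unspecified noise reduction step:
    each output sample is the mean of the input samples inside a
    window centered on it (the window is truncated at the edges).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


# An isolated spike is spread out and reduced in amplitude.
denoised = moving_average([0, 0, 3, 0, 0], window=3)
```

A larger window removes more noise but also blurs the speech itself, so the window size is a trade-off that the patent leaves open.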

The cloud server 200 receives the voice signal from the terminal device 100 and extracts the voiceprint information in the voice signal. The voiceprint information comprises a plurality of voiceprint features: acoustic features, lexical features, prosodic features, language features and channel features.

The individual voiceprint features are described below.

(1) Acoustic features, for example the cepstrum. The cepstrum is the new spectral signal obtained by taking the Fourier transform of the logarithm of the signal spectrum;

(2) Lexical features, for example speaker-dependent word n-grams and phoneme n-grams;

(3) Prosodic features, for example pitch and energy "contours" described using n-grams;

(4) Language features, which in turn comprise one or more of language-type features, dialect features and accent features. Thus, the cloud server can match the voice from the terminal device using voiceprint features of various forms, taking as many language features as possible into account, which helps identify the sender of the voice;

(5) Channel information, such as which channel was used.
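
To make the acoustic feature in item (1) concrete, the following sketch computes a real cepstrum using only the Python standard library. It assumes the common convention of taking the inverse discrete Fourier transform of the log-magnitude spectrum, which for a real signal differs from a forward transform only in scaling. The function names `dft` and `real_cepstrum` and the `floor` parameter are illustrative and not taken from the patent.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse if requested)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(signal, floor=1e-12):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.

    `floor` guards against log(0) for spectral bins that are exactly zero.
    """
    spectrum = dft(signal)
    log_magnitude = [math.log(max(abs(s), floor)) for s in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]


# A signal plus a delayed, attenuated copy of itself (an echo at lag 8)
# produces a cepstral peak at quefrency 8.
echo = [0.0] * 64
echo[0], echo[8] = 1.0, 0.5
cepstrum = real_cepstrum(echo)
peak = max(range(1, 32), key=lambda i: cepstrum[i])
```

The naive transform keeps the sketch dependency-free; a real feature extractor would use an FFT and typically work on short overlapping frames of the signal.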

After extracting the voiceprint information in the voice signal, the cloud server 200 matches it against the voiceprint information of a plurality of users in the prestored voiceprint library to obtain the identity information of the sender of the voice signal.

The voiceprint library stores the identity information and voiceprint information of a plurality of users, with the voiceprint information and identity information in one-to-one correspondence. Because a voiceprint is unique, the cloud server 200 can determine, by comparing voiceprints, whether the person currently speaking is a given user.

Specifically, a video may contain multiple voice signals, each uttered by a different sender. Because each sender's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the prestored voiceprint information of the plurality of users reveals which user uttered that voice signal, that is, who its sender is.
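
The matching step can be sketched as a nearest-template search. The example below assumes that each voiceprint has already been reduced to a fixed-length numeric feature vector and uses cosine similarity with a rejection threshold; the patent does not prescribe either choice. All names (`cosine_similarity`, `identify_speaker`, `threshold`, and the sample users) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(features, library, threshold=0.8):
    """Return the identity with the most similar stored voiceprint.

    Returns None when no stored voiceprint reaches `threshold`, i.e.
    when the speaker is treated as not being in the library.
    """
    best_identity, best_score = None, threshold
    for identity, stored in library.items():
        score = cosine_similarity(features, stored)
        if score >= best_score:
            best_identity, best_score = identity, score
    return best_identity


# Hypothetical voiceprint library: identity -> stored feature vector.
library = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}

# One feature vector per voice signal found in the video.
voice_signals = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.05], [1.0, 1.0, 1.0]]
senders = [identify_speaker(v, library) for v in voice_signals]
```

The third voice signal resembles neither stored voiceprint closely enough, so it is left unidentified rather than being assigned to the closest user.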

The social server 300 receives the video from the terminal device 100 and the identity information of the sender from the cloud server 200, and looks up, according to the identity information of the sender, the identification number that the sender registered on the social server 300.

In one embodiment of the invention, the identification number that the sender registered on the social server 300 may be an e-mail address or an instant-messaging ID.

The social server 300 sends the video to the corresponding sender of the voice signal according to this identification number. Thus, with the e-mail address or instant-messaging ID used at registration, more identity information about the sender can easily be obtained and the video can be sent to the sender, which helps safeguard the accuracy and security of the system.

The video-based speech recognition system 1000 of the present invention is described below, taking a video that contains three users as an example.

User U records a video S with the terminal device 100. The terminal device 100 collects the voice signal V in the video S, performs noise reduction on the collected voice signal V, and then sends the noise-reduced voice signal V to the cloud server 200.

After receiving the voice signal V, the cloud server 200 extracts the voiceprint information in it; this voiceprint information contains three different voiceprint features, A, B and C. The cloud server 200 matches the voiceprint information against the voiceprint information of a plurality of users in the prestored voiceprint library and obtains the following matching result: voiceprint feature A corresponds to user M, voiceprint feature B corresponds to user N, and voiceprint feature C corresponds to user W. It follows that the voice signal corresponding to voiceprint feature A was uttered by user M, the voice signal corresponding to voiceprint feature B was uttered by user N, and the voice signal corresponding to voiceprint feature C was uttered by user W.

The cloud server 200 also stores each user's identity information. The cloud server 200 sends the matching result and the identity information of the corresponding users M, N and W to the social server 300.

The social server 300 looks up, according to the received identity information of users M, N and W, the identification numbers they registered on the social server 300, and then sends the video to users M, N and W.
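
The hand-off from the cloud server to the social server in this example can be sketched as two dictionary lookups. The labels A, B, C, M, N, W and S follow the example above; the function `share_video`, the identity strings and the registered addresses are hypothetical placeholders, and actual delivery (e-mail or instant message) is represented only by a returned list.

```python
def share_video(video_id, matches, identity_info, registry):
    """Return (identification_number, video_id) pairs for matched speakers.

    matches:       voiceprint feature -> user, as produced by the cloud server
    identity_info: user -> identity information stored on the cloud server
    registry:      identity information -> identification number registered
                   on the social server (e-mail address or instant-messaging ID)
    """
    deliveries = []
    for user in matches.values():
        identification_number = registry.get(identity_info[user])
        if identification_number is not None:  # skip unregistered users
            deliveries.append((identification_number, video_id))
    return deliveries


# Matching result from the example: A -> M, B -> N, C -> W.
matches = {"A": "M", "B": "N", "C": "W"}
identity_info = {"M": "identity-M", "N": "identity-N", "W": "identity-W"}
registry = {
    "identity-M": "m@example.com",
    "identity-N": "n@example.com",
    "identity-W": "chat-id-w",
}
deliveries = share_video("S", matches, identity_info, registry)
```

Keeping the identity lookup and the registry lookup separate mirrors the split of responsibilities in the description: the cloud server knows who spoke, and only the social server knows how to reach that person.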

According to the terminal device control system of the embodiment of the present invention, the voice uttered by a user is matched against the voices prestored in the voiceprint library. After a successful match, the user confirms the selection and shares information such as the video with the other party. No other external equipment is needed for selection and control on the terminal device, the process is accurate and easy to implement, and the system offers high accuracy, ease of use and applicability. Moreover, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint library, the terminal device 100 performs noise reduction after collecting the voice signal, which makes the acquired voice signal clearer.

As shown in Fig. 2, the video-based speech recognition method of the embodiment of the present invention comprises the following steps:

Step S201: the terminal device records or receives a video, collects the voice signal in the video, and sends the voice signal to the cloud server. The terminal device here may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a PC (personal computer) or a camera device with a communication function. The terminal device can record an audio or video clip itself or receive an audio or video clip through another channel such as a network. Because a video recorded or received by the terminal device often contains considerable noise, which hampers analysis of the voice signal in the video, noise reduction needs to be applied to the voice signal, and the noise-reduced voice signal is sent to the cloud server. Thus, the acquired voice signal is clearer, which makes it easier to confirm and control the user's voice information.

Step S202: the cloud server receives the voice signal, extracts the voiceprint information in it, and matches the voiceprint information against the voiceprint information of a plurality of users in the prestored voiceprint library to obtain the identity information of the sender of the voice signal. Specifically, a video may contain multiple voice signals, each uttered by a different sender. Because each sender's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the voiceprint information of the plurality of users in the prestored voiceprint library reveals which user uttered that voice signal, that is, who its sender is.

Step S203: the social server receives the video and the identity information of the sender of the voice signal, looks up, according to the identity information of the sender, the identification number that the sender registered on the social server, and sends the video to the corresponding sender of the voice signal according to the identification number. The identification number may be the e-mail address or instant-messaging ID used at registration. In this way, more identity information about the sender can easily be obtained and the video can be sent to the sender, which helps safeguard the accuracy and security of the system.

The production of human speech is a complex physiological and physical process between the language centers of the body and the vocal organs, so the voiceprint spectrograms of any two people differ. Each person's acoustic speech features are relatively stable yet also variable; they are not absolute or fixed. Making full use of the uniqueness of each person's vocal organs as an identification password lets users use the system conveniently and naturally, anywhere and at any time.

From the standpoint of what can be modeled mathematically, the features that automatic voiceprint recognition models can currently use include:

(1) acoustic features (the cepstrum);

(2) lexical features (speaker-dependent word n-grams and phoneme n-grams);

(3) prosodic features (pitch and energy "contours" described using n-grams);

(4) language, dialect and accent information;

(5) channel information (which channel was used); and so on.
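
As an illustration of the n-gram features in item (2), the sketch below counts n-grams over a token sequence. It assumes that a separate recognizer has already transcribed the speech into words or phonemes; the function name `ngram_counts` and the sample phoneme sequence are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens (words or phonemes)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical phoneme sequence for one speaker.
phonemes = ["n", "i", "h", "a", "o", "n", "i"]
bigrams = ngram_counts(phonemes, n=2)
```

Normalizing these counts into relative frequencies would give a speaker-dependent profile that can be compared between utterances.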

If the voice uttered by the user matches a voice in the voiceprint library, then the voiceprint and semantics of the user's utterance correspond to the voiceprint and voice prestored in the voiceprint library.

A voiceprint is unique, so comparing voiceprints makes it possible to determine whether the person currently speaking is the user in question. This prevents other people from impersonating or imitating the owner to control the terminal device and improves the security of terminal device control. In addition, comparing the semantics of the voice reveals the action that the user expects of the terminal device, so the terminal device can be controlled accurately and in line with the user's expectations.

As shown in Fig. 3, the flow for selecting a friend with the video-based speech recognition method of the embodiment of the present invention comprises the following steps:

Step S301: a sound is produced.

Step S302: the sound information is detected and noise-reduced to obtain clearer voice information.

Step S303: features are extracted from a given piece of voice information according to the voiceprint features. The task of feature extraction is to extract and select acoustic or language features of the speaker's voiceprint that are highly separable and stable. Unlike in speech recognition, the features used for voiceprint recognition must be "individual" features, whereas the features used for speech recognition must be features common to all speakers. Although most current voiceprint recognition systems use only acoustic features, the features that characterize an individual should be multifaceted.

Step S304: a voiceprint model is built through voiceprint registration; this is the basic step for voiceprint model matching.

Step S305: the features extracted from a given voice are passed directly to pattern matching against the voiceprint models for voiceprint verification and identification. Model matching and the voiceprint models work together. After model matching, the party to whom the video is to be sent can be confirmed.

For pattern recognition, there are the following broad classes of methods:

Template matching: dynamic time warping (DTW) is used to align the training and test feature sequences; it is mainly used for fixed-phrase applications (usually text-dependent tasks);

Nearest neighbor: all feature vectors are retained during training; during recognition, the K nearest training vectors are found for each vector and the decision is made accordingly; both model storage and the amount of similarity computation are usually large;

Neural networks: many forms exist, such as multilayer perceptrons and radial basis function (RBF) networks; they can be trained explicitly to distinguish a speaker from background speakers, but the training cost is large and the models generalize poorly;

Hidden Markov model (HMM) methods: usually a single-state HMM, that is, a Gaussian mixture model (GMM), is used; this is a popular method with relatively good results.
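
Of the methods listed, template matching is the easiest to show in a few lines. The sketch below is the textbook dynamic time warping recurrence applied to sequences of scalar features with an absolute-difference local cost. These simplifications are assumptions made for illustration: a real system would compare sequences of feature vectors, and the patent does not give an implementation. The name `dtw_distance` is hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] is the cheapest cumulative cost of aligning the first i
    items of seq_a with the first j items of seq_b. Each step may
    advance in seq_a, in seq_b, or in both.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_a advances alone
                cost[i][j - 1],      # seq_b advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same phrase spoken more slowly still aligns at zero cost.
template = [1, 2, 3]
slower = [1, 2, 2, 3]
distance = dtw_distance(template, slower)
```

Because the alignment absorbs differences in speaking rate, the distance reflects how the feature values differ rather than how long the utterance took, which is why the method suits fixed-phrase tasks.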

Step S306: once the correct recipient has been determined, the video is shared with that friend.

The method of the embodiment of the present invention is described below, taking as an example looking up a voice on the terminal device and sharing the video with the corresponding friend. A voiceprint library is built that stores the voiceprint information corresponding to each friend. Feature extraction then extracts and selects acoustic or language features of the speaker's voiceprint that are highly separable and stable. Unlike in speech recognition, the features used for voiceprint recognition must be "individual" features, whereas the features used for speech recognition must be features common to all speakers. Model matching is carried out from the standpoint of what can be modeled mathematically. After the voiceprint model is matched successfully, the person in the video is locked on; if the matched speaker is a friend in the voiceprint library, that speaker's identity information is read and identification is complete. Once the friend's identity information has been identified, the video can be shared with the relevant friend through an SNS social application. This technology solves the problem that users have to memorize many contacts and friends, that friends met only once and not well known are easily forgotten, and that a user who wants to share a video with the friends who appear in it may find, awkwardly, that he or she cannot recall their identity information. The technology is suitable for account systems and SNS-related products and could in future be used in projects such as Baidu Space.

It should be noted that the user has stored the friends' voiceprint information in the voiceprint library in advance.
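
How friends' voiceprints get into the library is not detailed in the description. One simple possibility, shown here only as an assumption, is to average several feature vectors from each friend into a single template at registration time and later identify a speaker by the nearest template. The class `VoiceprintLibrary`, its methods and the `max_distance` parameter are hypothetical.

```python
class VoiceprintLibrary:
    """Hypothetical voiceprint library keyed by friend identity."""

    def __init__(self):
        self._templates = {}

    def enroll(self, identity, feature_vectors):
        """Store the mean of the given feature vectors as the template."""
        if not feature_vectors:
            raise ValueError("at least one feature vector is required")
        count = len(feature_vectors)
        dim = len(feature_vectors[0])
        self._templates[identity] = [
            sum(vec[k] for vec in feature_vectors) / count for k in range(dim)
        ]

    def identify(self, features, max_distance=1.0):
        """Return the identity of the nearest template, or None if too far."""
        best_identity, best_dist = None, max_distance ** 2
        for identity, template in self._templates.items():
            dist = sum((f - t) ** 2 for f, t in zip(features, template))
            if dist <= best_dist:
                best_identity, best_dist = identity, dist
        return best_identity


friends = VoiceprintLibrary()
friends.enroll("friend_a", [[1.0, 0.0], [3.0, 0.0]])  # template [2.0, 0.0]
friends.enroll("friend_b", [[0.0, 5.0]])
speaker = friends.identify([2.1, 0.1])
```

Averaging several recordings per friend makes the template less sensitive to any single noisy sample, at the cost of ignoring how much that friend's voice varies.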

According to the mobile terminal control method of the embodiment of the present invention, the voice uttered by a user is matched against the voices prestored in the voiceprint library, and the terminal device is controlled after a successful match. The mobile terminal can therefore be controlled without other external equipment, the process is simple and easy to implement, and the method offers high ease of use and applicability. Moreover, controlling the mobile terminal with the voice information uttered by the user is hard for other people to imitate or impersonate, which gives higher security. In addition, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint library, noise reduction is performed on the collected voice signal, which makes the acquired voice signal clearer and makes it easier for the sender to analyze it and determine the correct people with whom to share the video.

In process flow diagram or any process of otherwise describing at this or method describe and can be understood to, represent to comprise that one or more is for realizing module, fragment or the part of code of executable instruction of the step of specific logical function or process, and the scope of the preferred embodiment of the present invention comprises other realization, wherein can be not according to order shown or that discuss, comprise according to related function by the mode of basic while or by contrary order, carry out function, this should be understood by embodiments of the invention person of ordinary skill in the field.

The logic and/or the step that in process flow diagram, represent or otherwise describe at this, for example, can be considered to for realizing the sequencing list of the executable instruction of logic function, may be embodied in any computer-readable medium, for instruction execution system, device or equipment (as computer based system, comprise that the system of processor or other can and carry out the system of instruction from instruction execution system, device or equipment instruction fetch), use, or use in conjunction with these instruction execution systems, device or equipment.With regard to this instructions, " computer-readable medium " can be anyly can comprise, storage, communication, propagation or transmission procedure be for instruction execution system, device or equipment or the device that uses in conjunction with these instruction execution systems, device or equipment.The example more specifically of computer-readable medium (non-exhaustive list) comprises following: the electrical connection section (electronic installation) with one or more wirings, portable computer diskette box (magnetic device), random-access memory (ram), ROM (read-only memory) (ROM), the erasable ROM (read-only memory) (EPROM or flash memory) of editing, fiber device, and portable optic disk ROM (read-only memory) (CDROM).In addition, computer-readable medium can be even paper or other the suitable medium that can print described program thereon, because can be for example by paper or other media be carried out to optical scanning, then edit, decipher or process in electronics mode and obtain described program with other suitable methods if desired, be then stored in computer memory.

Should be appreciated that each several part of the present invention can realize with hardware, software, firmware or their combination.In the above-described embodiment, a plurality of steps or method can realize with being stored in storer and by software or the firmware of suitable instruction execution system execution.For example, if realized with hardware, the same in another embodiment, can realize by any one in following technology well known in the art or their combination: have for data-signal being realized to the discrete logic of the logic gates of logic function, the special IC with suitable combinational logic gate circuit, programmable gate array (PGA), field programmable gate array (FPGA) etc.

Those skilled in the art are appreciated that realizing all or part of step that above-described embodiment method carries is to come the hardware that instruction is relevant to complete by program, described program can be stored in a kind of computer-readable recording medium, this program, when carrying out, comprises step of embodiment of the method one or a combination set of.

In addition, each functional unit in each embodiment of the present invention can be integrated in a processing module, can be also that the independent physics of unit exists, and also can be integrated in a module two or more unit.Above-mentioned integrated module both can adopt the form of hardware to realize, and also can adopt the form of software function module to realize.If described integrated module usings that the form of software function module realizes and during as production marketing independently or use, also can be stored in a computer read/write memory medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention without departing from its principles and spirit. The scope of the present invention is defined by the appended claims and their equivalents.

Claims

1. A speech recognition system based on video, characterized by comprising:

a terminal device, configured to record or receive a video and to collect a voice signal in the video;

a cloud server, configured to receive the voice signal from the terminal device, extract voiceprint information from the voice signal, and match the voiceprint information against the voiceprint information of a plurality of users in a pre-stored voiceprint library to obtain identity information of a sender of the voice signal, wherein the voiceprint library stores the identity information and the voiceprint information of the plurality of users, and wherein the voiceprint information corresponds one-to-one with the identity information; and

a social server, configured to receive the video and the identity information of the sender, look up, according to the identity information of the sender, an identity recognition number registered by the sender on the social server, and send the video to the corresponding sender of the voice signal according to the identity recognition number.

2. The speech recognition system based on video according to claim 1, characterized in that the voiceprint information comprises a plurality of voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.

3. The speech recognition system based on video according to claim 2, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.

4. The speech recognition system based on video according to claim 1, characterized in that the terminal device is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the cloud server.

5. The speech recognition system based on video according to claim 1, characterized in that the identity recognition number registered by the sender on the social server is an email address or an instant messaging ID.

6. A speech recognition method based on video, characterized by comprising the following steps:

a terminal device records or receives a video, collects a voice signal in the video, and sends the voice signal to a cloud server;

7. The speech recognition method based on video according to claim 6, characterized in that the voiceprint information comprises a plurality of voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.

8. The speech recognition method based on video according to claim 7, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.

9. The speech recognition method based on video according to claim 6, characterized in that, after the terminal device collects the voice signal in the video, the method further comprises the following step: performing noise reduction on the voice signal, and sending the noise-reduced voice signal to the cloud server.

10. The speech recognition method based on video according to claim 1, characterized in that the identity recognition number registered by the sender on the social server is an email address or an instant messaging ID.
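
As a non-limiting illustration only, the data flow recited in the claims (voiceprint matching on the cloud server, followed by lookup and delivery on the social server) might be sketched in Python as follows. The class names `VoiceprintLibrary` and `SocialServer`, the use of fixed-length feature vectors, the cosine-similarity score, and the 0.8 threshold are all assumptions made for this sketch; the claims do not specify a feature representation or a matching algorithm, and voiceprint extraction from audio is not shown.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Similarity score between two voiceprint vectors (assumed representation)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class VoiceprintLibrary:
    """Hypothetical pre-stored library: exactly one voiceprint per identity."""
    entries: Dict[str, List[float]] = field(default_factory=dict)

    def enroll(self, identity: str, voiceprint: List[float]) -> None:
        self.entries[identity] = voiceprint

    def match(self, voiceprint: List[float], threshold: float = 0.8) -> Optional[str]:
        """Return the best-matching identity, or None if no score reaches the threshold."""
        best_identity, best_score = None, threshold
        for identity, stored in self.entries.items():
            score = cosine_similarity(voiceprint, stored)
            if score >= best_score:
                best_identity, best_score = identity, score
        return best_identity


@dataclass
class SocialServer:
    """Hypothetical social server: maps identity information to a registered ID."""
    registered_ids: Dict[str, str]  # identity -> email address or instant messaging ID
    outbox: List[Tuple[str, str]] = field(default_factory=list)

    def share_video(self, video: str, identity: str) -> Optional[str]:
        """Look up the sender's registered ID and queue the video for delivery."""
        user_id = self.registered_ids.get(identity)
        if user_id is not None:
            self.outbox.append((user_id, video))
        return user_id


# Usage with made-up identities, vectors, and file name.
library = VoiceprintLibrary()
library.enroll("alice", [0.9, 0.1, 0.3])
library.enroll("bob", [0.1, 0.8, 0.5])
social = SocialServer({"alice": "alice@example.com", "bob": "bob_im_id"})

sender = library.match([0.88, 0.12, 0.31])  # voiceprint assumed already extracted
if sender is not None:
    social.share_video("party.mp4", sender)
```

In this sketch an unmatched voiceprint yields `None` and no video is sent; how an actual embodiment would handle that case is not stated in the claims.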