CN103680497B - Speech recognition system and method based on video - Google Patents

Speech recognition system and method based on video

Info

Publication number
CN103680497B
Authority
CN
China
Prior art keywords
video
voice signal
person
sending
voiceprint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210320742.3A
Other languages
Chinese (zh)
Other versions
CN103680497A (en)
Inventor
王玲珑
曹晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210320742.3A
Publication of CN103680497A
Application granted
Publication of CN103680497B
Legal status: Active
Anticipated expiration

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The present invention proposes a video-based speech recognition system, comprising: a terminal device for recording or receiving a video and collecting the voice signal in the video; a cloud server for receiving the voice signal from the terminal device, extracting the voiceprint information from the voice signal, and matching it against the voiceprint information of multiple users in a pre-stored voiceprint database to obtain the identity information of the speaker of the voice signal; and a social server for receiving the video and the speaker's identity information, looking up, according to that identity information, the identification number that the speaker registered with the social server, and sending the video to the corresponding speaker according to the identification number. The invention also discloses a video-based speech recognition method. By recognizing voiceprints, the invention learns the identity of a user and, once the identity has been matched, accurately shares information such as videos with that person.

Description

Speech recognition system and method based on video
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a video-based speech recognition system and method.
Background
Speech recognition technology is now widely used in daily life, and this raises a number of problems. One is how to apply it in account systems or SNS (social networking service) products so that videos and other information can be sent to or shared with other people efficiently and accurately. Current account systems and SNS products require the user to remember many contacts. As contacts accumulate over time, it is easy to forget friends one has met but does not know well, and when a user wants to share a video with the friends who appear in it, the user may find, awkwardly, that he or she cannot recall who they are. At present these problems can only be solved by the user's own memory and manual analysis, which is inefficient and inaccurate.
Summary of the invention
The present invention aims to solve at least one of the above technical problems.
To this end, one object of the present invention is to propose a video-based speech recognition system that can conveniently and accurately identify the users in a video by means of speech recognition. Another object of the present invention is to propose a video-based speech recognition method.
To achieve these objects, an embodiment of the first aspect of the present invention provides a video-based speech recognition system, comprising: a terminal device for recording or receiving a video and collecting the voice signal in the video; a cloud server for receiving the voice signal from the terminal device, extracting the voiceprint information from the voice signal, and matching the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint database to obtain the identity information of the speaker of the voice signal, wherein the voiceprint database stores the identity information and voiceprint information of multiple users, and the voiceprint information corresponds one-to-one with the identity information; and a social server for receiving the video and the identity information of the speaker, looking up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sending the video to the speaker of the corresponding voice signal according to the identification number.
The video-based speech recognition system according to embodiments of the present invention matches the speech uttered by a user against the speech pre-stored in the voiceprint database. After a successful match, the user confirms the selection, and information such as the video is shared with the other party. Selection and control of the terminal device can thus be achieved without any other external equipment. The process is accurate and easy to implement, and offers high accuracy, ease of use, and applicability.
In one embodiment of the present invention, the voiceprint information includes multiple voiceprint features, namely acoustic features, lexical features, prosodic features, language features, and channel features.
In another embodiment of the present invention, the language features include one or more of language, dialect, and accent features.
Thus, the cloud server can match the speech from the terminal device using a diverse set of voiceprint features, taking as many language characteristics into account as possible, which makes it easier to identify the speaker.
In one embodiment of the present invention, the terminal device is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the cloud server.
As a result, the voice signal obtained is clearer, which makes it easier to confirm and control the user's voice information.
In another embodiment of the invention, the identification number that the speaker registered with the social server is an e-mail address or an instant-messaging ID.
Thus, through the e-mail address or instant-messaging ID used at registration, more identity information about the speaker can easily be obtained, so that the video can be sent to the speaker. This also helps ensure the accuracy and security of the system.
An embodiment of the second aspect of the present invention proposes a video-based speech recognition method, comprising the following steps: a terminal device records or receives a video, collects the voice signal in the video, and sends the voice signal to a cloud server;
the cloud server receives the voice signal, extracts the voiceprint information from the voice signal, and matches the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint database to obtain the identity information of the speaker of the voice signal, wherein the voiceprint database stores the identity information and voiceprint information of multiple users, and the voiceprint information corresponds one-to-one with the identity information; and
a social server receives the video and the identity information of the speaker of the voice signal, looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sends the video to the speaker of the corresponding voice signal according to the identification number.
The video-based speech recognition method according to embodiments of the present invention matches the speech uttered by a user against the speech pre-stored in the voiceprint database. After a successful match, the user confirms the selection, and information such as the video is shared with the other party. Selection and control of the terminal device can thus be achieved without any other external equipment. The process is accurate and easy to implement, and offers high accuracy, ease of use, and applicability.
In one embodiment of the present invention, the voiceprint information includes multiple voiceprint features, namely acoustic features, lexical features, prosodic features, language features, and channel features.
In another embodiment of the present invention, the language features include one or more of language, dialect, and accent features.
Thus, the cloud server can match the speech from the terminal device using a diverse set of voiceprint features, taking as many language characteristics into account as possible, which makes it easier to identify the speaker.
In yet another embodiment of the present invention, after the terminal device collects the voice signal in the video, the method further comprises the following step: performing noise reduction on the voice signal and sending the noise-reduced voice signal to the cloud server.
As a result, the voice signal obtained is clearer, which makes it easier to confirm and control the user's voice information.
In one embodiment of the present invention, the identification number that the speaker registered with the social server is an e-mail address or an instant-messaging ID.
Thus, the identification number is obtained from the e-mail address or instant-messaging ID that the speaker registered, which provides several routes to the speaker's identity information, so that the video can be sent to the speaker. This also helps ensure the accuracy and security of the system.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a structural diagram of the video-based speech recognition system according to one embodiment of the present invention;
Fig. 2 is a flowchart of the video-based speech recognition method according to one embodiment of the present invention; and
Fig. 3 is a flowchart of a user selecting a friend using the video-based speech recognition method according to one embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and should not be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances. In addition, in the description of the present invention, "multiple" means two or more, unless otherwise stated.
The video-based speech recognition system of the embodiments of the present invention is described below with reference to Fig. 1.
As shown in Fig. 1, the video-based speech recognition system 1000 of the embodiment of the present invention includes a terminal device 100, a cloud server 200, and a social server 300.
The terminal device 100 can record or receive a video and collect the voice signal in the video.
In one embodiment of the present invention, the terminal device may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a PC (personal computer), or a camera with a communication function. The terminal device may record an audio or video clip itself, or receive an audio or video clip from a network or by other means.
Because a video recorded or received by the terminal device often contains a considerable amount of noise, which hinders analysis of the voice signal in the video, noise reduction needs to be performed on the voice signal.
In one embodiment of the present invention, after the terminal device 100 collects the voice signal in the video, the terminal device 100 further performs noise reduction on the voice signal and sends the noise-reduced voice signal to the cloud server 200. As a result, the voice signal obtained is clearer, which makes it easier to confirm and control the user's voice information.
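The text does not say which noise-reduction technique the terminal device applies. As a minimal sketch only, the following Python code shows one simple possibility, an energy-based noise gate. The function names, the frame length, and the assumption that the first frames contain only background noise are all hypothetical and are not taken from the patent.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def noise_gate(samples, frame_len=4, noise_frames=2, factor=2.0):
    """Silence frames whose RMS falls below factor * estimated noise floor.

    Assumption (hypothetical): the first `noise_frames` frames contain
    only background noise and can be used to estimate the noise floor.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    floor = sum(frame_rms(f) for f in frames[:noise_frames]) / noise_frames
    threshold = factor * floor
    cleaned = []
    for f in frames:
        if frame_rms(f) >= threshold:
            cleaned.extend(f)                 # keep frames that look like speech
        else:
            cleaned.extend([0.0] * len(f))    # mute frames that look like noise
    return cleaned
```

A production system would more likely use spectral methods. The gate is shown only because it fits in a few lines and makes the "denoise, then upload" order of operations concrete.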
The cloud server 200 receives the voice signal from the terminal device 100 and extracts the voiceprint information from the voice signal. The voiceprint information includes multiple voiceprint features: acoustic features, lexical features, prosodic features, language features, and channel features.
Each of these voiceprint features is described below.
(1) Acoustic features, such as the cepstrum. The cepstrum is the new spectrum obtained by applying a Fourier transform to the logarithm of the signal spectrum;
(2) Lexical features, such as speaker-dependent word n-grams and phoneme n-grams;
(3) Prosodic features, such as pitch and energy "gestures" described using n-grams;
(4) Language features, which in turn include one or more of language, dialect, and accent features. Thus, the cloud server can match the speech from the terminal device using a diverse set of voiceprint features, taking as many language characteristics into account as possible, which makes it easier to identify the speaker;
(5) Channel information, for example which kind of channel was used.
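As an illustration of the acoustic feature in item (1), the sketch below computes a real cepstrum for one short frame in plain Python. It follows the common convention of applying an inverse Fourier transform to the log-magnitude spectrum, which for a real signal differs from a forward transform only by scaling and index order. The function names and the `eps` guard are assumptions made for this example, and the naive O(n²) transform is suitable only for small frames.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def real_cepstrum(frame, eps=1e-12):
    """Real cepstrum: inverse transform of the log-magnitude spectrum.

    `eps` (an assumption of this sketch) avoids log(0) for empty bins.
    """
    spectrum = dft(frame)
    log_magnitude = [math.log(abs(c) + eps) for c in spectrum]
    return [c.real for c in idft(log_magnitude)]
```

In practice, systems usually keep only the first few cepstral coefficients of each frame as the feature vector; the patent does not specify how many, so none is fixed here.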
After extracting the voiceprint information from the voice signal, the cloud server 200 matches it against the voiceprint information of multiple users in the pre-stored voiceprint database to obtain the identity information of the speaker of the voice signal.
The voiceprint database stores the identity information and voiceprint information of multiple users, with a one-to-one correspondence between voiceprint information and identity information. Because voiceprints are unique, the cloud server 200 can determine, by comparing voiceprints, whether the user currently speaking is a given user.
Specifically, a video may contain multiple voice signals, each produced by a different speaker. Because every speaker's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the pre-stored voiceprint information of multiple users reveals which user produced that voice signal, that is, who its speaker is.
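The matching step can be sketched as a nearest-match search over the voiceprint database. The text does not name a similarity measure, so the cosine similarity, the 0.8 threshold, and the representation of a voiceprint as a plain list of numbers are all assumptions made for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(feature, voiceprint_db, threshold=0.8):
    """Return the identity whose enrolled voiceprint best matches `feature`.

    `voiceprint_db` maps identity -> enrolled feature vector (one-to-one,
    as the text requires). Returns None when no score reaches `threshold`.
    """
    best_identity, best_score = None, threshold
    for identity, enrolled in voiceprint_db.items():
        score = cosine(feature, enrolled)
        if score >= best_score:
            best_identity, best_score = identity, score
    return best_identity
```

For a video with several voice signals, `identify` would be called once per extracted voiceprint feature, giving one identity (or no match) per speaker.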
The social server 300 receives the video from the terminal device 100 and the identity information of the speaker from the cloud server 200, and looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server 300.
In one embodiment of the present invention, the identification number that the speaker registered with the social server 300 may be an e-mail address or an instant-messaging ID.
The social server 300 sends the video to the speaker of the corresponding voice signal according to this identification number. Thus, through the e-mail address or instant-messaging ID used at registration, more identity information about the speaker can easily be obtained, so that the video can be sent to the speaker. This also helps ensure the accuracy and security of the system.
The video-based speech recognition system 1000 of the present invention is described below using an example in which the video contains three users.
User U records a video S with the terminal device 100. The terminal device 100 collects the voice signal V in the video S, performs noise reduction on the collected voice signal V, and then sends the noise-reduced voice signal V to the cloud server 200.
After receiving the voice signal V, the cloud server 200 extracts the voiceprint information from it. The voiceprint information contains three different voiceprint features, A, B, and C. The cloud server 200 matches this voiceprint information against the voiceprint information of multiple users in the pre-stored voiceprint database and obtains the following result: voiceprint feature A corresponds to user M, voiceprint feature B corresponds to user N, and voiceprint feature C corresponds to user W. It follows that the voice signal with voiceprint feature A was produced by user M, the voice signal with voiceprint feature B by user N, and the voice signal with voiceprint feature C by user W.
The cloud server 200 also stores the identity information of each user. The cloud server 200 sends the above matching result and the identity information of the corresponding users M, N, and W to the social server 300.
According to the received identity information of users M, N, and W, the social server 300 looks up the identification numbers they registered with the social server 300 and then sends the video to users M, N, and W.
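The lookup-and-send behaviour of the social server in this example can be sketched as follows. The registry contents, the `send` callback, and the decision to skip identities with no registered account are hypothetical choices made for illustration; the text only says that the identification number may be an e-mail address or an instant-messaging ID.

```python
def dispatch_video(video, identities, registry, send):
    """Send `video` to every identified speaker who has a registered account.

    `registry` maps identity information -> identification number registered
    with the social server. `send(account, video)` is a caller-supplied
    delivery function (hypothetical). Returns the accounts that were sent to.
    """
    delivered = []
    for identity in identities:
        account = registry.get(identity)
        if account is None:
            continue              # speaker is not registered: nothing to send to
        send(account, video)
        delivered.append(account)
    return delivered


# Usage with the three users from the example; all addresses are made up.
registry = {"M": "m@example.com", "N": "im:n_1234", "W": "w@example.com"}
outbox = []
dispatch_video("video_S", ["M", "N", "W"], registry,
               lambda account, video: outbox.append((account, video)))
```

Passing the delivery function in as a parameter keeps the sketch independent of any particular mail or messaging service.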
The video-based speech recognition system according to embodiments of the present invention matches the speech uttered by a user against the speech pre-stored in the voiceprint database. After a successful match, the user confirms the selection, and information such as the video is shared with the other party. Selection and control of the terminal device can thus be achieved without any other external equipment. The process is accurate and easy to implement, and offers high accuracy, ease of use, and applicability. In addition, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint database, noise reduction is performed after the terminal device 100 collects the voice signal, so that the resulting voice signal is clearer.
As shown in Fig. 2, the video-based speech recognition method of the embodiment of the present invention comprises the following steps:
Step S201: the terminal device records or receives a video, collects the voice signal in the video, and sends the voice signal to the cloud server. The terminal device here may be a device with a mobile communication function, such as a mobile terminal or a tablet computer, for example a mobile phone, an iPad, a PC (personal computer), or a camera with a communication function. The terminal device may record an audio or video clip itself, or receive an audio or video clip from a network or by other means. Because a video recorded or received by the terminal device often contains a considerable amount of noise, which hinders analysis of the voice signal in the video, noise reduction needs to be performed on the voice signal, and the noise-reduced voice signal is sent to the cloud server. As a result, the voice signal obtained is clearer, which makes it easier to confirm and control the user's voice information.
Step S202: the cloud server receives the voice signal, extracts the voiceprint information from it, and matches the voiceprint information against the voiceprint information of multiple users in the pre-stored voiceprint database to obtain the identity information of the speaker of the voice signal. Specifically, a video may contain multiple voice signals, each produced by a different speaker. Because every speaker's voiceprint information is different, matching the voiceprint features extracted from a voice signal against the voiceprint information of multiple users in the pre-stored voiceprint database reveals which user produced that voice signal, that is, who its speaker is.
Step S203: the social server receives the video and the identity information of the speaker of the voice signal, looks up, according to the identity information of the speaker, the identification number that the speaker registered with the social server, and sends the video to the speaker of the corresponding voice signal according to the identification number. The identification number may be the e-mail address or instant-messaging ID used at registration. In this way, more identity information about the speaker can easily be obtained, so that the video can be sent to the speaker. This also helps ensure the accuracy and security of the system.
Because the production of human speech is a complex physiological and physical process involving the language centers of the body and the vocal organs, the voiceprint spectrograms of any two people differ. Each person's acoustic speech features are relatively stable, but they also vary; they are not absolute or immutable. By making full use of the uniqueness of each person's vocal organs as an identification password, users can use the system anytime and anywhere, more conveniently and more naturally.
From the standpoint of what can be modeled mathematically, the features that automatic voiceprint recognition models can currently use include:
(1) Acoustic features (cepstrum);
(2) Lexical features (speaker-dependent word n-grams, phoneme n-grams);
(3) Prosodic features (pitch and energy "gestures" described using n-grams);
(4) Language, dialect, and accent information;
(5) Channel information (which kind of channel is used); and so on.
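The word and phoneme n-grams mentioned in item (2) can be illustrated with a short sketch that turns a token sequence into relative n-gram frequencies. The function names and the choice of relative frequency as the profile value are assumptions made for this example; the text does not say how the n-gram statistics are represented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples in `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_profile(tokens, n=2):
    """Relative frequency of each n-gram: a simple speaker-dependent profile."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()} if total else {}
```

The same functions work for word tokens and for phoneme tokens, since they only require a sequence of hashable items.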
If the speech uttered by the user matches speech in the voiceprint database, then the voiceprint and semantics of the user's speech correspond to the voiceprint and speech pre-stored in the voiceprint database.
Voiceprints are unique. By comparing voiceprints, it is possible to determine whether the user currently speaking is the user in question, which prevents other people from impersonating or imitating the owner to control the terminal device and improves the security of terminal device control. In addition, by comparing the semantics of the speech, the action that the user wants the terminal device to perform can be determined, so that the terminal device can be controlled accurately and in line with the user's expectations.
As shown in Fig. 3, the flow in which a friend is selected using the video-based speech recognition method of the embodiment of the present invention comprises the following steps:
Step S301: a sound is produced.
Step S302: the sound information is detected and noise-reduced to obtain clearer voice information.
Step S303: feature extraction is performed on a given piece of voice information according to the voiceprint features. The task of feature extraction is to extract and select acoustic or language features of the speaker's voiceprint that are highly separable and highly stable. Unlike speech recognition, voiceprint recognition requires features that are "individual" to the speaker, whereas speech recognition requires features that are "common" across speakers. Although most current voiceprint recognition systems use acoustic-level features, the features that characterize an individual are multi-layered.
Step S304: a voiceprint model is obtained through voiceprint enrollment. This is the basic step for matching against voiceprint models.
Step S305: after feature extraction on the voice information of a given voice, voiceprint verification and voiceprint identification are performed directly, leading to pattern matching. Pattern matching works together with the voiceprint model. After pattern matching, the sending target can be confirmed and sending can be carried out.
For pattern recognition, there are the following major classes of methods:
Template matching: dynamic time warping (DTW) is used to align the training and test feature sequences; this is mainly used in fixed-phrase applications (usually text-dependent tasks);
Nearest neighbor: all feature vectors are retained during training; during recognition, the K nearest training vectors are found for each vector and recognition is based on them; the model storage and the amount of similarity computation are usually both very large;
Neural networks: these take many forms, such as multilayer perceptrons and radial basis function (RBF) networks; they can be trained explicitly to distinguish a speaker from background speakers, but the training cost is high and the models generalize poorly;
Hidden Markov models (HMM): a single-state HMM, or a Gaussian mixture model (GMM), is typically used; this is the currently popular approach and gives relatively good results;
Step S306: the correct recipient of the information is determined, and the video is shared with that friend.
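Of the pattern recognition methods listed above, template matching with dynamic time warping is the easiest to show in a few lines. The sketch below computes the classic DTW distance between two one-dimensional feature sequences. Using scalar features with an absolute-difference local cost is a simplification chosen for this example; real systems would compare multi-dimensional feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the minimum accumulated cost of aligning the first i
    elements of `a` with the first j elements of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because the alignment may stretch either sequence, the same phrase spoken at different speeds can still produce a small distance, which is why the method suits fixed-phrase tasks.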
The method of the embodiment of the present invention is described below using the example of sharing a video with a friend. First, a voiceprint database is established that stores the voiceprint information of each friend. Then, through feature extraction, acoustic or language features of the speaker's voiceprint that are highly separable and highly stable are extracted and selected. Unlike speech recognition, voiceprint recognition requires features that are "individual" to the speaker, whereas speech recognition requires features that are "common" across speakers. Pattern matching is considered from the standpoint of what can be modeled mathematically. After a successful match against the voiceprint models, the person in the video is locked on to; if the matched speaker is a friend in the voiceprint database, the speaker's identity information is read and recognition is complete. Once the friend's identity information has been recognized, the video can be shared with that friend through an SNS social networking application. This technique solves the problem that users must remember many contacts, that as contacts accumulate over time it is easy to forget friends one has met but does not know well, and that a user who wants to share a video with the friends who appear in it may find, awkwardly, that he or she cannot recall who they are. The technique is applicable to account systems or SNS products and may in future be used in projects such as Baidu Space.
It should be noted that the user stores the friends' voiceprint information in the voiceprint database in advance.
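Storing a friend's voiceprint in advance amounts to an enrollment step. How the stored voiceprint is built is not specified; the sketch below assumes, purely for illustration, that each friend contributes several fixed-length feature vectors and that their element-wise mean is stored under the friend's identity.

```python
def enroll(voiceprint_db, identity, feature_vectors):
    """Store the element-wise mean of `feature_vectors` as the voiceprint.

    `voiceprint_db` is a plain dict mapping identity -> voiceprint vector.
    All vectors are assumed (hypothetically) to have the same length.
    """
    count = len(feature_vectors)
    dim = len(feature_vectors[0])
    mean = [sum(vec[i] for vec in feature_vectors) / count for i in range(dim)]
    voiceprint_db[identity] = mean
    return mean
```

Averaging several recordings is one simple way to reduce the effect of any single noisy sample; other enrollment schemes are equally compatible with the text.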
The control method of the mobile terminal according to embodiments of the present invention matches the speech uttered by the user against the speech pre-stored in the voiceprint database and controls the terminal device after a successful match. Control of the mobile terminal can thus be achieved without any other external equipment. The process is simple and easy to implement, and offers high ease of use and applicability. Moreover, because the mobile terminal is controlled by a voice message uttered by the user, it is difficult for others to imitate or impersonate the user, which provides high security. In addition, to ensure that the voice information is matched efficiently against the voiceprint information in the voiceprint database, noise reduction is performed on the collected voice signal so that the resulting voice signal is clearer, which makes it easier to analyze the sender and identify the correct recipients of the shared video.
In flow chart or here any process described otherwise above or method description are construed as, expression includes One or more for realizing specific logical function or process the step of the module of code of executable instruction, fragment or portion Point, and the scope of the preferred embodiment of the present invention includes other realization, can not wherein press the suitable of shown or discussion Sequence, including according to involved function by basic simultaneously in the way of or in the opposite order, carry out perform function, this should be of the invention Embodiment person of ordinary skill in the field understood.
Represent in flow charts or here logic described otherwise above and/or step, for example, it is possible to be considered as to use In the order list of the executable instruction for realizing logic function, may be embodied in any computer-readable medium, for Instruction execution system, device or equipment(As computer based system, include processor system or other can hold from instruction Row system, device or equipment instruction fetch the system of execute instruction)Use, or with reference to these instruction execution systems, device or set Standby and use.For the purpose of this specification, " computer-readable medium " can any can be included, store, communicate, propagate or pass The dress that defeated program is used for instruction execution system, device or equipment or with reference to these instruction execution systems, device or equipment Put.The more specifically example of computer-readable medium(Non-exhaustive list)Including following:There is the electricity of one or more wirings Connecting portion(Electronic installation), portable computer diskette box(Magnetic device), random access memory(RAM), read only memory (ROM), erasable edit read-only storage(EPROM or flash memory), fiber device, and portable optic disk is read-only deposits Reservoir(CDROM).In addition, computer-readable medium can even is that the paper that can print described program thereon or other are suitable Medium, because for example by carrying out optical scanning to paper or other media edlin, interpretation can then be entered or if necessary with which His suitable method is processed to electronically obtain described program, is then stored in computer storage.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiment methods may be completed by a program that instructs the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and is sold or used as a standalone product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention without departing from its principles and purpose. The scope of the present invention is defined by the claims and their equivalents.

Claims (10)

1. A speech recognition system based on video, characterized by comprising:
a terminal unit, configured to record or receive a video and to collect a voice signal from the video;
a cloud server, configured to receive the voice signal from the terminal unit, extract voiceprint information from the voice signal, and match the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain identity information of the sender of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the multiple users, and the voiceprint information corresponds one-to-one with the identity information; and
a social server, configured to receive the video and the identity information of the sender, look up, according to the identity information of the sender, the identification number that the sender has registered on the social server, and send the video to the corresponding sender of the voice signal according to the identification number.
2. The speech recognition system based on video according to claim 1, characterized in that the voiceprint information comprises multiple voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.
3. The speech recognition system based on video according to claim 2, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.
4. The speech recognition system based on video according to claim 1, characterized in that the terminal unit is further configured to perform noise reduction on the collected voice signal and to send the noise-reduced voice signal to the cloud server.
5. The speech recognition system based on video according to claim 1, characterized in that the identification number registered by the sender on the social server is an e-mail address or an instant chat ID.
6. A speech recognition method based on video, characterized by comprising the following steps:
a terminal unit records or receives a video, collects a voice signal from the video, and sends the voice signal to a cloud server;
the cloud server receives the voice signal, extracts voiceprint information from the voice signal, and matches the voiceprint information against the voiceprint information of multiple users in a pre-stored voiceprint library to obtain identity information of the sender of the voice signal, wherein the voiceprint library stores the identity information and voiceprint information of the multiple users, and the voiceprint information corresponds one-to-one with the identity information; and
a social server receives the video and the identity information of the sender of the voice signal, looks up, according to the identity information of the sender, the identification number that the sender has registered on the social server, and sends the video to the corresponding sender of the voice signal according to the identification number.
7. The speech recognition method based on video according to claim 6, characterized in that the voiceprint information comprises multiple voiceprint features, wherein the voiceprint features comprise acoustic features, lexical features, prosodic features, language features, and channel features.
8. The speech recognition method based on video according to claim 7, characterized in that the language features comprise one or more of language-type features, dialect features, and accent features.
9. The speech recognition method based on video according to claim 6, characterized in that, after collecting the voice signal from the video, the terminal unit further performs the following step: performing noise reduction on the voice signal and sending the noise-reduced voice signal to the cloud server.
10. The speech recognition method based on video according to claim 6, characterized in that the identification number registered by the sender on the social server is an e-mail address or an instant chat ID.
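
The matching and delivery steps recited in the claims above can be illustrated with a minimal sketch. It is not the patented implementation: the claims do not say how voiceprints are compared, so the cosine-similarity score, the 0.8 threshold, the fixed-length feature vector, and every name below (VoiceprintEntry, match_voiceprint, share_video, the send callback) are assumptions made for illustration only. Voiceprint extraction and noise reduction are assumed to have happened already.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class VoiceprintEntry:
    """One record of the voiceprint library: identity plus its voiceprint."""
    identity: str          # identity information of a registered user
    features: List[float]  # hypothetical fixed-length voiceprint vector


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_voiceprint(features: List[float],
                     library: List[VoiceprintEntry],
                     threshold: float = 0.8) -> Optional[str]:
    """Cloud-server step: return the identity of the best match, if any.

    The similarity measure and threshold are assumptions, not from the patent.
    """
    best_identity: Optional[str] = None
    best_score = threshold
    for entry in library:
        score = cosine_similarity(features, entry.features)
        if score >= best_score:
            best_identity, best_score = entry.identity, score
    return best_identity


def share_video(video_id: str,
                features: List[float],
                library: List[VoiceprintEntry],
                social_registry: Dict[str, str],
                send: Callable[[str, str], None]) -> Optional[str]:
    """Social-server step: look up the registered ID and send the video.

    social_registry maps identity information to an identification number
    (for example an e-mail address or an instant chat ID). Returns the ID
    the video was sent to, or None if no match or no registration exists.
    """
    identity = match_voiceprint(features, library)
    if identity is None:
        return None
    account = social_registry.get(identity)
    if account is None:
        return None
    send(account, video_id)
    return account
```

In this sketch a video is delivered only when the voiceprint matches a library entry and that identity has a registered identification number; otherwise nothing is sent. A real system would replace the similarity function with an actual speaker-recognition model.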
CN201210320742.3A 2012-08-31 2012-08-31 Speech recognition system and method based on video Active CN103680497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210320742.3A CN103680497B (en) 2012-08-31 2012-08-31 Speech recognition system and method based on video

Publications (2)

Publication Number Publication Date
CN103680497A (en) 2014-03-26
CN103680497B (en) 2017-03-15

Family

ID=50317851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210320742.3A Active CN103680497B (en) 2012-08-31 2012-08-31 Speech recognition system and method based on video

Country Status (1)

Country Link
CN (1) CN103680497B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9837068B2 (en) * 2014-10-22 2017-12-05 Qualcomm Incorporated Sound sample verification for generating sound detection model
CN104899313A (en) * 2015-06-16 2015-09-09 陕西师范大学 Roll call method and system
CN106486117A (en) * 2015-08-27 2017-03-08 中兴通讯股份有限公司 A kind of sharing files method and device
CN105376143B (en) * 2015-12-03 2019-05-07 小天才科技有限公司 A kind of method and device identifying identity of the sender
CN107333090B (en) * 2016-04-29 2020-04-07 中国电信股份有限公司 Video conference data processing method and platform
CN107767864B (en) * 2016-08-23 2021-06-29 阿里巴巴集团控股有限公司 Method and device for sharing information based on voice and mobile terminal
CN107610704A (en) * 2017-09-29 2018-01-19 珠海市领创智能物联网研究院有限公司 A kind of speech recognition system for smart home
CN108960836B (en) * 2017-12-27 2021-09-14 北京猎户星空科技有限公司 Voice payment method, device and system
CN110010135B (en) * 2018-01-05 2024-05-07 北京搜狗科技发展有限公司 Speech-based identity recognition method and device and electronic equipment
CN108257605B (en) * 2018-02-01 2021-05-04 Oppo广东移动通信有限公司 Multi-channel recording method and device and electronic equipment
CN108694952B (en) * 2018-04-09 2020-04-28 平安科技(深圳)有限公司 Electronic device, identity authentication method and storage medium
CN110399524A (en) * 2018-04-19 2019-11-01 陈伯豪 Mobile device, server and the system of language learning information are provided according to video or the sound of audio
CN110428824A (en) * 2018-04-28 2019-11-08 深圳市冠旭电子股份有限公司 A kind of exchange method of intelligent sound box, device and intelligent sound box
CN109256136B (en) * 2018-08-31 2021-09-17 三星电子(中国)研发中心 Voice recognition method and device
JP7175696B2 (en) * 2018-09-28 2022-11-21 キヤノン株式会社 IMAGE PROCESSING SYSTEM, IMAGE PROCESSING APPARATUS, AND CONTROL METHOD THEREOF
CN109544745A (en) * 2018-11-20 2019-03-29 北京千丁互联科技有限公司 A kind of intelligent door lock control method, apparatus and system
CN111081080B (en) * 2019-05-29 2022-05-03 广东小天才科技有限公司 Voice detection method and learning device
CN111048072A (en) * 2019-11-21 2020-04-21 中国南方电网有限责任公司 Voiceprint recognition method applied to power enterprises
CN111191074A (en) * 2019-12-10 2020-05-22 秒针信息技术有限公司 Member information query method and system based on voiceprint recognition
CN111554303B (en) * 2020-05-09 2023-06-02 福建星网视易信息系统有限公司 User identity recognition method and storage medium in song singing process

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1567431A (en) * 2003-07-10 2005-01-19 上海优浪信息科技有限公司 Method and system for identifying status of speaker
CN102413101A (en) * 2010-09-25 2012-04-11 盛乐信息技术(上海)有限公司 Voice-print authentication system having voice-print password voice prompting function and realization method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09294172A (en) * 1996-04-26 1997-11-11 Hitachi Ltd Voice transmitter
JP2007241130A (en) * 2006-03-10 2007-09-20 Matsushita Electric Ind Co Ltd System and device using voiceprint recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant