CN109031201A - Voice localization method and device based on behavior recognition - Google Patents

Voice localization method and device based on behavior recognition

Info

Publication number
CN109031201A
Authority
CN
China
Prior art keywords
speech
speaker
video signal
venue
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810557504.1A
Other languages
Chinese (zh)
Inventor
卢启伟
杨宁
刘胜强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Eaglesoul Technology Co Ltd
Original Assignee
Shenzhen Eaglesoul Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Eaglesoul Technology Co Ltd
Priority to CN201810557504.1A
Priority to PCT/CN2018/092791 (WO2019227552A1)
Publication of CN109031201A
Legal status: Pending

Classifications

    • G - PHYSICS
        • G01 - MEASURING; TESTING
            • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
                • G01S 5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; position-fixing by co-ordinating two or more distance determinations
                    • G01S 5/18 - using ultrasonic, sonic, or infrasonic waves
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 - Image analysis
                    • G06T 7/20 - Analysis of motion
                        • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
                    • G06T 7/70 - Determining position or orientation of objects or cameras
                        • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
                • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
                    • G06T 2207/10 - Image acquisition modality
                        • G06T 2207/10016 - Video; Image sequence
                    • G06T 2207/30 - Subject of image; Context of image processing
                        • G06T 2207/30196 - Human being; Person
        • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L 25/48 - specially adapted for particular use
                        • G10L 25/51 - for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Telephone Function (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The present disclosure relates to a voice localization method and device based on behavior recognition, and to a corresponding electronic device and storage medium. The method includes: when a specific voice signal is received, obtaining the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to that time information; analyzing N user behavior features in the video signal and matching the N user behavior features against preset standard behavior features; if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal who corresponds to that speaking behavior feature as the speaker; and analyzing the video signal to determine the speaker's position in the venue, and controlling an audio/video capture device to aim at the speaker's speaking position in the venue. The disclosure localizes the speaker's position by recognizing and matching the speaker's behavior features.

Description

Voice localization method and device based on behavior recognition
Technical field
The present disclosure relates to the field of computer technology, and in particular to a voice localization method and device based on behavior recognition, an electronic device, and a computer-readable storage medium.
Background
In settings such as meetings or teaching, quickly locating the speaker allows the corresponding audio and/or video capture device to aim at the speaker automatically, improving the quality of the captured audio and/or video.
However, existing approaches based on the speaker's facial features are demanding: they require the speaker's face to differ substantially from the faces of the other participants, and they also place high demands on the hardware of the video capture device. Approaches based on multi-microphone localization, or on a dedicated speaking system for the speaker, require a large amount of additional auxiliary equipment, which increases configuration and operating costs.
In the prior art, application No. CN201611131001.5 discloses a voice localization method, device, and system. The method includes: receiving voice information through multiple microphones and judging whether the voice information contains a first keyword voice; if it does, recording the localization information with which each microphone received the first keyword voice; and calculating the position of the sound source that produced the first keyword voice from the position coordinates of each microphone and the localization information. With that voice localization method, device, and system, in a multi-person meeting or other voice recognition setting, the speaker only needs to utter the keyword voice for the speaker's direction to be located immediately, so that directional sound pickup is achieved and pickup quality is improved.
Application No. CN201610304047.6 discloses an image-assisted voice localization and enhancement system and method. The localization system includes an image recognition and tracking subsystem and a voice localization and enhancement subsystem. The image recognition and tracking subsystem includes: a camera for capturing image sequences; and an image recognition and tracking unit for recognizing a person and caching the three-dimensional coordinates of the person's face, waking up the voice localization and enhancement subsystem when it recognizes a first predefined action performed by the person and sending the face's three-dimensional coordinates, and tracking the person and sending updated three-dimensional face coordinates. The voice localization and enhancement subsystem includes: a microphone array for collecting voice information; and a voice localization and enhancement unit for controlling the microphone array to focus directionally on the person according to a spatial filtering algorithm and the received three-dimensional face coordinates, collecting the person's voice information, and localizing the person according to the collected voice information.
Application No. CN201510066532.X discloses a branch-processing array voice localization and enhancement method, covering the design of the basic structure of a generalized sidelobe canceller, its blocking matrix, the design of its component filters, and an external Wiener filtering part. The method draws on a component structure, adopts a partially adaptive technique, and adds a post-positioned Wiener filter to guarantee the denoising performance of the algorithm; it effectively suppresses both incoherent and coherent noise, accelerates the convergence of the algorithm, and reduces computational complexity. Compared with a microphone-array speech enhancement system based on a traditional generalized sidelobe canceller, the improved speech enhancement system achieves a higher output signal-to-noise ratio, and simulation tests show that, relative to a speech enhancement system based on a fully adaptive generalized sidelobe canceller, the method has a higher output signal-to-noise ratio.
None of the above methods, whether locating the speaker by recognizing facial features or by multi-microphone localization, solves the problem of locating the speaker's position simply and reliably without depending on excessive auxiliary equipment.
Accordingly, there is a need to provide one or more technical solutions that can at least solve the above problems.
It should be noted that the information disclosed in the Background section above is only intended to reinforce understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Summary of the invention
It is an object of the present disclosure to provide a voice localization method based on behavior recognition, together with a corresponding device, electronic device, and computer-readable storage medium, thereby overcoming, at least to some extent, one or more problems caused by the limitations and defects of the related art.
According to one aspect of the present disclosure, a voice localization method based on behavior recognition is provided, comprising:
a signal acquisition step: when a specific voice signal is received, obtaining the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching step: analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination step: if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker; and
a speaking position localization step: analyzing the position of the speaker in the venue from the video signal, determining the speaking position of the speaker in the venue, and controlling an audio/video capture device to aim at the speaking position of the speaker in the venue.
In an exemplary embodiment of the present disclosure, the signal acquisition step comprises:
collecting a first audio signal;
extracting a first voice feature from the first audio signal;
matching the first voice feature against the key voice features in a preset key voice feature model; and
taking a first voice feature that matches a key voice feature as the specific voice signal.
In an exemplary embodiment of the present disclosure, the behavior feature matching step comprises:
analyzing N user behavior features in the video signal, and matching the N user behavior features against the preset standard behavior features;
if the N user behavior features are judged to include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that user behavior feature as a speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature; and
if the matching degrees of all N user behavior features with the preset standard behavior features are less than the preset matching degree, the matching result being that the N user behavior features do not include a speaking behavior feature.
In an exemplary embodiment of the present disclosure, the method further comprises:
determining the ratio between the venue position in the video signal and the actual venue position;
mapping the venue position in the video signal to the actual venue position according to the ratio; and
dividing the mapped venue position in the video signal into regions, and determining a position identifier for each region in the video signal and the corresponding region in the actual venue.
In an exemplary embodiment of the present disclosure, in the speaking position localization step, analyzing the position of the speaker in the venue from the video signal comprises:
analyzing the region in which the speaker is located in the video signal; and
determining the position identifier corresponding to the region in which the speaker is located in the video signal.
In an exemplary embodiment of the present disclosure, in the speaking position localization step, determining the speaking position of the speaker in the venue comprises:
taking the position identifier corresponding to the region in which the speaker is located in the video signal as the speaking position identifier; and
taking the region corresponding to the speaking position identifier as the speaking position, and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier.
In an exemplary embodiment of the present disclosure, the audio/video capture device includes at least one microphone.
In an exemplary embodiment of the present disclosure, the speaking position localization step comprises:
determining the distance between the speaker and the microphone according to the speaking position; and
adjusting the audio capture volume of the microphone according to the distance.
In an exemplary embodiment of the present disclosure, the speaking position localization step comprises:
after determining the speaking position of the speaker in the venue, controlling the audio capture device to track and aim at the speaking position.
In an exemplary embodiment of the present disclosure, the method comprises:
after determining that the speech has ended, resetting the audio/video capture device.
In an exemplary embodiment of the present disclosure, determining that the speech has ended comprises:
obtaining a second audio signal collected by the audio/video capture device;
extracting a second voice feature from the second audio signal; and
if the second voice feature is determined to satisfy a preset ending feature condition, determining that the speech has ended.
In an exemplary embodiment of the present disclosure, the speaking behavior feature includes a hand-raising behavior feature and a standing-up behavior feature.
In one aspect of the present disclosure, a voice localization device based on behavior recognition is provided, comprising:
a signal acquisition module, configured to, when a specific voice signal is received, obtain the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching module, configured to analyze N user behavior features in the video signal and match the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination module, configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker; and
a speaking position localization module, configured to analyze the position of the speaker in the venue from the video signal, determine the speaking position of the speaker in the venue, and control an audio/video capture device to aim at the speaking position of the speaker in the venue.
In one aspect of the present disclosure, an electronic device is provided, comprising:
a processor; and
a memory storing computer-readable instructions which, when executed by the processor, implement the method according to any one of the above.
In one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the method according to any one of the above when executed by a processor.
The voice localization method based on behavior recognition in the exemplary embodiments of the present disclosure, when a specific voice signal is received, obtains the time information at which the specific voice signal was received and the video signal captured by a video capture device during the corresponding period; analyzes N user behavior features in the video signal and matches them against preset standard behavior features; if the matching result indicates that the N user behavior features include a speaking behavior feature, takes the user in the video signal corresponding to that speaking behavior feature as the speaker; and analyzes and determines the speaker's position in the venue from the video signal and controls an audio/video capture device to aim at the speaker's speaking position in the venue. On the one hand, because user behavior features are used as the matching features for identifying the speaker, the identifying features are more distinct and easier to match, which improves the matching and recognition rate; on the other hand, the specific voice signal serves as the trigger for behavior feature matching, which assists the matching, saves matching computation resources, and further improves the matching and recognition rate.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure.
Brief description of the drawings
The above and other features and advantages of the present disclosure will become more apparent from the detailed description of its example embodiments with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a voice localization method based on behavior recognition according to an exemplary embodiment of the present disclosure;
Fig. 2 shows a schematic diagram of an application scenario of the voice localization method based on behavior recognition according to an exemplary embodiment of the present disclosure;
Fig. 3 shows a schematic diagram of another application scenario of the voice localization method based on behavior recognition according to an exemplary embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of an interactive application scenario of the voice localization method based on behavior recognition according to an exemplary embodiment of the present disclosure;
Fig. 5 shows a schematic block diagram of a voice localization device based on behavior recognition according to an exemplary embodiment of the present disclosure;
Fig. 6 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
Fig. 7 schematically shows a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed description of embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in a variety of forms and should not be understood as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The same reference numerals in the figures denote the same or similar parts, and their repeated description will be omitted.
In addition, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of the embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so on. In other cases, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail in order to avoid obscuring aspects of the present disclosure.
The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software form, or a part or all of these functional entities may be implemented in one or more hardened software modules, or they may be implemented in different network and/or processor devices and/or microcontroller devices.
This example embodiment first provides a voice localization method based on behavior recognition, which can be applied to electronic devices such as computers. With reference to Fig. 1, the voice localization method based on behavior recognition may comprise the following steps:
Signal acquisition step S110: when a specific voice signal is received, obtain the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
Behavior feature matching step S120: analyze N user behavior features in the video signal, and match the N user behavior features against preset standard behavior features to obtain a matching result;
Speaker determination step S130: if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker;
Speaking position localization step S140: analyze the position of the speaker in the venue from the video signal, determine the speaking position of the speaker in the venue, and control an audio/video capture device to aim at the speaking position of the speaker in the venue.
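For illustration only, a minimal end-to-end sketch of these four steps in Python is shown below. Every helper it calls (detect_specific_voice, get_video_clip, extract_behavior_features, similarity, locate_in_venue), the device API, and the threshold value are hypothetical placeholders and are not part of the disclosed implementation.

```python
MATCH_THRESHOLD = 0.8  # assumed stand-in for the preset matching degree

def localize_speaker(audio_stream, video_device, av_device, standard_features):
    """Sketch of steps S110-S140 under the assumptions stated above."""
    # S110: wait for the specific voice signal and note its time information
    trigger = detect_specific_voice(audio_stream)             # hypothetical helper
    if trigger is None:
        return None
    video_clip = get_video_clip(video_device, trigger.start, trigger.end)  # hypothetical helper

    # S120: extract the N user behavior features and match them against the preset standards
    user_features = extract_behavior_features(video_clip)     # hypothetical: [(user_id, feature), ...]
    matches = [(user_id, max(similarity(f, s) for s in standard_features))
               for user_id, f in user_features]
    if not matches:
        return None

    # S130: a user whose matching degree reaches the threshold is taken as the speaker
    speaker, degree = max(matches, key=lambda m: m[1])
    if degree < MATCH_THRESHOLD:
        return None  # no speaking behavior feature in the video signal

    # S140: map the speaker's region to a venue position and aim the capture device
    position_id = locate_in_venue(video_clip, speaker)        # hypothetical helper
    av_device.aim_at(position_id)                             # hypothetical device API
    return speaker, position_id
```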
According to the voice localization method based on behavior recognition in this example embodiment, on the one hand, because user behavior features are used as the matching features for identifying the speaker, the identifying features are more distinct and easier to match, which improves the matching and recognition rate; on the other hand, the specific voice signal serves as the trigger for behavior feature matching, which assists the matching, saves matching computation resources, and further improves the matching and recognition rate.
In the following, the voice localization method based on behavior recognition in this example embodiment is described in further detail.
In signal acquisition step S110, when a specific voice signal is received, the time information at which the specific voice signal was received and the video signal captured by the video capture device during the period corresponding to the time information can be obtained.
In the voice localization method of this example embodiment, the user's specific voice signal serves as the start signal for behavior recognition, which can reduce misjudgments of the standard behavior features: only after the signal is judged to be the specific voice signal is its time information obtained, so that the video image captured by the video capture device at the same time can be retrieved for the next step of behavior feature verification.
In this example embodiment, the signal acquisition step comprises: collecting a first audio signal; extracting a first voice feature from the first audio signal; matching the first voice feature against the key voice features in a preset key voice feature model; and taking a first voice feature that matches a key voice feature as the specific voice signal. The specific voice signal may be a phrase commonly used in meetings, such as "***, please speak" or "welcome ***, to speak", or a distinctive phrase commonly used in teaching scenarios in place of free-form user speech, such as "***, please answer" or "the student raising a hand, please".
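A minimal sketch of this keyword-triggered signal acquisition is given below. The patent does not specify the feature type or the matching rule, so MFCC features (via librosa), cosine similarity, and the threshold are all illustrative assumptions, and KEY_PHRASES stands in for the preset key voice feature model.

```python
import numpy as np
import librosa  # assumed audio feature library

KEY_PHRASES = {}             # phrase -> reference MFCC matrix (stand-in for the key voice feature model)
SIMILARITY_THRESHOLD = 0.85  # assumed

def first_voice_feature(samples, sr=16000):
    """First voice feature: mean MFCC vector of the captured first audio signal."""
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def detect_specific_voice(samples, sr=16000):
    """Return the matched key phrase if the first voice feature matches the model, else None."""
    feature = first_voice_feature(samples, sr)
    for phrase, ref_mfcc in KEY_PHRASES.items():
        ref_vec = ref_mfcc.mean(axis=1)
        cos = float(np.dot(feature, ref_vec) /
                    (np.linalg.norm(feature) * np.linalg.norm(ref_vec)))
        if cos >= SIMILARITY_THRESHOLD:
            return phrase  # e.g. "***, please speak" or "***, please answer"
    return None
```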
In behavior feature matching step S120, N user behavior features in the video signal can be analyzed and matched against the preset standard behavior features to obtain a matching result.
In this example embodiment, a meeting or teaching scenario generally contains multiple users, each with corresponding behavior features. These behavior features are recognized and counted, and each behavior feature is matched against the preset standard behavior features; the resulting matching result can be used to further determine whether it is the behavior feature corresponding to a speaker.
In this example embodiment, the behavior feature matching step comprises: analyzing N user behavior features in the video signal, and matching the N user behavior features against the preset standard behavior features; if the N user behavior features are judged to include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that user behavior feature as the speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature; and if the matching degrees of all N user behavior features with the preset standard behavior features are less than the preset matching degree, the matching result being that the N user behavior features do not include a speaking behavior feature. When matching user behavior features in an actual scene, each user's behavior features differ more or less from the standard behavior features; the matching degree reflects these differences, and the preset matching degree serves as the criterion for deciding whether a user's behavior feature counts as a standard behavior feature.
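The comparison against the preset matching degree can be sketched as follows, assuming each behavior feature has already been encoded as a numeric vector and that cosine similarity stands in for the unspecified matching-degree measure:

```python
import numpy as np

PRESET_MATCHING_DEGREE = 0.8  # assumed threshold value

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_behavior_features(user_features, standard_features):
    """user_features: {user_id: feature vector} for the N users in the video signal.
    Returns (matching_result, speaking_users), where matching_result is True when at
    least one user behavior feature counts as a speaking behavior feature."""
    speaking_users = []
    for user_id, feature in user_features.items():
        degree = max(cosine(feature, std) for std in standard_features)
        if degree >= PRESET_MATCHING_DEGREE:
            speaking_users.append((user_id, degree))  # counts as a speaking behavior feature
    return (len(speaking_users) > 0, speaking_users)
```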
In this example embodiment, depending on the application scenario, the speaking behavior feature includes a hand-raising behavior feature and a standing-up behavior feature. Fig. 2 is a schematic diagram of the voice localization method based on behavior recognition in a teaching scenario, in which the user's hand-raising behavior feature is recognized and matched as the standard behavior feature.
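One possible way to obtain a hand-raising behavior feature from the video signal is to use body keypoints from any pose estimator; the snippet below is only an illustration under that assumption and is not the detector prescribed by the patent.

```python
def body_scale(keypoints):
    """Crude scale estimate: the vertical span of the detected joints."""
    ys = [p[1] for p in keypoints.values() if p]
    return (max(ys) - min(ys)) or 1.0

def is_hand_raised(keypoints):
    """keypoints: dict mapping joint names to (x, y) image coordinates (y grows downward).
    A wrist clearly above the corresponding shoulder is treated as a hand-raising feature."""
    scale = body_scale(keypoints)
    for side in ("left", "right"):
        wrist = keypoints.get(f"{side}_wrist")
        shoulder = keypoints.get(f"{side}_shoulder")
        if wrist and shoulder and wrist[1] < shoulder[1] - 0.05 * scale:
            return True
    return False
```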
In speaker determination step S130, if the matching result indicates that the N user behavior features include a speaking behavior feature, the user in the video signal corresponding to the speaking behavior feature can be taken as the speaker.
In this example embodiment, once a user behavior feature has been matched against the standard behavior features and determined to be a standard behavior feature, it indicates that the user who made that behavior intends or tends to speak, and the subject of that user behavior feature is therefore determined to be the speaker.
In speaking position localization step S140, the position of the speaker in the venue can be analyzed from the video signal, the speaking position of the speaker in the venue can be determined, and the audio/video capture device can be controlled to aim at the speaking position of the speaker in the venue.
In this example embodiment, after the speaker is determined, the speaker's speaking position in the venue is determined from venue information, video signal analysis, or similar means, and the volume, focus, orientation, and other settings of the audio/video capture device are then adjusted to facilitate audio/video capture.
In this example embodiment, the method further comprises: determining the ratio between the venue position in the video signal and the actual venue position; mapping the venue position in the video signal to the actual venue position according to the ratio; and dividing the mapped venue position in the video signal into regions, and determining a position identifier for each region in the video signal and the corresponding region in the actual venue. By pre-dividing the venue of a meeting or classroom into regions, the speaker can be located accurately using only the video signal captured by the video capture device, without complex algorithms that consume large amounts of computing resources.
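A minimal sketch of this scale-based mapping and region division, assuming a fixed camera, a rectangular venue, and a grid division whose size is an illustrative choice rather than something the patent specifies:

```python
GRID_ROWS, GRID_COLS = 4, 6  # assumed division of the venue into regions

def build_region_map(frame_w, frame_h, venue_w, venue_h):
    """Map each grid cell of the video frame to a position identifier and to the
    corresponding region of the actual venue, using the frame-to-venue ratio."""
    sx, sy = venue_w / frame_w, venue_h / frame_h  # scale ratio: video -> venue
    cell_w, cell_h = frame_w / GRID_COLS, frame_h / GRID_ROWS
    regions = {}
    for r in range(GRID_ROWS):
        for c in range(GRID_COLS):
            pos_id = f"R{r}C{c}"  # position identifier shared by video and venue regions
            video_rect = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            venue_rect = tuple(v * s for v, s in zip(video_rect, (sx, sy, sx, sy)))
            regions[pos_id] = {"video": video_rect, "venue": venue_rect}
    return regions

def position_identifier(regions, x, y):
    """Return the identifier of the region containing the speaker's pixel position (x, y)."""
    for pos_id, rect in regions.items():
        x0, y0, x1, y1 = rect["video"]
        if x0 <= x < x1 and y0 <= y < y1:
            return pos_id
    return None
```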
In this example embodiment, in the speaking position localization step, analyzing the position of the speaker in the venue from the video signal comprises: analyzing the region in which the speaker is located in the video signal; and determining the position identifier corresponding to the region in which the speaker is located in the video signal. The region in which the speaker is located in the video signal is mapped to the corresponding position in the actual scene, and a position identifier is generated as the location information corresponding to the speaker.
In this example embodiment, in the speaking position localization step, determining the speaking position of the speaker in the venue comprises: taking the position identifier corresponding to the region in which the speaker is located in the video signal as the speaking position identifier; and taking the region corresponding to the speaking position identifier as the speaking position, and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier. In a meeting or teaching scenario, aiming the audio device at the region in the actual venue corresponding to the speaking position identifier captures the speaker's voice more clearly, and aiming the video device at that region allows operations such as focusing, yielding a clearer video picture of the speaker.
In this example embodiment, the audio/video capture device includes at least one microphone. In particular, in scenarios with higher audio requirements, an audio capture array composed of multiple microphones places relatively high demands on the angles of the microphones, so aiming the microphones with this method improves their audio capture quality and enhances the user experience. Fig. 3 is a schematic diagram of a scenario with an audio capture array composed of multiple microphones, in which, after the user's position has been located through user behavior feature recognition, the multiple microphones are aimed at the speaker.
In this example embodiment, the speaking position localization step comprises: determining the distance between the speaker and the microphone according to the speaking position; and adjusting the audio capture volume of the microphone according to the distance. Adjusting the volume is also an important aspect of enhancing the user experience: in a simple application scenario, for example, boosting the capture volume for a speaker who is farther away from the microphone yields clearer audio from that speaker.
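A sketch of the distance-based volume adjustment is shown below; the inverse-distance gain rule, the cap, and the microphone API are assumptions, since the patent only states that the capture volume is adjusted according to the distance.

```python
import math

REFERENCE_DISTANCE_M = 1.0  # distance at which no boost is applied (assumed)
MAX_GAIN_DB = 18.0          # cap on the boost (assumed)

def gain_for_distance(distance_m):
    """Boost the capture gain roughly with distance (about 6 dB per doubling), capped."""
    if distance_m <= REFERENCE_DISTANCE_M:
        return 0.0
    gain_db = 20.0 * math.log10(distance_m / REFERENCE_DISTANCE_M)
    return min(gain_db, MAX_GAIN_DB)

def aim_and_set_gain(microphone, speaker_xy, mic_xy):
    """Point the microphone at the speaking position and set its capture gain."""
    distance = math.dist(speaker_xy, mic_xy)
    microphone.set_gain_db(gain_for_distance(distance))  # hypothetical device API
    microphone.point_at(speaker_xy)                      # hypothetical device API
```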
In this example embodiment, the speaking position localization step comprises: after determining the speaking position of the speaker in the venue, controlling the audio capture device to track and aim at the speaking position. Because the speaker is located and tracked by matching the speaker's behavior features in the video captured by the video capture device, the audio/video capture device can keep aiming at and tracking the speaker even when the speaker changes position dynamically.
In this example embodiment, the method comprises: after determining that the speech has ended, resetting the audio/video capture device. After the speaker finishes speaking, the audio/video capture device is reset to its initial position so that the next speaker can be aimed at and tracked quickly.
In this example embodiment, determining that the speech has ended comprises: obtaining a second audio signal collected by the audio/video capture device; extracting a second voice feature from the second audio signal; and if the second voice feature is determined to satisfy a preset ending feature condition, determining that the speech has ended. The ending feature condition may be a preset duration: if no sound is detected from the speaking position aimed at by the audio/video capture device for longer than the preset duration, the ending feature condition is satisfied. It may also be a preset ending feature library: the second voice feature is matched against the ending feature library, and if the match succeeds, the ending feature condition is satisfied; the second voice feature may be a phrase such as "my speech is over", "the answer is finished", or "thank you, ***".
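Both ending conditions, the silence timeout and the ending-phrase match, can be sketched as follows; the timeout length, the energy threshold, and the match_ending_phrase helper are illustrative assumptions.

```python
import time
import numpy as np

SILENCE_TIMEOUT_S = 10.0  # assumed preset duration
ENERGY_THRESHOLD = 1e-4   # assumed RMS level below which a frame counts as silence

def speech_has_ended(frames, sample_rate, last_voice_time, ending_model):
    """frames: latest audio samples from the aimed microphone (the 'second audio signal').
    Returns (ended, updated_last_voice_time)."""
    rms = float(np.sqrt(np.mean(np.square(frames))))
    now = time.monotonic()
    if rms >= ENERGY_THRESHOLD:
        last_voice_time = now
        # Second voice feature matched against the preset ending feature library,
        # e.g. "my speech is over" or "the answer is finished".
        if match_ending_phrase(frames, sample_rate, ending_model):  # hypothetical helper
            return True, last_voice_time
    elif now - last_voice_time > SILENCE_TIMEOUT_S:
        return True, last_voice_time  # silence timeout reached
    return False, last_voice_time
```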
In this example embodiment, the method can be applied on a PC or on a portable handheld device, and data interaction can be realized between the two types of devices. Fig. 4 is a schematic diagram of a scenario in which the method is applied on a PC and a portable handheld device with data interaction between them.
It should be noted that although the steps of the method of the present disclosure are described in the drawings in a particular order, this does not require or imply that the steps must be executed in that particular order, or that all of the steps shown must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps, and so on.
In addition, this example embodiment further provides a voice localization device based on behavior recognition. Referring to Fig. 5, the voice localization device 500 based on behavior recognition may include: a signal acquisition module 510, a behavior feature matching module 520, a speaker determination module 530, and a speaking position localization module 540. Specifically:
the signal acquisition module 510 is configured to, when a specific voice signal is received, obtain the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
the behavior feature matching module 520 is configured to analyze N user behavior features in the video signal and match the N user behavior features against preset standard behavior features to obtain a matching result;
the speaker determination module 530 is configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker; and
the speaking position localization module 540 is configured to analyze the position of the speaker in the venue from the video signal, determine the speaking position of the speaker in the venue, and control an audio/video capture device to aim at the speaking position of the speaker in the venue.
The details of each module of the voice localization device based on behavior recognition have already been described in detail in the corresponding voice localization method based on behavior recognition above and are therefore not repeated here.
It should be noted that although several modules or units of the voice localization device 500 based on behavior recognition are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit, and conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that various aspects of the present invention may be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention may take the form of a complete hardware embodiment, a complete software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
An electronic device 600 according to this embodiment of the present invention is described below with reference to Fig. 6. The electronic device 600 shown in Fig. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), and a display unit 640.
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps of the various exemplary embodiments of the present invention described in the "Exemplary methods" section of this specification. For example, the processing unit 610 may perform steps S110 to S140 as shown in Fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random-access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, the processing unit, or a local bus using any of a variety of bus structures.
The electronic device 600 may also communicate with one or more external devices 670 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device that enables the electronic device 600 to communicate with one or more other computing devices (such as a router, a modem, etc.). Such communication may take place through an input/output (I/O) interface 650. Moreover, the electronic device 600 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the description of the above embodiments, those skilled in the art will readily understand that the example embodiments described herein may be implemented in software or in software combined with the necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments of the present invention described in the "Exemplary methods" section of this specification.
Referring to Fig. 7, a program product 700 for implementing the above method according to an embodiment of the present invention is described. It may employ a portable compact disc read-only memory (CD-ROM) containing program code and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium containing or storing a program that can be used by, or in combination with, an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
Furthermore, the above-mentioned drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present invention and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes; it is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or conventional technical means in the art not disclosed in the present disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. A voice localization method based on behavior recognition, characterized in that the method comprises:
a signal acquisition step: when a specific voice signal is received, obtaining the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching step: analyzing N user behavior features in the video signal, and matching the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination step: if the matching result indicates that the N user behavior features include a speaking behavior feature, taking the user in the video signal corresponding to the speaking behavior feature as the speaker;
a speaking position localization step: analyzing the position of the speaker in the venue from the video signal, determining the speaking position of the speaker in the venue, and controlling an audio/video capture device to aim at the speaking position of the speaker in the venue.
2. The method according to claim 1, characterized in that the signal acquisition step comprises:
collecting a first audio signal;
extracting a first voice feature from the first audio signal;
matching the first voice feature against the key voice features in a preset key voice feature model; and
taking a first voice feature that matches a key voice feature as the specific voice signal.
3. The method according to claim 1, characterized in that the behavior feature matching step comprises:
analyzing N user behavior features in the video signal, and matching the N user behavior features against the preset standard behavior features;
if the N user behavior features are judged to include a user behavior feature whose matching degree with a preset standard behavior feature is greater than or equal to a preset matching degree, taking that user behavior feature as the speaking behavior feature, the matching result being that the N user behavior features include a speaking behavior feature; and
if the matching degrees of all N user behavior features with the preset standard behavior features are less than the preset matching degree, the matching result being that the N user behavior features do not include a speaking behavior feature.
4. The method according to claim 1, characterized in that the method further comprises:
determining the ratio between the venue position in the video signal and the actual venue position;
mapping the venue position in the video signal to the actual venue position according to the ratio; and
dividing the mapped venue position in the video signal into regions, and determining a position identifier for each region in the video signal and the corresponding region in the actual venue.
5. The method according to claim 4, characterized in that, in the speaking position localization step, analyzing the position of the speaker in the venue from the video signal comprises:
analyzing the region in which the speaker is located in the video signal; and
determining the position identifier corresponding to the region in which the speaker is located in the video signal.
6. The method according to claim 5, characterized in that, in the speaking position localization step, determining the speaking position of the speaker in the venue comprises:
taking the position identifier corresponding to the region in which the speaker is located in the video signal as the speaking position identifier; and
taking the region corresponding to the speaking position identifier as the speaking position, and controlling the audio/video capture device to aim at the region in the actual venue corresponding to the speaking position identifier.
7. The method according to claim 1, characterized in that the audio/video capture device includes at least one microphone.
8. The method according to claim 7, characterized in that the speaking position localization step comprises:
determining the distance between the speaker and the microphone according to the speaking position; and
adjusting the audio capture volume of the microphone according to the distance.
9. The method according to claim 1, characterized in that the speaking position localization step comprises:
after determining the speaking position of the speaker in the venue, controlling the audio/video capture device to track and aim at the speaking position.
10. The method according to claim 1, characterized in that the method comprises:
after determining that the speech has ended, resetting the audio/video capture device.
11. The method according to claim 10, characterized in that determining that the speech has ended comprises:
obtaining a second audio signal collected by the audio/video capture device;
extracting a second voice feature from the second audio signal; and
if the second voice feature is determined to satisfy a preset ending feature condition, determining that the speech has ended.
12. The method according to claim 1, characterized in that the speaking behavior feature includes a hand-raising behavior feature and a standing-up behavior feature.
13. A voice localization device based on behavior recognition, characterized in that the device comprises:
a signal acquisition module, configured to, when a specific voice signal is received, obtain the time information at which the specific voice signal was received and the video signal captured by a video capture device during the period corresponding to the time information;
a behavior feature matching module, configured to analyze N user behavior features in the video signal and match the N user behavior features against preset standard behavior features to obtain a matching result;
a speaker determination module, configured to, if the matching result indicates that the N user behavior features include a speaking behavior feature, take the user in the video signal corresponding to the speaking behavior feature as the speaker; and
a speaking position localization module, configured to analyze the position of the speaker in the venue from the video signal, determine the speaking position of the speaker in the venue, and control an audio/video capture device to aim at the speaking position of the speaker in the venue.
14. An electronic device, characterized in that it comprises:
a processor; and
a memory storing computer-readable instructions which, when executed by the processor, implement the method according to any one of claims 1 to 12.
15. A computer-readable storage medium on which a computer program is stored, the computer program implementing the method according to any one of claims 1 to 12 when executed by a processor.
CN201810557504.1A 2018-06-01 2018-06-01 Voice localization method and device based on behavior recognition Pending CN109031201A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810557504.1A CN109031201A (en) 2018-06-01 2018-06-01 The voice localization method and device of Behavior-based control identification
PCT/CN2018/092791 WO2019227552A1 (en) 2018-06-01 2018-06-26 Behavior recognition-based speech positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810557504.1A CN109031201A (en) 2018-06-01 2018-06-01 The voice localization method and device of Behavior-based control identification

Publications (1)

Publication Number Publication Date
CN109031201A true CN109031201A (en) 2018-12-18

Family

ID=64612185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810557504.1A Pending CN109031201A (en) 2018-06-01 2018-06-01 The voice localization method and device of Behavior-based control identification

Country Status (2)

Country Link
CN (1) CN109031201A (en)
WO (1) WO2019227552A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862316A (en) * 2019-01-29 2019-06-07 安徽理工大学 A kind of AM automatic monitoring square law device based on image analysis technology
CN111343408A (en) * 2020-01-22 2020-06-26 北京翼鸥教育科技有限公司 Method for initiating and responding to multi-party video activities by lifting hands and interaction system
CN112788278A (en) * 2020-12-30 2021-05-11 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN113242505A (en) * 2021-05-18 2021-08-10 苏州朗捷通智能科技有限公司 Audio control system and control method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1422494A (en) * 2000-12-05 2003-06-04 皇家菲利浦电子有限公司 Method and apparatus for predicting events in video conferencing and other applications
CN103686074A (en) * 2013-11-20 2014-03-26 南京熊猫电子股份有限公司 Method for positioning mobile object in video monitoring
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
KR20150137491A (en) * 2014-05-29 2015-12-09 경희대학교 산학협력단 Apparatus and method for detecting face using multiple source localization
CN106251334A (en) * 2016-07-18 2016-12-21 华为技术有限公司 A kind of camera parameters method of adjustment, instructor in broadcasting's video camera and system
CN106603878A (en) * 2016-12-09 2017-04-26 奇酷互联网络科技(深圳)有限公司 Voice positioning method, device and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005277445A (en) * 2004-03-22 2005-10-06 Fuji Xerox Co Ltd Conference video image processing apparatus, and conference video image processing method and program
CN103581608B (en) * 2012-07-20 2019-02-01 Polycom 通讯技术(北京)有限公司 Spokesman's detection system, spokesman's detection method and audio/video conferencingasystem figureu
CN107369449B (en) * 2017-07-14 2019-11-26 上海木木机器人技术有限公司 A kind of efficient voice recognition methods and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1422494A (en) * 2000-12-05 2003-06-04 皇家菲利浦电子有限公司 Method and apparatus for predicting events in video conferencing and other applications
CN103686074A (en) * 2013-11-20 2014-03-26 南京熊猫电子股份有限公司 Method for positioning mobile object in video monitoring
KR20150137491A (en) * 2014-05-29 2015-12-09 경희대학교 산학협력단 Apparatus and method for detecting face using multiple source localization
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Voice identification method and system
CN106251334A (en) * 2016-07-18 2016-12-21 华为技术有限公司 A kind of camera parameters method of adjustment, instructor in broadcasting's video camera and system
CN106603878A (en) * 2016-12-09 2017-04-26 奇酷互联网络科技(深圳)有限公司 Voice positioning method, device and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862316A (en) * 2019-01-29 2019-06-07 安徽理工大学 A kind of AM automatic monitoring square law device based on image analysis technology
CN111343408A (en) * 2020-01-22 2020-06-26 北京翼鸥教育科技有限公司 Method for initiating and responding to multi-party video activities by lifting hands and interaction system
CN111343408B (en) * 2020-01-22 2021-02-09 北京翼鸥教育科技有限公司 Method for initiating and responding to multi-party video activities by lifting hands and interaction system
CN112788278A (en) * 2020-12-30 2021-05-11 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN113242505A (en) * 2021-05-18 2021-08-10 苏州朗捷通智能科技有限公司 Audio control system and control method thereof

Also Published As

Publication number Publication date
WO2019227552A1 (en) 2019-12-05

Similar Documents

Publication Publication Date Title
CN110458122B (en) Sight line calibration method, display device playing method and sight line calibration system
CN109902630B (en) Attention judging method, device, system, equipment and storage medium
CN109031201A (en) The voice localization method and device of Behavior-based control identification
CN109754814B (en) Sound processing method and interaction equipment
EP2509070A1 (en) Apparatus and method for determining relevance of input speech
US10582117B1 (en) Automatic camera control in a video conference system
US11854566B2 (en) Wearable system speech processing
CN108877787A (en) Audio recognition method, device, server and storage medium
CN108537702A (en) Foreign language teaching evaluation information generation method and device
CN110059624B (en) Method and apparatus for detecting living body
CN111050271B (en) Method and apparatus for processing audio signal
US20160094812A1 (en) Method And System For Mobile Surveillance And Mobile Infant Surveillance Platform
CN110611861B (en) Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN108594999A (en) Control method and device for panoramic picture display systems
CN114779922A (en) Control method for teaching apparatus, control apparatus, teaching system, and storage medium
CN110188179B (en) Voice directional recognition interaction method, device, equipment and medium
CN112286364A (en) Man-machine interaction method and device
CN117274383A (en) Viewpoint prediction method and device, electronic equipment and storage medium
US11622071B2 (en) Follow-up shooting method and device, medium and electronic device
CN113591678A (en) Classroom attention determination method, device, equipment, storage medium and program product
CN112711331A (en) Robot interaction method and device, storage equipment and electronic equipment
CN112487246A (en) Method and device for identifying speakers in multi-person video
CN110085264A (en) Voice signal detection method, device, equipment and storage medium
CN116400802A (en) Virtual reality device and multi-modal emotion recognition method
CN116301381A (en) Interaction method, related equipment and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218