CN110248197A - Sound enhancement method and device - Google Patents

Sound enhancement method and device

Info

Publication number
CN110248197A
Authority
CN
China
Prior art keywords
space
region
image
voice signal
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810185895.9A
Other languages
Chinese (zh)
Other versions
CN110248197B (en)
Inventor
陈扬坤
钱能锋
陈展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810185895.9A priority Critical patent/CN110248197B/en
Publication of CN110248197A publication Critical patent/CN110248197A/en
Application granted granted Critical
Publication of CN110248197B publication Critical patent/CN110248197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 Microphone arrays; Beamforming
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/218 Source of audio or video content, e.g. local disk arrays
    • H04N 21/21805 Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23412 Processing of video elementary streams for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Studio Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

This application discloses a sound enhancement method and device, belonging to the field of multimedia processing. The method includes: obtaining a target image, the target image comprising N image regions; when a preset operation on a target image region among the N image regions is received, determining a target spatial direction corresponding to the target image region, and performing speech enhancement processing on the voice signal corresponding to the target spatial direction. Because the speech enhancement system performs sound source localization according to the image region specified by the user's preset operation, the located target spatial direction is the direction in which the user actually wants the voice enhanced, which improves the accuracy of sound source localization and the quality of the enhanced voice signal, and significantly improves the performance of the speech enhancement system.

Description

Sound enhancement method and device
Technical field
The invention relates to the field of multimedia processing, and in particular to a sound enhancement method and device.
Background art
Sound enhancement refers to methods for extracting a useful voice signal from ambient noise in order to reduce noise interference.
At present, taking a sound enhancement method based on a microphone array as an example, the method is as follows: a video camera performs spatial filtering on the sound signals collected by its multiple microphones, according to the spatial phase information contained in the collected voice signals, so as to form a spatial beam with a pointing direction, thereby enhancing the voice signal in a specified direction.
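The spatial filtering described in this background can be pictured as delay-and-sum beamforming: the arrival delay of a wavefront at each microphone is compensated for a chosen direction before the channels are averaged, so signals from that direction add coherently. The circular-array geometry, radius, sample rate, and speed of sound below are illustrative assumptions for the sketch, not values taken from this application.

```python
import numpy as np

def steering_delays(mic_angles_deg, radius_m, steer_deg, c=343.0):
    """Per-microphone arrival delays (s) for a circular array and a
    plane wave arriving from direction steer_deg.

    Each delay is the projection of the microphone's position onto the
    steering direction, divided by the speed of sound c.
    """
    theta = np.deg2rad(steer_deg)
    phi = np.deg2rad(np.asarray(mic_angles_deg, dtype=float))
    return radius_m * np.cos(phi - theta) / c

def delay_and_sum(signals, delays, fs):
    """Align each channel by its (integer-sample) delay, then average.

    signals: array (n_mics, n_samples); delays: seconds per channel.
    """
    shifts = np.round(np.asarray(delays) * fs).astype(int)
    shifts -= shifts.min()                 # make all shifts non-negative
    n = signals.shape[1] - shifts.max()    # common aligned length
    aligned = [ch[s:s + n] for ch, s in zip(signals, shifts)]
    return np.mean(aligned, axis=0)
```

Under these assumptions, a source simulated with exactly the steered direction's delays is recovered unchanged by the beamformer, while sources from other directions would add incoherently and be attenuated.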
However, in the above method, when there are multiple voice signals or strong ambient noise in the use environment, the video camera generally selects the loudest voice signal for enhancement, which may lead to a situation where the enhanced voice signal is not the voice signal the user actually wants enhanced.
Summary of the invention
In order to solve the problem of inaccurate sound source localization during speech enhancement in the related art, the embodiments of the present application provide a sound enhancement method and device. The technical solution is as follows:
In a first aspect, a sound enhancement method is provided, the method comprising:
obtaining a target image of a video collection area, the target image comprising N image regions, N being a positive integer greater than 1;
when a preset operation on a target image region among the N image regions is received, determining a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing needs to be performed;
performing speech enhancement processing on the voice signal corresponding to the target spatial direction.
Optionally, determining the target spatial direction corresponding to the target image region when the preset operation on the target image region in the target image is received comprises:
when the preset operation on the target image is received, determining the image region corresponding to the preset operation as the target image region;
according to a first preset correspondence, determining the spatial direction corresponding to the target image region as the target spatial direction, the first preset correspondence comprising the correspondence between image regions and spatial directions.
Optionally, performing speech enhancement processing on the voice signal corresponding to the target spatial direction comprises:
performing speech enhancement processing on the voice signal from the target spatial direction, and performing voice suppression processing on voice signals from non-target spatial directions;
wherein the non-target spatial directions are the other spatial directions in the video collection area apart from the target spatial direction.
Optionally, performing speech enhancement processing on the voice signal corresponding to the target spatial direction comprises:
according to a second preset correspondence, determining a target local space corresponding to the target spatial direction, the second preset correspondence comprising the correspondence between spatial directions and local spaces;
performing speech enhancement processing on the voice signal from the target local space, and performing voice suppression processing on voice signals from non-target local spaces;
wherein the non-target local spaces are the other spaces in the video collection area apart from the target local space.
Optionally, the video collection area comprises M different shooting areas, M being a positive integer greater than 1, and obtaining the target image of the video collection area comprises:
obtaining shooting images corresponding to the M shooting areas;
stitching the M shooting images to obtain the target image.
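As a minimal sketch of the stitching step, assume the M shooting images are already rectified to a common height and have no overlap; stitching then reduces to horizontal concatenation. This is an illustrative simplification, not the application's implementation: a production stitcher would also align and blend overlapping seams.

```python
import numpy as np

def stitch_horizontal(shots):
    """Stitch M shooting images side by side into one target image.

    shots: list of arrays shaped (height, width, channels), one per
    shooting area, assumed rectified and non-overlapping.
    """
    heights = {img.shape[0] for img in shots}
    if len(heights) != 1:
        raise ValueError("all shooting images must share one height")
    return np.concatenate(shots, axis=1)
```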
In a second aspect, a speech enhancement device is provided, the device comprising:
an obtaining module, configured to obtain a target image of a video collection area, the target image comprising N image regions, N being a positive integer greater than 1;
a determining module, configured to determine, when a preset operation on a target image region among the N image regions is received, a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing needs to be performed;
an enhancement module, configured to perform speech enhancement processing on the voice signal corresponding to the target spatial direction.
Optionally, the determining module is further configured to, when the preset operation in the N image regions is received, determine the image region corresponding to the preset operation as the target image region, and, according to a first preset correspondence, determine the spatial direction corresponding to the target image region as the target spatial direction, the first preset correspondence comprising the correspondence between image regions and spatial directions.
Optionally, the enhancement module is further configured to perform speech enhancement processing on the voice signal from the target spatial direction, and perform voice suppression processing on voice signals from non-target spatial directions;
wherein the non-target spatial directions are the other spatial directions apart from the target spatial direction.
Optionally, the enhancement module is further configured to determine, according to a second preset correspondence, a target local space corresponding to the target spatial direction, the second preset correspondence comprising the correspondence between spatial directions and local spaces; and to perform speech enhancement processing on the voice signal from the target local space and voice suppression processing on voice signals from non-target local spaces;
wherein the non-target local spaces are the other spaces in the video collection area apart from the target local space.
Optionally, the video collection area comprises M different shooting areas, M being a positive integer greater than 1, and the obtaining module is further configured to obtain shooting images corresponding to the M shooting areas, and to stitch the M shooting images to obtain the target image.
In a third aspect, a video camera is provided, the video camera comprising a processor and a memory, the memory storing at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by the processor to implement the sound enhancement method provided by the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a terminal is provided, the terminal comprising a processor and a memory, the memory storing at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by the processor to implement the sound enhancement method provided by the first aspect or any possible implementation of the first aspect.
In a fifth aspect, a speech enhancement system is provided, the system comprising a video camera and a terminal, the video camera being connected to the terminal, and the video camera comprising at least three cameras and at least six microphones;
the terminal is configured to obtain a target image of a video collection area, the target image comprising N image regions, N being a positive integer greater than 1;
the terminal is further configured to determine, when a preset operation on a target image region among the N image regions is received, a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing needs to be performed;
the terminal or the video camera is configured to perform speech enhancement processing on the voice signal corresponding to the target spatial direction.
In a sixth aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction, at least one program segment, a code set or an instruction set, which is loaded and executed by a processor to implement the sound enhancement method provided by the first aspect or any possible implementation of the first aspect.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
The speech enhancement system obtains a target image comprising N image regions; when a preset operation on a target image region among the N image regions is received, it determines the target spatial direction corresponding to the target image region and performs speech enhancement processing on the voice signal corresponding to that direction. Because the system performs sound source localization according to the image region specified by the user's preset operation, the located target spatial direction is the direction in which the user actually wants the voice enhanced, which improves the accuracy of sound source localization and the quality of the enhanced voice signal, and significantly improves the performance of the speech enhancement system.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a speech enhancement system provided by an exemplary embodiment of the application;
Fig. 2 is a schematic structural diagram of the video camera in the speech enhancement system provided by an exemplary embodiment of the application;
Fig. 3 is a flowchart of a sound enhancement method provided by an exemplary embodiment of the application;
Fig. 4 is a flowchart of a sound enhancement method provided by another exemplary embodiment of the application;
Fig. 5 is a schematic diagram of the way the video collection area is divided in a sound enhancement method provided by an exemplary embodiment of the application;
Fig. 6 is a schematic diagram of the way the target image is divided in a sound enhancement method provided by an exemplary embodiment of the application;
Fig. 7 is a schematic diagram of the principle of a sound enhancement method provided by an exemplary embodiment of the application;
Fig. 8 is a structural diagram of a speech enhancement device provided by an exemplary embodiment of the application;
Fig. 9 is a structural block diagram of a terminal provided by an exemplary embodiment of the application.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of the application clearer, the embodiments of the application are described in further detail below with reference to the accompanying drawings.
Referring to FIG. 1, which shows a schematic structural diagram of a speech enhancement system provided by an exemplary embodiment of the application. The speech enhancement system comprises a video camera 120 and a terminal 140.
The video camera 120 comprises at least one camera and a microphone array; the video camera 120 is configured to obtain a target image of the video collection area through the at least one camera, and to collect voice signals through the microphone array.
Optionally, M cameras are provided in the video camera 120; correspondingly, the video collection area is divided into M different shooting areas, with a one-to-one correspondence between cameras and shooting areas. The video camera 120 collects the shooting image of each camera's corresponding shooting area through the M cameras, and stitches the M shooting images to obtain the target image. That is, the target image comprises the shooting images corresponding to the M shooting areas, M being a positive integer greater than 1. The target image may be regarded as a panoramic image or a wide-angle image.
Wherein, either there is no intersection among the M different shooting areas, or at least two of them intersect.
Optionally, the video collection area is a circular area, and at least one of the M shooting areas is a sector, or all M shooting areas are sectors.
Optionally, the microphone array is an annular microphone array, the annular microphone array comprising at least six microphones.
In the following, the description only takes a video camera 120 comprising three cameras and eight microphones as an example. Schematically, refer to the structural diagram of the video camera 120 shown in Fig. 2. The video camera 120 comprises three cameras 122 and eight microphones 124.
The three cameras 122 are respectively a first camera 122, a second camera 122 and a third camera 122.
The three cameras 122 are scattered around an origin, the origin being the position of the center point of the video camera 120, and the video camera 120 establishes a coordinate system according to the origin.
Optionally, one method of establishing the coordinate system is as follows: taking the center point of the video camera as the origin, the direction in which the center point points towards a preset direction is the positive y-axis direction, and the direction perpendicular to the y-axis and pointing to the right is the positive x-axis direction. The present embodiment is described with reference to Fig. 2 in combination with this coordinate system, and does not limit the method of establishing the coordinate system.
Each of the three cameras 122 corresponds to one shooting area, and each camera 122 is used to collect the shooting image of its corresponding shooting area. Optionally, the first camera 122 collects the shooting image of a first shooting area, the first shooting area being the region from 0° to 120° relative to the positive y-axis direction; the second camera 122 collects the shooting image of a second shooting area, the second shooting area being the region from 120° to 240° relative to the positive y-axis direction; and the third camera 122 collects the shooting image of a third shooting area, the third shooting area being the region from 240° to 360° relative to the positive y-axis direction.
The present embodiment does not limit the value ranges of the first preset angle and the second preset angle; the description below only takes the case where both the first preset angle and the second preset angle are 120 degrees as an example.
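With the 120° sectors described above, deciding which camera's shooting area covers a given spatial angle reduces to an interval test. The helper below is an illustrative sketch under those sector assumptions, not the patent's implementation.

```python
def camera_for_angle(angle_deg, sector_deg=120.0, num_cameras=3):
    """Return the 0-based index of the camera whose sector covers angle_deg.

    Angles are measured from the positive y-axis: camera 0 covers
    [0, 120), camera 1 covers [120, 240), camera 2 covers [240, 360).
    Angles outside [0, 360) are wrapped first.
    """
    return int((angle_deg % 360.0) // sector_deg) % num_cameras
```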
Optionally, the eight microphones 124 are scattered around the origin; the distances between each pair of adjacent microphones among the eight microphones 124 are all equal, or are all different, or are equal for at least two pairs of adjacent microphones.
Optionally, any four of the eight microphones 124 are in the same plane, or at least four microphones 124 are in the same plane, or at least four microphones 124 are not in the same plane.
Wherein, the cameras and microphones in the video camera 120 may be fixed or may be rotatable.
It should be noted that the present embodiment does not limit the positions and types of the cameras and microphones.
The video camera 120 is configured to obtain the target image of the video collection area and send the obtained target image to the terminal 140. Correspondingly, the terminal 140 receives the target image.
Optionally, the video camera 120 establishes a communication connection with the terminal 140 through a wireless network or a wired network.
The terminal 140 is a terminal with a display screen, for example a mobile phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop portable computer, a desktop computer, and the like.
Optionally, the display screen is a liquid crystal display or an OLED display; schematically, the liquid crystal display comprises at least one of an STN (Super Twisted Nematic) screen, a UFB (Ultra Fine Bright) screen, a TFD (Thin Film Diode) screen, and a TFT (Thin Film Transistor) screen.
Generally, the terminal 140 receives the target image sent by the video camera 120 and displays the target image on the display screen. When the terminal 140 receives a preset operation on a target image region in the target image, it determines the target spatial direction corresponding to the target image region, and speech enhancement processing is performed on the voice signal corresponding to the target spatial direction.
Optionally, the above wireless or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or any combination of mobile, wired or wireless networks, dedicated networks or virtual private networks. In some embodiments, technologies and/or formats including Hyper Text Markup Language (HTML) and Extensible Markup Language (XML) are used to represent the data exchanged over the network. In addition, conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN) and Internet Protocol Security (IPsec) may be used to encrypt all or some of the links. In other embodiments, customized and/or proprietary data communication technologies may also be used to replace or supplement the above data communication technologies.
Referring to FIG. 3, which shows a flowchart of a sound enhancement method provided by an exemplary embodiment of the application. The present embodiment is described taking the application of this sound enhancement method to the speech enhancement system shown in Fig. 1 as an example. The sound enhancement method comprises:
Step 301: obtain a target image of the video collection area, the target image comprising N image regions, N being a positive integer greater than 1.
Optionally, obtaining the target image of the video collection area by the speech enhancement system comprises: the video camera collects the target image of the video collection area and sends the collected target image to the terminal; correspondingly, the terminal receives the target image.
Wherein, the video camera collects the target image of the video collection area in real time, or collects it at predetermined intervals; the target image is used to indicate the surrounding environment of the video camera.
Optionally, the video collection area is a preset region for collecting the target image, and comprises the whole region of the scene where the video camera is located, or a preset local region.
When the video collection area is the whole region, the target image is a panoramic image of the whole region; when the video collection area is a preset local region, the target image is a partial image of that local region. The description below only takes the case where the target image is a panoramic image as an example.
Optionally, after obtaining the target image, the terminal divides the target image into N image regions according to a preset division rule, the preset division rule being used to indicate the number of image regions to divide and the area size of each image region.
Wherein, the area sizes of at least two of the N image regions are the same, or the area sizes of at least two image regions are different, or the area sizes of any two image regions are the same. The description below only takes the case where the N image regions all have the same area size as an example.
The terminal divides the target image into N image regions according to the preset division rule in ways including, but not limited to, the following two possible division modes:
In a first possible division mode, the terminal divides the target image into N image regions according to the number of shooting areas, each image region corresponding to one shooting area.
In a second possible division mode, the terminal divides the target image into M local regions, each local region corresponding to one shooting area; then, for each local region, the terminal further divides it into K image regions, so that the target image is divided into M*K image regions in total, K being a positive integer greater than 1. The present embodiment does not limit the number of image regions. The description below only takes the second possible division mode with M = 3 and K = 8, i.e., a target image comprising 24 image regions, as an example. The specific division mode can refer to the related description in the following embodiments and will not be introduced here.
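The second division mode (M = 3 local regions, each split into K = 8 image regions) can be sketched as a grid of pixel boxes over the panoramic image. How the K regions within a strip are arranged is not specified here, so the 2-row-by-4-column layout below is a hypothetical choice for illustration.

```python
def divide_regions(width, height, m=3, k=8):
    """Split a target image of width x height pixels into m*k regions.

    Each of the m local regions (one per shooting area) is a vertical
    strip, further cut into k tiles (2 rows x 4 columns here, an
    illustrative layout). Returns (x0, y0, x1, y1) boxes, len == m*k.
    """
    rows, cols = 2, 4
    assert rows * cols == k, "this sketch fixes k = 8 as 2 x 4"
    boxes = []
    strip_w = width // m
    tile_w, tile_h = strip_w // cols, height // rows
    for i in range(m):                    # local region / shooting area
        x_base = i * strip_w
        for r in range(rows):
            for c in range(cols):
                x0 = x_base + c * tile_w
                y0 = r * tile_h
                boxes.append((x0, y0, x0 + tile_w, y0 + tile_h))
    return boxes
```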
Step 302: when a preset operation on a target image region among the N image regions is received, determine a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing needs to be performed.
Optionally, determining the target spatial direction when the speech enhancement system receives the preset operation on the target image region comprises: after the terminal obtains the target image, it displays the target image on the display screen; when the terminal receives the preset operation on the target image, it determines the image region corresponding to the preset operation as the target image region, and determines the target spatial direction corresponding to the target image region.
The preset operation is a user operation for determining the target image region among the N image regions. Schematically, the preset operation comprises any one or a combination of a click operation, a slide operation, a press operation and a long-press operation.
In other possible implementations, the preset operation may also be realized in speech form. For example, the user inputs, in speech form on the terminal, preset information corresponding to a target image region; after the terminal obtains the voice signal, it parses the voice signal to obtain the voice content; when a keyword matching the preset information corresponding to a target image region exists in the voice content, the terminal determines the target image region corresponding to that preset information.
The terminal determines the target spatial direction corresponding to the target image region according to the determined target image region and the first preset correspondence, the first preset correspondence comprising the correspondence between image regions and spatial directions. The process by which the terminal determines the target spatial direction can refer to the related description in the following example and is not introduced here.
Wherein, the spatial direction may be represented by a spatial angle or a spatial angle interval. The spatial angle is the angle relative to the positive y-axis direction of the coordinate system established above.
Optionally, an angle formed clockwise from the positive y-axis direction is a negative angle, and an angle formed counterclockwise from the positive y-axis direction is a positive angle. The present implementation does not limit the representation of the spatial direction.
Schematically, the target image comprises 24 image regions; when the terminal receives the preset operation on the target image, it determines the target image region A corresponding to the preset operation among the 24 image regions, and determines, according to the first preset correspondence, that the target spatial direction corresponding to the target image region A is 30°.
Step 303: perform speech enhancement processing on the sound signal corresponding to the target spatial direction.
The speech enhancement system performs speech enhancement processing on the sound signal corresponding to the target spatial direction in, but not limited to, the following two possible implementations.
First possible implementation: the terminal obtains the set of sound signals of the video collection area and performs speech enhancement processing on the sound signal in the set that corresponds to the target spatial direction.
Optionally, the video camera collects the set of sound signals of the video collection area through a microphone array and sends the set to the terminal; correspondingly, the terminal receives the set of sound signals sent by the video camera and performs speech enhancement processing on the sound signal coming from the target spatial direction.
Second possible implementation: the video camera receives the target spatial direction sent by the terminal and performs speech enhancement processing on the collected sound signal corresponding to the target spatial direction.
Optionally, after the terminal determines the target spatial direction, it sends the target spatial direction to the video camera; correspondingly, the video camera receives the target spatial direction and performs speech enhancement processing on the sound signal from that direction. The process by which the video camera does so is described in the examples below and is not detailed here.
Schematically, when the target spatial direction determined by the terminal is 30°, the video camera performs speech enhancement processing on the sound signal coming from the direction at 30° to the positive y-axis.
It should be noted that steps 301 and 302 can be implemented independently as a sound source localization method, which is usually performed by the terminal and is used to determine the target spatial direction in which speech enhancement processing is needed; step 303 can be implemented independently as a speech enhancement method, which is usually performed by the terminal or the video camera and is used to perform speech enhancement processing on the sound signal from the target spatial direction determined in steps 301 and 302. The following description takes the terminal performing the sound source localization method and the video camera performing the speech enhancement method as an example.
To sum up, in the embodiments of the present application the speech enhancement system obtains a target image that includes N image regions; when a preset operation on a target image region among the N image regions is received, it determines the target spatial direction corresponding to the target image region and performs speech enhancement processing on the sound signal corresponding to that direction. The speech enhancement system can therefore perform sound source localization based on the target image region specified by the user's preset operation, so that the located spatial direction is the direction in which the user wants the voice enhanced; this improves the accuracy of sound source localization and the quality of the enhanced sound signal, and significantly improves the performance of the speech enhancement system.
Referring to FIG. 4, which shows a flowchart of a speech enhancement method provided by another exemplary embodiment of the present application. This embodiment is described with the speech enhancement method applied to the speech enhancement system shown in FIG. 1. The method includes the following steps.
Step 401: the video camera obtains the shot images corresponding to the M shooting areas.
The terminal stores the preset angular interval of the video collection area and the angular intervals of the M shooting areas included in the video collection area. For each shooting area, the video camera captures the shot image of that area through one camera.
Schematically, as shown in FIG. 5, the angular interval of the video collection area is [-180°, 180°], and the video collection area includes three shooting areas: shooting area 11 (angular interval (0°, 120°]), shooting area 12 (angular intervals (-180°, -120°] and (120°, 180°]), and shooting area 13 (angular interval (-120°, 0°]). The video camera includes a first camera, a second camera, and a third camera, with a one-to-one correspondence between the three cameras and the three shooting areas. At the same moment, the video camera captures shot image 1 of shooting area 11 through the first camera, shot image 2 of shooting area 12 through the second camera, and shot image 3 of shooting area 13 through the third camera.
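The angular layout of the FIG. 5 example can be sketched as a small lookup function. This is only an illustration of the interval bounds quoted above; the patent does not prescribe code for this mapping.

```python
# Sketch of the FIG. 5 angular layout: each shooting area owns part of
# the [-180°, 180°] video collection area and is served by one camera.
def shooting_area_for(angle):
    """Return the shooting area (11, 12 or 13) covering `angle`, a
    space angle in degrees within (-180, 180]."""
    if 0 < angle <= 120:
        return 11          # first camera, interval (0, 120]
    if angle <= -120 or angle > 120:
        return 12          # second camera, (-180, -120] and (120, 180]
    return 13              # third camera, interval (-120, 0]
```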
Step 402: the video camera stitches the M shot images to obtain the target image.
Optionally, the video camera stitches the shot images corresponding to the M shooting areas in the position order of the shooting areas to obtain the target image.
Schematically, based on the video collection area shown in FIG. 5, the video camera stitches shot image 1 of shooting area 11, shot image 2 of shooting area 12, and shot image 3 of shooting area 13 in sequence to obtain the target image.
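The stitching of step 402 can be sketched as a plain side-by-side concatenation of the three shot images in position order. This toy version models an image as a list of pixel rows and ignores details such as overlap blending, which a real stitcher would handle.

```python
# Minimal sketch of step 402: concatenate the M shot images side by
# side, in shooting-area position order. All shots are assumed to
# share the same height.
def stitch(shots):
    height = len(shots[0])
    return [sum((shot[row] for shot in shots), []) for row in range(height)]

shot1 = [[1, 1], [1, 1]]   # 2x2 stand-ins for the three camera shots
shot2 = [[2, 2], [2, 2]]
shot3 = [[3, 3], [3, 3]]
target = stitch([shot1, shot2, shot3])
# each row of `target` is now [1, 1, 2, 2, 3, 3]
```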
Step 403: the video camera sends the target image to the terminal.
The video camera sends the stitched target image to the terminal; correspondingly, the terminal receives the target image.
Step 404: the terminal receives and displays the target image.
The terminal displays the target image in, but not limited to, the following two possible implementations.
First possible implementation: when the terminal receives the target image sent by the video camera, it displays the target image directly on the display screen.
Second possible implementation: when the terminal receives the target image sent by the video camera, it divides the target image according to the number of shooting areas into the M shot images, then displays the M shot images on the display screen simultaneously or one after another. For ease of viewing by the user, only the first possible implementation is used as an example below.
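The second display implementation, splitting the target image back into its M shots, is the inverse of the stitching step and can be sketched as equal-width slicing. The even-split assumption is mine; the patent only says the division follows the number of shooting areas.

```python
def split_into_shots(target, m):
    """Split the stitched target image back into m equal-width shot
    images (assumes the image width divides evenly by m)."""
    width = len(target[0]) // m
    return [[row[i * width:(i + 1) * width] for row in target]
            for i in range(m)]

# a 2-row target image stitched from three 2x2 shots
target = [[1, 1, 2, 2, 3, 3],
          [1, 1, 2, 2, 3, 3]]
shots = split_into_shots(target, 3)
# shots[1] is [[2, 2], [2, 2]]
```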
Step 405: when the terminal receives a preset operation on the target image, it determines the image region corresponding to the preset operation as the target image region.
Optionally, the terminal divides the target image into N image regions according to the second possible division manner described above; when the terminal receives a preset operation on the target image, it determines, among the N image regions, the image region corresponding to the preset operation as the target image region.
Schematically, as shown in FIG. 6, the terminal divides the target image into three local regions: a first local region, a second local region, and a third local region, each corresponding to one shooting area. The terminal then further divides each local region into 8 image regions: the first local region includes image regions A1 to H1, the second local region includes image regions A2 to H2, and the third local region includes image regions A3 to H3, so that the target image is divided into 24 image regions in total. When the terminal receives a click operation on image region A1, it determines image region A1 as the target image region.
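Mapping a click to one of the 24 regions of FIG. 6 can be sketched as grid arithmetic over the click's x coordinate. The layout assumed here (three local regions side by side, each split into 8 equal columns labelled A–H) is an inference from the example; the patent does not fix the geometry.

```python
# Hypothetical sketch of step 405: map a click's x coordinate
# (0 <= x < image_width) to an image region label such as "A1".
def region_at(x, image_width):
    cols_per_local = 8
    col_width = image_width / (3 * cols_per_local)
    index = int(x / col_width)            # 0..23 across the image
    local = index // cols_per_local + 1   # local region 1, 2 or 3
    letter = "ABCDEFGH"[index % cols_per_local]
    return f"{letter}{local}"
```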
Step 406: the terminal determines the spatial direction corresponding to the target image region as the target spatial direction according to the first preset correspondence, which records the correspondence between image regions and spatial directions.
Optionally, the first preset correspondence between image regions and spatial directions is stored in the terminal. When the terminal determines the target image region, it determines the corresponding target spatial direction according to the first preset correspondence.
A spatial direction may be represented by a space angle or a space angle interval. To reduce the amount of stored data, the following description takes representation by a space angle as an example.
Schematically, based on the division of the target image shown in FIG. 6, the first preset correspondence between image regions and spatial directions is as shown in Table 1.
Table 1

Image region | Spatial direction | Image region | Spatial direction | Image region | Spatial direction
A1 | 15°  | A2 | 135°  | A3 | -120°
B1 | 30°  | B2 | 150°  | B3 | -105°
C1 | 45°  | C2 | 165°  | C3 | -90°
D1 | 60°  | D2 | 180°  | D3 | -75°
E1 | 75°  | E2 | -180° | E3 | -60°
F1 | 90°  | F2 | -165° | F3 | -45°
G1 | 105° | G2 | -150° | G3 | -30°
H1 | 120° | H2 | -135° | H3 | -15°
For example, after the terminal determines image region A1 as the target image region, it determines according to the first preset correspondence given in Table 1 that the target spatial direction corresponding to image region A1 is 15°.
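Table 1 can be held in the terminal as a plain lookup table; the sketch below writes it out literally and performs the step-406 lookup. The patent leaves the storage format open, so treat this as one possible encoding.

```python
# Table 1 written out as a dict: image region -> space angle (degrees).
FIRST_CORRESPONDENCE = {
    "A1": 15,   "B1": 30,   "C1": 45,   "D1": 60,
    "E1": 75,   "F1": 90,   "G1": 105,  "H1": 120,
    "A2": 135,  "B2": 150,  "C2": 165,  "D2": 180,
    "E2": -180, "F2": -165, "G2": -150, "H2": -135,
    "A3": -120, "B3": -105, "C3": -90,  "D3": -75,
    "E3": -60,  "F3": -45,  "G3": -30,  "H3": -15,
}

def target_direction(region):
    """Step 406: look up the target spatial direction (in degrees)
    for the given target image region."""
    return FIRST_CORRESPONDENCE[region]
```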
Step 407: the terminal sends the target spatial direction to the video camera.
The terminal sends the determined target spatial direction to the video camera; correspondingly, the video camera receives the target spatial direction sent by the terminal.
Step 408: the video camera performs speech enhancement processing on the sound signal corresponding to the target spatial direction.
The video camera collects the set of sound signals through its built-in microphone array and performs speech enhancement processing on the sound signal corresponding to the target spatial direction in, but not limited to, the following two possible implementations.
First possible implementation: the video camera performs speech enhancement processing on the sound signal from the target spatial direction and speech suppression processing on the sound signals from non-target spatial directions, where the non-target spatial directions are all spatial directions other than the target spatial direction.
Schematically, when the target spatial direction received from the terminal is 15°, the video camera performs speech enhancement processing on the sound signal from the 15° direction and speech suppression processing on the sound signals from all other spatial directions.
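The first implementation of step 408 amounts to a per-direction gain: unity for the target spatial direction, attenuation elsewhere. The suppression factor below is illustrative; the patent does not specify the suppression strength.

```python
# Sketch of enhance-one-direction, suppress-the-rest. The 0.1
# suppression factor is an assumed illustration.
def direction_gains(directions, target_direction, suppress=0.1):
    """Return a per-direction gain map: 1.0 for the target spatial
    direction, `suppress` for every non-target direction."""
    return {d: (1.0 if d == target_direction else suppress)
            for d in directions}
```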
Second possible implementation: the video camera determines, according to a second preset correspondence, the target local space corresponding to the target spatial direction, where the second preset correspondence records the correspondence between spatial directions and local spaces; it then performs speech enhancement processing on the sound signal from the target local space and speech suppression processing on the sound signals from non-target local spaces.
Here, the non-target local spaces are the spaces in the video collection area other than the target local space.
Optionally, the video camera constructs in advance the three-dimensional space corresponding to the video collection area from at least one camera, divides the three-dimensional space into N local spaces, and stores the second preset correspondence between spatial directions and local spaces. A local space is a part of the three-dimensional space of the scene in which the video camera is located.
Schematically, the three-dimensional space is divided in advance into 24 local spaces, namely local spaces A4 to H4, local spaces A5 to H5, and local spaces A6 to H6; the second preset correspondence between spatial directions and local spaces stored in the video camera is as shown in Table 2.
Table 2

Spatial direction | Local space | Spatial direction | Local space | Spatial direction | Local space
15°  | A4 | 135°  | A5 | -120° | A6
30°  | B4 | 150°  | B5 | -105° | B6
45°  | C4 | 165°  | C5 | -90°  | C6
60°  | D4 | 180°  | D5 | -75°  | D6
75°  | E4 | -180° | E5 | -60°  | E6
90°  | F4 | -165° | F5 | -45°  | F6
105° | G4 | -150° | G5 | -30°  | G6
120° | H4 | -135° | H5 | -15°  | H6
It should be noted that, since the target image is the image corresponding to the video collection area and the three-dimensional space is the space corresponding to the video collection area, the manner in which the video camera divides the three-dimensional space into N local spaces may or may not correspond to the manner in which the terminal divides the target image into N image regions. When the two division manners correspond, there is a correspondence between image regions and local spaces, and each image region has the same space angle range as its corresponding local space.
Schematically, as shown in FIG. 7, when the target spatial direction received from the terminal is 15°, the video camera determines, according to the second preset correspondence given in Table 2, that local space A4 corresponding to the target spatial direction 15° is the target local space 71, whose space angle range is (0°, 15°]. The video camera performs speech enhancement processing on the sound signal from the target local space 71 and speech suppression processing on the sound signals from all other local spaces.
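In the FIG. 7 example, each local space covers a 15° slice, e.g. target spatial direction 15° maps to local space A4 with range (0°, 15°]. Deciding whether a sound source falls inside the target local space then reduces to an interval test. The uniform-slice layout is inferred from that example, not stated generally in the patent.

```python
# Sketch of the second implementation's membership test, assuming
# each local space covers the 15° slice ending at its space angle.
def in_target_local_space(source_angle, target_direction):
    """True when a source at `source_angle` (degrees) falls inside
    the slice (target_direction - 15, target_direction] of the
    target local space."""
    return target_direction - 15 < source_angle <= target_direction
```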
Optionally, the video camera performs speech enhancement processing on the sound signal corresponding to the target spatial direction through an adaptive beamforming algorithm and outputs the enhanced sound signal.
The adaptive beamforming algorithm includes at least one of Minimum Variance Distortionless Response (MVDR), Generalized Sidelobe Canceller (GSC), and Transfer Function Generalized Sidelobe Canceller (TF-GSC).
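As a generic textbook sketch of the MVDR beamformer named above (not the patent's implementation): for a spatial covariance matrix R of the microphone signals and a steering vector d toward the target direction, the distortionless weights are w = R⁻¹d / (dᴴR⁻¹d). The uniform-linear-array geometry, mic spacing, and frequency below are all assumed for illustration.

```python
import numpy as np

def mvdr_weights(R, d):
    """Narrowband MVDR weights w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def steering_vector(n_mics, spacing_m, angle_deg, freq_hz, c=343.0):
    """Plane-wave steering vector for a uniform linear array."""
    delays = np.arange(n_mics) * spacing_m * np.sin(np.deg2rad(angle_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

# Steer a 6-mic array (matching the "at least six microphones" of the
# apparatus) toward the 15° target direction at 1 kHz; an identity R
# models a white-noise field, purely for illustration.
d = steering_vector(n_mics=6, spacing_m=0.05, angle_deg=15, freq_hz=1000)
R = np.eye(6)
w = mvdr_weights(R, d)
# distortionless constraint: w^H d equals 1
```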
In conclusion the embodiment of the present application will be also by that will preset when terminal receives the predetermined registration operation in target image It operates corresponding image-region and is determined as object region, it is according to the first default corresponding relationship, object region is corresponding Direction in space be determined as object space direction;It enables the terminal to according to specified object region, it is pre- by first If corresponding relationship determines corresponding object space direction, avoid when, there are when multi-acoustical, generally selecting sound most in environment Strong voice signal leads to the situation of auditory localization mistake as object space direction, ensure that the accuracy of auditory localization.
The embodiment of the present application is and right also by carrying out speech enhan-cement processing to the voice signal from object space direction Voice signal from non-targeted direction in space carries out voice suppression processing, effectively reduces the influence of ambient noise, greatly The noise robustness of speech-enhancement system is improved greatly.
The following are apparatus embodiments of the present application, which can be used to perform the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.
Referring to FIG. 8, which shows a schematic structural diagram of a speech enhancement apparatus provided by an exemplary embodiment of the present application. The speech enhancement apparatus can be implemented as all or part of the speech enhancement system in FIG. 1 by dedicated hardware circuits or by a combination of software and hardware. The speech enhancement apparatus includes an obtaining module 810, a determining module 820, and an enhancing module 830.
The obtaining module 810 is configured to obtain the target image of the video collection area, where the target image includes N image regions and N is a positive integer greater than 1.
The determining module 820 is configured to, when a preset operation on a target image region among the N image regions is received, determine the target spatial direction corresponding to the target image region, where the target spatial direction indicates the spatial direction in which speech enhancement processing needs to be performed.
The enhancing module 830 is configured to perform speech enhancement processing on the sound signal corresponding to the target spatial direction.
Optionally, the determining module 820 is further configured to, when the preset operation on the N image regions is received, determine the image region corresponding to the preset operation as the target image region, and to determine the spatial direction corresponding to the target image region as the target spatial direction according to the first preset correspondence, which records the correspondence between image regions and spatial directions.
Optionally, the enhancing module 830 is further configured to perform speech enhancement processing on the sound signal from the target spatial direction and speech suppression processing on the sound signals from non-target spatial directions, where the non-target spatial directions are all spatial directions other than the target spatial direction.
Optionally, the enhancing module 830 is further configured to determine, according to the second preset correspondence, the target local space corresponding to the target spatial direction, where the second preset correspondence records the correspondence between spatial directions and local spaces; and to perform speech enhancement processing on the sound signal from the target local space and speech suppression processing on the sound signals from non-target local spaces, where the non-target local spaces are the spaces in the video collection area other than the target local space.
Optionally, the video collection area includes M different shooting areas, where M is a positive integer greater than 1, and the obtaining module 810 is further configured to obtain the shot images corresponding to the M shooting areas and to stitch the M shot images to obtain the target image.
Optionally, the system includes a video camera and a terminal that are connected, and the video camera includes at least three cameras and at least six microphones.
The embodiments of the present application also provide a video camera that includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the speech enhancement method provided in the above method embodiments.
The embodiments of the present application also provide a terminal that includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the speech enhancement method provided in the above method embodiments.
The embodiments of the present application also provide a computer-readable storage medium that stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech enhancement method provided in the above method embodiments.
FIG. 9 shows a structural block diagram of a terminal 900 provided by an exemplary embodiment of the present application. The terminal 900 is the terminal connected to the video camera in the speech enhancement system described above. For example, the terminal 900 is a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer. The terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
Generally, the terminal 900 includes a processor 901 and a memory 902.
The processor 901 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), handles data in the awake state, while the coprocessor is a low-power processor that handles data in the standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for handling machine-learning computations.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 stores at least one instruction, which is executed by the processor 901 to implement the speech enhancement method provided in the method embodiments of the present application.
In some embodiments, the terminal 900 optionally further includes a peripheral interface 903 and at least one peripheral. The processor 901, the memory 902, and the peripheral interface 903 may be connected by buses or signal lines, and each peripheral may be connected to the peripheral interface 903 by a bus, a signal line, or a circuit board. Specifically, the peripherals include at least one of a radio frequency circuit 904, a touch display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one I/O (Input/Output)-related peripheral to the processor 901 and the memory 902. In some embodiments, the processor 901, the memory 902, and the peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 904 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission and converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 904 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 904 can communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication)-related circuits, which is not limited in the present application.
The display screen 905 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it also has the ability to collect touch signals on or above its surface; a touch signal may be input to the processor 901 as a control signal for processing. The display screen 905 may then also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 905, set on the front panel of the terminal 900; in other embodiments, there are at least two display screens 905, set on different surfaces of the terminal 900 or in a folding design; in still other embodiments, the display screen 905 is a flexible display screen set on a curved or folded surface of the terminal 900. The display screen 905 may even be set in a non-rectangular irregular shape, that is, a shaped screen. The display screen 905 may be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
The camera assembly 906 is used to capture images or video. Optionally, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is set on the front panel of the terminal and the rear camera on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to realize background blurring through fusion of the main camera and the depth-of-field camera, panoramic and VR (Virtual Reality) shooting through fusion of the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 906 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 901 for processing or to the radio frequency circuit 904 for voice communication. For stereo collection or noise reduction, there may be multiple microphones set at different parts of the terminal 900; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional diaphragm speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic position of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system.
The power supply 909 is used to supply power to the components in the terminal 900. The power supply 909 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, it may be a wired rechargeable battery charged through a wired line or a wireless rechargeable battery charged through a wireless coil. The rechargeable battery may also support fast charging technology.
In some embodiments, terminal 900 further includes having one or more sensors 910.The one or more sensors 910 include but is not limited to: acceleration transducer 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, Optical sensor 915 and proximity sensor 916.
The acceleration that acceleration transducer 911 can detecte in three reference axis of the coordinate system established with terminal 900 is big It is small.For example, acceleration transducer 911 can be used for detecting component of the acceleration of gravity in three reference axis.Processor 901 can With the acceleration of gravity signal acquired according to acceleration transducer 911, touch display screen 905 is controlled with transverse views or longitudinal view Figure carries out the display of user interface.Acceleration transducer 911 can be also used for the acquisition of game or the exercise data of user.
Gyro sensor 912 can detecte body direction and the rotational angle of terminal 900, and gyro sensor 912 can To cooperate with acquisition user to act the 3D of terminal 900 with acceleration transducer 911.Processor 901 is according to gyro sensor 912 Following function may be implemented in the data of acquisition: when action induction (for example changing UI according to the tilt operation of user), shooting Image stabilization, game control and inertial navigation.
The lower layer of side frame and/or touch display screen 905 in terminal 900 can be set in pressure sensor 913.Work as pressure When the side frame of terminal 900 is arranged in sensor 913, user can detecte to the gripping signal of terminal 900, by processor 901 Right-hand man's identification or prompt operation are carried out according to the gripping signal that pressure sensor 913 acquires.When the setting of pressure sensor 913 exists When the lower layer of touch display screen 905, the pressure operation of touch display screen 905 is realized to UI circle according to user by processor 901 Operability control on face is controlled.Operability control includes button control, scroll bar control, icon control, menu At least one of control.
The fingerprint sensor 914 is used to acquire the user's fingerprint; the processor 901 identifies the user's identity from the fingerprint acquired by the fingerprint sensor 914, or the fingerprint sensor 914 itself identifies the user's identity from the acquired fingerprint. When the user's identity is recognized as a trusted identity, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 914 may be disposed on the front, back, or side of terminal 900. When a physical button or manufacturer logo is provided on terminal 900, the fingerprint sensor 914 may be integrated with the physical button or manufacturer logo.
The optical sensor 915 is used to acquire the ambient light intensity. In one embodiment, the processor 901 can control the display brightness of the touch display screen 905 according to the ambient light intensity acquired by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 905 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 905 is decreased. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity acquired by the optical sensor 915.
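The brightness control described above can be sketched as a clamped mapping from ambient light to display brightness. The linear ramp, the brightness range, and the `max_lux` saturation point below are illustrative assumptions; the patent only states that brightness rises with ambient light.

```python
def display_brightness(lux, min_b=0.1, max_b=1.0, max_lux=1000.0):
    """Map ambient light intensity (lux) to a display brightness in
    [min_b, max_b]. The fraction of max_lux is clamped to [0, 1] so
    very bright scenes saturate at full brightness."""
    frac = min(max(lux / max_lux, 0.0), 1.0)
    return min_b + frac * (max_b - min_b)

print(display_brightness(0))      # 0.1  (dark room: dim the screen)
print(display_brightness(1000))   # 1.0  (bright light: full brightness)
```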
The proximity sensor 916, also called a distance sensor, is generally disposed on the front panel of terminal 900. The proximity sensor 916 is used to acquire the distance between the user and the front of terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front of terminal 900 is gradually decreasing, the processor 901 controls the touch display screen 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front of terminal 900 is gradually increasing, the processor 901 controls the touch display screen 905 to switch from the screen-off state to the screen-on state.
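The screen on/off switching described above can be sketched as a state machine driven by the proximity reading. The two-threshold (hysteresis) scheme and the specific distances below are illustrative assumptions added so the screen does not flicker around a single boundary; the patent only describes switching as the distance decreases or increases.

```python
def screen_state(prev_state, distance_cm, near_cm=3.0, far_cm=5.0):
    """Switch the display between 'on' and 'off' from the proximity
    reading. Two thresholds give hysteresis: the state only changes
    once the distance clearly crosses into the near or far zone."""
    if distance_cm <= near_cm:
        return "off"   # user close to the panel: blank the screen
    if distance_cm >= far_cm:
        return "on"    # user moved away: wake the screen
    return prev_state  # in between: keep the current state

print(screen_state("on", 2.0))   # off (phone raised to the ear)
print(screen_state("off", 6.0))  # on  (phone lowered again)
print(screen_state("off", 4.0))  # off (dead zone: state unchanged)
```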
Those skilled in the art will understand that the structure shown in Fig. 9 does not constitute a limitation on terminal 900, which may include more or fewer components than illustrated, combine certain components, or adopt a different component arrangement.
It should be noted that when the speech enhancement apparatus provided by the above embodiments performs speech enhancement, the division into the above functional modules is merely illustrative. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the speech enhancement method and the apparatus provided by the above embodiments belong to the same concept; for the specific implementation, refer to the method embodiments, which will not be repeated here.
The serial numbers of the above embodiments of the present application are for description only and do not represent the merits of the embodiments.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the scope of protection of the present application.

Claims (14)

1. A speech enhancement method, characterized in that the method comprises:
acquiring a target image of a video capture region, the target image comprising N image regions, N being a positive integer greater than 1;
when a preset operation on a target image region among the N image regions is received, determining a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing is required;
performing speech enhancement processing on the speech signal corresponding to the target spatial direction.
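The three steps of claim 1 can be sketched as a lookup from the operated image region to a spatial direction, followed by a per-direction gain on the captured speech signals. The region grid, direction labels, gain value, and function name below are illustrative assumptions, not details from the patent.

```python
def enhance_by_region(tapped_region, region_to_direction, signals,
                      gain_up=2.0):
    """Sketch of claim 1: given the image region the user operated on,
    look up the corresponding spatial direction and boost the speech
    signal from that direction. `signals` maps direction -> samples."""
    direction = region_to_direction[tapped_region]
    enhanced = dict(signals)  # shallow copy; other directions unchanged
    enhanced[direction] = [s * gain_up for s in signals[direction]]
    return direction, enhanced

# Illustrative 1x3 region grid mapped to three spatial directions.
mapping = {0: "left", 1: "center", 2: "right"}
sigs = {"left": [0.1, 0.2], "center": [0.3, 0.4], "right": [0.5, 0.6]}
d, out = enhance_by_region(1, mapping, sigs)
print(d)              # center
print(out["center"])  # [0.6, 0.8]
```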
2. The method according to claim 1, characterized in that, when the preset operation on the target image region among the N image regions is received, determining the target spatial direction corresponding to the target image region comprises:
when the preset operation on the N image regions is received, determining the image region corresponding to the preset operation as the target image region;
according to a first preset correspondence, determining the spatial direction corresponding to the target image region as the target spatial direction, the first preset correspondence comprising the correspondence between the image regions and the spatial directions.
3. The method according to claim 1, characterized in that performing speech enhancement processing on the speech signal corresponding to the target spatial direction comprises:
performing speech enhancement processing on the speech signal from the target spatial direction, and performing speech suppression processing on the speech signals from non-target spatial directions;
wherein the non-target spatial directions are the spatial directions other than the target spatial direction.
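The enhance/suppress split of claim 3 can be sketched as applying one gain to the target direction and a smaller gain to every other direction. The specific gain values are illustrative assumptions; the patent does not specify how enhancement or suppression is realized.

```python
def enhance_and_suppress(target, signals, gain_up=2.0, gain_down=0.25):
    """Sketch of claim 3: amplify the speech signal from the target
    spatial direction and attenuate the signals from all other
    (non-target) directions. `signals` maps direction -> samples."""
    result = {}
    for direction, samples in signals.items():
        g = gain_up if direction == target else gain_down
        result[direction] = [s * g for s in samples]
    return result

levels = {"left": [1.0], "center": [1.0], "right": [1.0]}
processed = enhance_and_suppress("left", levels)
print(processed["left"])    # [2.0]  enhanced
print(processed["right"])   # [0.25] suppressed
```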
4. The method according to claim 1, characterized in that performing speech enhancement processing on the speech signal corresponding to the target spatial direction comprises:
according to a second preset correspondence, determining a target local space corresponding to the target spatial direction, the second preset correspondence comprising the correspondence between the spatial directions and local spaces;
performing speech enhancement processing on the speech signal from the target local space, and performing speech suppression processing on the speech signals from non-target local spaces;
wherein the non-target local spaces are the spaces in the video capture region other than the target local space.
5. The method according to any one of claims 1 to 4, characterized in that the video capture region comprises M different shooting regions, M being a positive integer greater than 1, and acquiring the target image of the video capture region comprises:
acquiring the shot images respectively corresponding to the M shooting regions;
stitching the M shot images to obtain the target image.
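The stitching step of claim 5 can be sketched as concatenating the M shot images side by side. Representing images as equal-height lists of pixel rows is an illustrative assumption; a real stitcher would also align overlapping regions and blend seams, which the patent does not detail here.

```python
def stitch_images(shots):
    """Sketch of claim 5: concatenate M shot images horizontally to
    form the target image. Each image is a list of pixel rows; all
    images must share the same height (row count)."""
    height = len(shots[0])
    assert all(len(img) == height for img in shots), "heights differ"
    # Join row r of every image, left to right.
    return [sum((img[r] for img in shots), []) for r in range(height)]

# Two 2x2 "images" stitched into one 2x4 target image.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(stitch_images([a, b]))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```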
6. A speech enhancement apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a target image of a video capture region, the target image comprising N image regions, N being a positive integer greater than 1;
a determination module, configured to, when a preset operation on a target image region among the N image regions is received, determine a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing is required;
an enhancement module, configured to perform speech enhancement processing on the speech signal corresponding to the target spatial direction.
7. The apparatus according to claim 6, characterized in that the determination module is further configured to, when the preset operation on the N image regions is received, determine the image region corresponding to the preset operation as the target image region; and, according to a first preset correspondence, determine the spatial direction corresponding to the target image region as the target spatial direction, the first preset correspondence comprising the correspondence between the image regions and the spatial directions.
8. The apparatus according to claim 6, characterized in that the enhancement module is further configured to perform speech enhancement processing on the speech signal from the target spatial direction, and perform speech suppression processing on the speech signals from non-target spatial directions;
wherein the non-target spatial directions are the spatial directions other than the target spatial direction.
9. The apparatus according to claim 6, characterized in that the enhancement module is further configured to, according to a second preset correspondence, determine a target local space corresponding to the target spatial direction, the second preset correspondence comprising the correspondence between the spatial directions and local spaces; perform speech enhancement processing on the speech signal from the target local space; and perform speech suppression processing on the speech signals from non-target local spaces;
wherein the non-target local spaces are the spaces in the video capture region other than the target local space.
10. The apparatus according to any one of claims 6 to 9, characterized in that the video capture region comprises M different shooting regions, M being a positive integer greater than 1, and the acquisition module is further configured to acquire the shot images respectively corresponding to the M shooting regions, and stitch the M shot images to obtain the target image.
11. A video camera, characterized in that the video camera comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the speech enhancement method according to any one of claims 1 to 5.
12. A terminal, characterized in that the terminal comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by the processor to implement the speech enhancement method according to any one of claims 1 to 5.
13. A speech enhancement system, characterized in that the system comprises a video camera and a terminal, the video camera being connected to the terminal, and the video camera comprising at least three cameras and at least six microphones,
the terminal being configured to acquire a target image of a video capture region, the target image comprising N image regions, N being a positive integer greater than 1;
the terminal being further configured to, when a preset operation on a target image region among the N image regions is received, determine a target spatial direction corresponding to the target image region, the target spatial direction being used to indicate the spatial direction in which speech enhancement processing is required;
the terminal or the video camera being configured to perform speech enhancement processing on the speech signal corresponding to the target spatial direction.
14. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the speech enhancement method according to any one of claims 1 to 5.
CN201810185895.9A 2018-03-07 2018-03-07 Voice enhancement method and device Active CN110248197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810185895.9A CN110248197B (en) 2018-03-07 2018-03-07 Voice enhancement method and device

Publications (2)

Publication Number Publication Date
CN110248197A true CN110248197A (en) 2019-09-17
CN110248197B CN110248197B (en) 2021-10-22

Family

ID=67882419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810185895.9A Active CN110248197B (en) 2018-03-07 2018-03-07 Voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN110248197B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI708191B (en) * 2019-11-28 2020-10-21 睿捷國際股份有限公司 Sound source distribution visualization method and computer program product thereof
CN113450769A (en) * 2020-03-09 2021-09-28 杭州海康威视数字技术股份有限公司 Voice extraction method, device, equipment and storage medium
CN113542466A (en) * 2021-07-07 2021-10-22 Oppo广东移动通信有限公司 Audio processing method, electronic device and storage medium
WO2023231686A1 (en) * 2022-05-30 2023-12-07 荣耀终端有限公司 Video processing method and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060291816A1 (en) * 2005-06-28 2006-12-28 Sony Corporation Signal processing apparatus, signal processing method, program, and recording medium
CN105474665A (en) * 2014-03-31 2016-04-06 松下知识产权经营株式会社 Sound processing apparatus, sound processing system, and sound processing method
CN105474666A (en) * 2014-04-25 2016-04-06 松下知识产权经营株式会社 Audio processing apparatus, audio processing system, and audio processing method
US20160241818A1 (en) * 2015-02-18 2016-08-18 Honeywell International Inc. Automatic alerts for video surveillance systems
CN107230187A (en) * 2016-03-25 2017-10-03 北京三星通信技术研究有限公司 The method and apparatus of multimedia signal processing

Also Published As

Publication number Publication date
CN110248197B (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN110992493B (en) Image processing method, device, electronic equipment and storage medium
CN110427110A (en) A kind of live broadcasting method, device and direct broadcast server
CN110248197A (en) Sound enhancement method and device
CN109982102A (en) The interface display method and system and direct broadcast server of direct broadcasting room and main broadcaster end
CN108256505A (en) Image processing method and device
CN109285178A (en) Image partition method, device and storage medium
CN109558837A (en) Face critical point detection method, apparatus and storage medium
CN110121094A (en) Video is in step with display methods, device, equipment and the storage medium of template
CN109302632A (en) Obtain method, apparatus, terminal and the storage medium of live video picture
CN109166150A (en) Obtain the method, apparatus storage medium of pose
CN109862412A (en) It is in step with the method, apparatus and storage medium of video
CN110081902A (en) Direction indicating method, device and terminal in navigation
CN109192218A (en) The method and apparatus of audio processing
CN109859102A (en) Special display effect method, apparatus, terminal and storage medium
CN110163833A (en) The method and apparatus for determining the folding condition of disconnecting link
CN109547843A (en) The method and apparatus that audio-video is handled
CN110225390A (en) Method, apparatus, terminal and the computer readable storage medium of video preview
CN109254775A (en) Image processing method, terminal and storage medium based on face
CN109065068A (en) Audio-frequency processing method, device and storage medium
CN109660876A (en) The method and apparatus for showing list
CN108965769A (en) Image display method and device
CN109117466A (en) table format conversion method, device, equipment and storage medium
CN110147796A (en) Image matching method and device
CN108829582A (en) The method and apparatus of program compatibility
CN109413440A (en) Virtual objects management method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant