CN110248197B - Voice enhancement method and device - Google Patents

Voice enhancement method and device Download PDF

Info

Publication number
CN110248197B
CN110248197B CN201810185895.9A CN201810185895A CN110248197B CN 110248197 B CN110248197 B CN 110248197B CN 201810185895 A CN201810185895 A CN 201810185895A CN 110248197 B CN110248197 B CN 110248197B
Authority
CN
China
Prior art keywords
target
space
image
target image
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810185895.9A
Other languages
Chinese (zh)
Other versions
CN110248197A (en
Inventor
陈扬坤
钱能锋
陈展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810185895.9A priority Critical patent/CN110248197B/en
Publication of CN110248197A publication Critical patent/CN110248197A/en
Application granted granted Critical
Publication of CN110248197B publication Critical patent/CN110248197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23412Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Studio Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application discloses a voice enhancement method and a voice enhancement device, and belongs to the field of multimedia processing. The method comprises the following steps: acquiring a target image, wherein the target image comprises N image areas; when preset operation on a target image area in the N image areas is received, a target space direction corresponding to the target image area is determined, and voice enhancement processing is carried out on a sound signal corresponding to the target space direction. According to the method and the device, the sound source is positioned through the voice enhancement system according to the target image area appointed by the user through preset operation, and then the positioned target space direction is the direction of the voice required to be enhanced by the user, so that the accuracy of sound source positioning and the quality of the enhanced sound signal are improved, and the performance of the voice enhancement system is greatly improved.

Description

Voice enhancement method and device
Technical Field
The embodiment of the application relates to the field of multimedia processing, in particular to a voice enhancement method and device.
Background
The speech enhancement method is a method of extracting a useful sound signal from environmental noise to reduce noise interference.
Currently, taking a microphone array-based speech enhancement method as an example, the speech enhancement method includes: the camera collects sound signals by using a plurality of microphones respectively, and performs spatial filtering according to spatial phase information contained in the collected sound signals respectively to form a spatial beam with a pointing direction, so as to enhance the sound signals in the designated direction.
However, in the above method, when there are a plurality of sound signals in the use environment or the environmental noise is large, since the camera usually selects the sound signal with the strongest sound for enhancement, there is a possibility that the enhanced sound signal does not coincide with the sound signal that the user actually needs to enhance.
Disclosure of Invention
In order to solve the problem of inaccurate sound source positioning in the language enhancement process in the related art, the embodiment of the application provides a voice enhancement method and a voice enhancement device. The technical scheme is as follows:
in a first aspect, a speech enhancement method is provided, the method comprising:
acquiring a target image of a video acquisition area, wherein the target image comprises N image areas, and N is a positive integer greater than 1;
when preset operation on a target image area in the N image areas is received, determining a target space direction corresponding to the target image area, wherein the target space direction is used for indicating a space direction needing voice enhancement processing;
and carrying out voice enhancement processing on the sound signal corresponding to the target space direction.
Optionally, when a preset operation on a target image area in the target image is received, determining a target spatial direction corresponding to the target image area includes:
when a preset operation in the target image is received, determining an image area corresponding to the preset operation as the target image area;
and determining a space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the first preset corresponding relation comprises a corresponding relation between the image area and the space direction.
Optionally, the performing, by the speech enhancement processing on the sound signal corresponding to the target spatial direction, includes:
carrying out voice enhancement processing on the sound signals from the target space direction, and carrying out voice suppression processing on the sound signals from the non-target space direction;
wherein the non-target spatial direction is a spatial direction other than the target spatial direction in the video acquisition area.
Optionally, the performing, by the speech enhancement processing on the sound signal corresponding to the target spatial direction, includes:
determining a target local space corresponding to the target space direction according to a second preset corresponding relation, wherein the second preset corresponding relation comprises a corresponding relation between the space direction and the local space;
carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
wherein the non-target local space is other space in the video acquisition area except the target local space.
Optionally, the video capturing area includes M different shooting areas, where M is a positive integer greater than 1, and the acquiring a target image of the video capturing area includes:
acquiring shot images corresponding to the M shooting areas respectively;
and splicing the M shot images to obtain the target image.
In a second aspect, there is provided a speech enhancement apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target image of a video acquisition area, wherein the target image comprises N image areas, and N is a positive integer greater than 1;
a determining module, configured to determine, when a preset operation on a target image area in the N image areas is received, a target spatial direction corresponding to the target image area, where the target spatial direction is used to indicate a spatial direction in which voice enhancement processing needs to be performed;
and the enhancement module is used for carrying out voice enhancement processing on the sound signal corresponding to the target space direction.
Optionally, the determining module is further configured to determine, when a preset operation in the N image areas is received, an image area corresponding to the preset operation as the target image area; and determining the space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the first preset corresponding relation comprises the corresponding relation between the image area and the space direction.
Optionally, the enhancement module is further configured to perform speech enhancement processing on the sound signal from the target spatial direction, and perform speech suppression processing on the sound signal from the non-target spatial direction;
wherein the non-target spatial direction is a spatial direction other than the target spatial direction.
Optionally, the enhancing module is further configured to determine a target local space corresponding to the target space direction according to a second preset corresponding relationship, where the second preset corresponding relationship includes a corresponding relationship between the space direction and the local space; carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
wherein the non-target local space is other space in the video acquisition area except the target local space.
Optionally, the video acquisition area includes M different shooting areas, where M is a positive integer greater than 1, and the obtaining module is further configured to obtain shooting images corresponding to the M shooting areas respectively; and splicing the M shot images to obtain the target image.
In a third aspect, a camera is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the speech enhancement method as provided in any one of the first aspect and the first possible implementation manner.
In a fourth aspect, a terminal is provided, where the terminal includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the speech enhancement method according to the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, a speech enhancement system is provided, the system comprising a camera and a terminal, the camera being connected to the terminal, the camera comprising at least three cameras and at least six microphones,
the terminal is used for acquiring a target image of a video acquisition area, wherein the target image comprises N image areas, and N is a positive integer greater than 1;
the terminal is further configured to determine a target spatial direction corresponding to a target image area when a preset operation on the target image area in the N image areas is received, where the target spatial direction is used to indicate a spatial direction in which voice enhancement processing is required;
and the terminal or the camera is used for carrying out voice enhancement processing on the sound signal corresponding to the target space direction.
A sixth aspect provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the speech enhancement method as provided in the first aspect and any one of the possible implementations of the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
acquiring a target image through a voice enhancement system, wherein the target image comprises N image areas; when preset operation on a target image area in the N image areas is received, determining a target space direction corresponding to the target image area, and performing voice enhancement processing on a sound signal corresponding to the target space direction; the voice enhancement system can perform sound source positioning according to the target image area appointed by the user through preset operation, and further the positioned target space direction is the direction of the voice required to be enhanced by the user, so that the accuracy of sound source positioning and the quality of the enhanced sound signal are improved, and the performance of the voice enhancement system is greatly improved.
Drawings
FIG. 1 is a schematic block diagram of a speech enhancement system provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a camera in a speech enhancement system according to an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of speech enhancement provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a method of speech enhancement provided by another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a manner in which a video capture area is partitioned according to a speech enhancement method provided by an exemplary embodiment of the present application;
FIG. 6 is a diagram illustrating a division manner of a target image involved in a speech enhancement method according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a speech enhancement method provided by an exemplary embodiment of the present application;
FIG. 8 is a block diagram of a speech enhancement apparatus provided in an exemplary embodiment of the present application;
fig. 9 is a block diagram of a terminal according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic structural diagram of a speech enhancement system according to an exemplary embodiment of the present application is shown. The speech enhancement system comprises: a camera 120 and a terminal 140.
The camera 120 includes at least one camera and a microphone array, and the camera 120 is configured to acquire a target image of a video acquisition area through the at least one camera and acquire various sound signals through the microphone array.
Optionally, M cameras are arranged in the camera 120, the video acquisition area is correspondingly divided into M different shooting areas, each camera has a one-to-one correspondence with the shooting area, and the camera 120 is configured to acquire the shooting images of the corresponding shooting areas through the M cameras, and splice the M shooting images to obtain the target image. That is, the target image includes captured images corresponding to M captured regions, M being a positive integer greater than 1. The target image may be considered a panoramic image or a wide-angle image.
Wherein, there is not the intersection or exists at least two and intersects in M different shooting areas.
Optionally, the video capture area is a circular area, and at least one of the M shooting areas is a sector area or all of the M shooting areas are sector areas.
Optionally, the microphone array is a ring-shaped microphone array, which includes at least six microphones.
In the following, the camera 120 including three cameras and eight microphones will be described as an example. Please refer to the schematic structure of the camera 120 shown in fig. 2. The camera 120 includes three cameras 122 and eight microphones 124.
The three cameras 122 are a first camera 122, a second camera 122, and a third camera 122, respectively.
The three cameras 122 are arranged in a dispersed manner with respect to an origin, which is a position of a center point of the camera 120, from which the camera 120 establishes a coordinate system.
Optionally, a method for establishing a coordinate system includes: the central point of the camera is used as the original point, the direction in which the central point points to the preset direction is the positive direction of the y axis, and the direction which is perpendicular to the y axis and points to the right side is the positive direction of the x axis. The present embodiment is illustrated with the coordinate system in conjunction with fig. 2. The method for establishing the coordinate system is not limited in this embodiment.
The three cameras 122 respectively correspond to one shooting area, and each camera 122 is used for acquiring a shot image of the corresponding shooting area. Optionally, the first camera 122 is configured to collect a shot image of a first shot area, where the first shot area is an area corresponding to the positive direction of the y-axis by 0 ° to 120 °; the second camera 122 is configured to acquire a shot image of a second shot area, where the second shot area corresponds to the positive direction of the y-axis at an angle of 120 ° to 240 °; the third camera 122 is configured to capture a captured image of a third captured area, where the third captured area corresponds to a positive y-axis direction at 240 ° to 360 °.
In this embodiment, the value ranges of the first preset included angle and the second preset included angle are not limited, and the following description only takes the first preset included angle and the second preset included angle as 120 degrees as an example.
Optionally, the eight microphones 124 are distributed with respect to the origin, and distances between each of the eight microphones 124 are equal, or distances between each of the eight microphones are unequal, or there is at least one distance between two microphones that is equal.
Optionally, any four microphones 124 of the eight microphones 124 are on the same plane, or there are at least four microphones 124 not on the same plane.
The camera 120 may be fixed or rotatable in terms of the types of camera and microphone.
It should be noted that, in this embodiment, neither the position nor the type of the camera nor the microphone is limited.
The camera 120 is configured to acquire a target image of a video capture area and send the acquired target image to the terminal 140. Correspondingly, the terminal 140 receives the target image.
Optionally, the camera 120 establishes a communication connection with the terminal 140 through a wireless network or a wired network.
The terminal 120 is a terminal having a display screen, such as a mobile phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III, motion Picture Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion Picture Experts compression standard Audio Layer 4), a laptop computer, a desktop computer, and the like.
Optionally, the display screen is a liquid crystal display screen or an OLED display screen; illustratively, the liquid crystal display panel includes at least one of a STN (Super Twisted Nematic) screen, a ufb (ultra Film bright) screen, a TFD (Thin Film Diode) screen, and a TFT (Thin Film Transistor) screen.
Generally, the terminal 140 receives a target image transmitted from the camera 120 and displays the target image on a display screen. When the terminal 140 receives a preset operation on a target image area in a target image, a target space direction corresponding to the target image area is determined, and a voice enhancement process is performed on a voice signal corresponding to the target space direction.
Optionally, the wireless or wired networks described above use standard communication techniques and/or protocols. The Network is typically the Internet, but may be any Network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wireline or wireless Network, a private Network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including Hypertext Mark-up Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
Referring to fig. 3, a flow chart of a speech enhancement method provided by an exemplary embodiment of the present application is shown. The embodiment is exemplified by applying the speech enhancement method to the speech enhancement system shown in fig. 1. The speech enhancement method comprises the following steps:
step 301, acquiring a target image of a video acquisition area, where the target image includes N image areas, and N is a positive integer greater than 1.
Optionally, the acquiring, by the language enhancement system, the target image of the video capture area includes: the camera collects a target image of a video collection area, the collected target image is sent to the terminal, and correspondingly, the terminal receives the target image.
The camera acquires a target image of a video acquisition area in real time or at preset time intervals, and the target image is used for indicating the surrounding environment of the camera.
Optionally, the video capture area is a preset area for capturing the target image, and the video capture area includes a whole area or a preset local area of a scene where the camera is located.
When the video acquisition area is the whole area, the target image is a panoramic image of the whole area; and when the video acquisition area is the preset local area, the target image is a local image of the preset local area. The following description will be given only by taking the target image as a panoramic image.
Optionally, after the terminal acquires the target image, the terminal divides the target image into N image regions according to a preset division rule, where the preset division rule is used to indicate the number of divided image regions and the region size of each image region.
The size of the at least two image areas in the N image areas is the same, or the size of the at least two image areas is different, or the size of any two image areas is the same. In the following, the description will be given only by taking an example in which the area sizes of the N image areas are the same.
The terminal divides the target image into N image areas according to the preset division rule, including but not limited to the following two possible division modes:
in a first possible dividing manner, the terminal divides the target image into N image areas according to the number of the shooting areas, and each image area corresponds to one shooting area.
In a second possible dividing manner, the terminal divides the target image into M local regions, each local region corresponds to one shooting region, and for each divided local region, the terminal further divides the local region and divides the local region into K image regions, that is, the target image is divided into M × K image regions in total, where K is a positive integer greater than 1. The value of the image area is not limited in this embodiment. In the following, only the second possible division manner is described as an example, where the value of M is 3, and the value of K is 8, that is, the target image includes 24 image areas. The specific division can refer to the related description in the following embodiments, which will not be introduced here.
Step 302, when receiving a preset operation on a target image area in the N image areas, determining a target spatial direction corresponding to the target image area, where the target spatial direction is used to indicate a spatial direction in which a voice enhancement process is required.
Optionally, when the language enhancement system receives a preset operation on a target image region in the target image, determining a target spatial direction corresponding to the target image region includes: the method comprises the steps that after a terminal acquires a target image, the target image is displayed on a display screen, when the terminal receives a preset operation in the target image, an image area corresponding to the preset operation is determined as a target image area, and a target space direction corresponding to the target image area is determined.
The preset operation is a user operation for determining a target image area among the N image areas. Illustratively, the preset operation includes any one or a combination of a click operation, a slide operation, a press operation, and a long press operation.
In other possible implementations, the preset operation may also be implemented in a voice form. For example, a user inputs preset information corresponding to a target image area in a voice form in a terminal, after the target terminal acquires a voice signal, the voice signal is analyzed to acquire voice content, and when a keyword matched with the preset information corresponding to the target image area exists in the voice content, the terminal determines the target image area corresponding to the preset information.
And the terminal determines a target space direction corresponding to the target image area according to the determined target image area and a first preset corresponding relation, wherein the first preset corresponding relation comprises the corresponding relation between the image area and the space direction. The procedure of the terminal determining the target spatial direction may refer to the related description in the following embodiments, which will not be introduced first.
The spatial direction may be represented by a spatial angle or a spatial angle interval. The spatial angle is an angle from the positive direction of the y-axis in the coordinate system established above.
Optionally, an included angle formed between the clockwise direction and the positive direction of the y axis is a negative angle, and an included angle formed between the counterclockwise direction and the positive direction of the y axis is a positive angle. The spatial direction is not limited in the present embodiment.
Illustratively, the target image includes 24 image areas, when the terminal receives a preset operation in the target image, a target image area a corresponding to the preset operation is determined in the 24 image areas, and a target spatial direction corresponding to the target image area a is determined to be 30 ° according to the first preset correspondence.
Step 303, performing speech enhancement processing on the sound signal corresponding to the target spatial direction.
The speech enhancement system performs speech enhancement processing on the sound signal corresponding to the target spatial direction, including but not limited to the following two possible implementation manners:
a first possible implementation: the terminal acquires a sound signal set of a video acquisition area, and performs voice enhancement processing on a sound signal corresponding to a target space direction in the sound signal set.
Optionally, the camera collects a sound signal set of the video collection area through the microphone array, and sends the sound signal set to the terminal, and correspondingly, the terminal receives the sound signal set sent by the camera. The terminal performs speech enhancement processing on the sound signal from the target spatial direction.
A second possible implementation: and the camera receives the target space direction sent by the terminal and performs voice enhancement processing on the sound signal corresponding to the acquired target space direction.
Optionally, when the terminal determines the target spatial direction, the target spatial direction is sent to the camera, and correspondingly, the camera receives the target spatial direction and performs the voice enhancement processing on the sound signal from the target spatial direction. The process of the camera performing speech enhancement processing on the sound signal from the target spatial direction can refer to the following embodiments, which will not be described herein.
Illustratively, when the terminal determines that the target spatial direction is 30 °, the camera performs speech enhancement processing on the sound signal from a direction of 30 ° from the positive direction of the y-axis.
It should be noted that, step 302 and step 303 can be implemented separately as a sound source localization method, which is usually performed by a terminal, for determining a target spatial direction to be subjected to a speech enhancement process; step 303 may be implemented separately as a speech enhancement method, which is usually performed by a terminal or a camera, for performing speech enhancement processing on the sound signal from the target spatial direction according to the target spatial direction determined in steps 202 and 203. In the following, the method of sound source localization by the terminal and the method of voice enhancement by the camera will be described as examples.
To sum up, in the embodiment of the present application, a target image is obtained through a voice enhancement system, and the target image includes N image areas; when preset operation on a target image area in the N image areas is received, determining a target space direction corresponding to the target image area, and performing voice enhancement processing on a sound signal corresponding to the target space direction; the voice enhancement system can perform sound source positioning according to the target image area appointed by the user through preset operation, and further the positioned target space direction is the direction of the voice required to be enhanced by the user, so that the accuracy of sound source positioning and the quality of the enhanced sound signal are improved, and the performance of the voice enhancement system is greatly improved.
Referring to fig. 4, a flowchart of a speech enhancement method provided by another exemplary embodiment of the present application is shown. The embodiment is exemplified by applying the speech enhancement method to the speech enhancement system shown in fig. 1. The method comprises
In step 401, a camera acquires shot images corresponding to M shot areas.
The terminal stores a preset angle interval of a video acquisition area and angle intervals corresponding to M shooting areas included in the video acquisition area, and for each shooting area, a camera acquires a shooting image of the shooting area through a camera.
Illustratively, as shown in fig. 5, the angle interval of the video capture area is [ -180, 180], the video capture area includes three capture areas, namely, a capture area 11 (angle interval is (0, 120]), a capture area 12 (angle interval is (-180, -120] and (120, 180]), and a capture area 13 (angle interval is (-120, 0]), the camera includes a first camera, a second camera, and a third camera, the three cameras and the three capture areas have a one-to-one correspondence relationship, and at the same time, the camera captures a capture image 1 of the capture area 11 through the first camera, the second camera captures a capture image 2 of the capture area 12, and the second camera captures a capture image 3 of the capture area 13.
And step 402, splicing the M shot images by the camera to obtain a target image.
Optionally, the camera splices the shot images corresponding to the M shot areas according to the position sequence of the shot areas to obtain the target image.
Schematically, based on the video capture area shown in fig. 5, the terminal sequentially splices the captured image 1 of the capture area 11, the captured image 2 of the capture area 12, and the captured image 3 of the capture area 13 to obtain a target image.
In step 403, the camera sends the target image to the terminal.
And the camera sends the spliced target image to the terminal, and correspondingly, the terminal receives the target image.
In step 404, the terminal receives and displays the target image.
The method for displaying the target image by the terminal includes but is not limited to the following two possible implementation methods:
a first possible implementation: and when the terminal receives the target image sent by the camera, the target image is directly displayed on the display screen.
A second possible implementation: when the terminal receives the target images sent by the camera, the target images are divided according to the number of the shooting areas to obtain shooting images corresponding to the M shooting areas, and the M shooting images are displayed on the display screen at the same time or are sequentially displayed. For the convenience of the user to view, only the first possible implementation manner is described below as an example.
Step 405, when the terminal receives a preset operation in the target image, determining an image area corresponding to the preset operation as the target image area.
Optionally, the terminal divides the target image into N image areas according to the second possible division manner, and when the terminal receives a preset operation in the target image, determines an image area corresponding to the preset operation in the N image areas as the target image area.
Illustratively, as shown in fig. 6, the terminal divides the target image into three local regions, respectively a first local region, a second local region and a third local region, each local region corresponding to one photographing region, and for each of the divided local regions, the terminal further divides the local region into 8 image regions, i.e., the first local region includes image region a1 to image region H1, the second local region includes image region a2 to image region H2, and the third local region includes image region A3 to image region H3, thereby dividing the target image into 24 image regions in total. When the terminal receives a click operation on the image region a1, the image region a1 is determined as a target image region.
Step 406, the terminal determines the spatial direction corresponding to the target image area as the target spatial direction according to a first preset corresponding relationship, where the first preset corresponding relationship includes a corresponding relationship between the image area and the spatial direction.
Optionally, the terminal stores a first preset corresponding relationship between the image area and the spatial direction. And when the terminal determines the target image area, determining a target space direction corresponding to the target image area according to the first preset corresponding relation.
The spatial direction may be represented by a spatial angle or a spatial angle interval. In order to reduce the amount of data storage, the following description will be given only by taking the spatial direction as an example of a spatial angle.
Illustratively, based on the dividing manner of the target image provided in fig. 6, a first preset corresponding relationship between the image area and the spatial direction is shown in table one.
Watch 1
Image area Direction of space Image area Direction of space Image area Direction of space
A1 15° A2 135° A3 -120°
B1 30° B2 150° B3 -105°
C1 45° C2 165° C3 -90°
D1 60° D2 180° D3 -75°
E1 75° E2 -180° E3 -60°
F1 90° F2 -165° F3 -45°
G1 105° G2 -150° G3 -30°
H1 120° H2 -135° H3 -15°
For example, after the terminal determines the image area a1 as the target image area, it determines that the target spatial direction corresponding to the target image area a1 is "15 °", according to the first preset corresponding relationship provided in the table one.
Step 407, the terminal sends the target space direction to the camera.
And the terminal sends the determined target space direction to the camera, and correspondingly, the camera receives the target space direction sent by the terminal.
In step 408, the camera performs speech enhancement processing on the sound signal corresponding to the target spatial direction.
The camera collects a sound signal set through a built-in microphone array, and performs speech enhancement processing on a sound signal corresponding to a target space direction, including but not limited to the following two possible implementation manners:
in a first possible implementation, the camera performs speech enhancement processing on the sound signals from the target spatial direction and performs speech suppression processing on the sound signals from the non-target spatial direction. Wherein the non-target spatial direction is a spatial direction other than the target spatial direction.
Illustratively, when the camera receives that the target spatial direction transmitted by the terminal is "15 °, the voice enhancement processing is performed on the voice signal from the 15 ° direction, and the voice suppression processing is performed on the voice signal from the spatial direction other than 15 °.
In a second possible implementation manner, the camera determines a target local space corresponding to the target space direction according to a second preset corresponding relationship, where the second preset corresponding relationship includes a corresponding relationship between the space direction and the local space; the sound signal from the target local space is subjected to a speech enhancement process, and the sound signal from the non-target local space is subjected to a speech suppression process.
And the non-target local space is other space except the target local space in the video acquisition area.
Optionally, the camera pre-constructs a three-dimensional space corresponding to the video acquisition area according to at least one camera, and divides the three-dimensional space into N local spaces, where a second preset relationship between the spatial direction and the local spaces is stored in the camera. The local space refers to a local three-dimensional space in a scene where the camera is located.
Illustratively, the three-dimensional space is divided into 24 partial spaces in advance, that is, a partial space a4 through a partial space H4, a partial space a5 through a partial space H5, and a partial space a6 through a partial space H6, and a second preset relationship between the spatial direction and the partial space stored in the camera is as shown in table two.
Watch two
Direction of space Local space Direction of space Local space Direction of space Local space
15° A4 135° A5 -120° A6
30° B4 150° B5 -105° B6
45° C4 165° C5 -90° C6
60° D4 180° D5 -75° D6
75° E4 -180° E5 -60° E6
90° F4 -165° F5 -45° F6
105° G4 -150° G5 -30° G6
120° H4 -135° H5 -15° H6
It should be noted that, because the target image is an image corresponding to the video capture area, and the three-dimensional space is a space corresponding to the video capture area, the dividing manner in which the camera divides the three-dimensional space into N local spaces may be corresponding to the dividing manner in which the terminal divides the target image into N image areas, or may not be corresponding to the dividing manner. When the two division modes correspond to each other, the image regions and the local spaces have corresponding relations, and the spatial angle range of each image region and the local space corresponding to the image region is the same.
Illustratively, as shown in fig. 7, when the camera receives that the target spatial direction transmitted by the terminal is "15 °", the camera determines that the local space a4 corresponding to the target spatial direction "15 °" is the target local space 71, the spatial angle range corresponding to the target local space 71 is (0, 15 ° ]), and the camera performs the voice enhancement processing on the voice signal from the target local space 71 and performs the voice suppression processing on the voice signal from the local space other than the target local space 71, according to the second preset correspondence provided in the above table two.
Optionally, the camera performs speech enhancement processing on the sound signal corresponding to the target space direction through a self-adaptive beam forming algorithm, and outputs the enhanced sound signal.
Wherein the adaptive beamforming algorithm comprises at least one of a Minimum Variance Distortionless Response (MVDR), a Generalized Sidelobe Canceller (GSC), and a Transfer Function Generalized Sidelobe Canceller (TF-GSC).
In summary, in the embodiment of the present application, when the terminal receives a preset operation in the target image, the image area corresponding to the preset operation is determined as the target image area, and according to the first preset corresponding relationship, the spatial direction corresponding to the target image area is determined as the target spatial direction; the terminal can determine the corresponding target space direction through the first preset corresponding relation according to the appointed target image area, so that the situation that when a plurality of sound sources exist in the environment, the sound signal with the strongest sound is usually selected as the target space direction to cause sound source positioning error is avoided, and the accuracy of sound source positioning is ensured.
The embodiment of the application also performs voice enhancement processing on the sound signals from the target space direction and performs voice suppression processing on the sound signals from the non-target space direction, thereby effectively reducing the influence of environmental noise and greatly improving the anti-noise performance of the voice enhancement system.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 8, which illustrates a schematic structural diagram of a speech enhancement apparatus according to an exemplary embodiment of the present application. The speech enhancement device can be implemented by a dedicated hardware circuit, or a combination of hardware and software, as all or part of the speech enhancement system in fig. 1, and comprises: an acquisition module 810, a determination module 820, and an enhancement module 830.
The acquiring module 810 is configured to acquire a target image of a video acquisition area, where the target image includes N image areas, and N is a positive integer greater than 1;
a determining module 820, configured to determine, when a preset operation on a target image area in the N image areas is received, a target spatial direction corresponding to the target image area, where the target spatial direction is used to indicate a spatial direction in which voice enhancement processing needs to be performed;
and the enhancing module 830 is configured to perform speech enhancement processing on the sound signal corresponding to the target spatial direction.
Optionally, the determining module 820 is further configured to determine, when a preset operation in the N image areas is received, an image area corresponding to the preset operation as a target image area; and determining the space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the first preset corresponding relation comprises the corresponding relation between the image area and the space direction.
Optionally, the enhancing module 830 is further configured to perform speech enhancement processing on the sound signal from the target spatial direction, and perform speech suppression processing on the sound signal from the non-8 target spatial directions;
wherein the non-target spatial direction is a spatial direction other than the target spatial direction.
Optionally, the enhancing module 830 is further configured to determine a target local space corresponding to the target space direction according to a second preset corresponding relationship, where the second preset corresponding relationship includes a corresponding relationship between the space direction and the local space; carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
and the non-target local space is other space except the target local space in the video acquisition area.
Optionally, the video acquisition area includes M different shooting areas, where M is a positive integer greater than 1, and the obtaining module 810 is further configured to obtain shooting images corresponding to the M shooting areas; and splicing the M shot images to obtain a target image.
Optionally, the apparatus includes a camera and a terminal, the camera is connected to the terminal, and the camera includes at least three cameras and at least six microphones.
The embodiment of the present application further provides a camera, where the camera includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the speech enhancement method provided in the foregoing method embodiments.
The embodiment of the present application further provides a terminal, where the terminal includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the speech enhancement method provided in each of the above method embodiments.
The present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the speech enhancement method provided in the above-mentioned method embodiments.
Fig. 9 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal 900 is a terminal connected to a camera in the above-described voice enhancement system. For example, the terminal 900 is a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion Picture Experts Group Audio Layer IV, motion Picture Experts Group Audio Layer 4), an MP4 player, a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, and the like.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the speech enhancement methods provided by the various method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, providing the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in still other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
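As a hedged illustration of the panoramic workflow referenced throughout this application, the following minimal sketch stitches shots from several cameras into one target image and divides it into N image areas by shot count. It assumes OpenCV is available; the file names and the equal-width division are illustrative placeholders, not the specific method of this application.

```python
import cv2

# Placeholder paths for the M shot images from the M shooting areas
paths = ("area0.jpg", "area1.jpg", "area2.jpg")
images = [cv2.imread(p) for p in paths]

stitcher = cv2.Stitcher_create()            # OpenCV 4.x stitching API
status, panorama = stitcher.stitch(images)  # panorama plays the role of the target image
if status == cv2.Stitcher_OK:
    h, w = panorama.shape[:2]
    n = len(images)                         # N image areas, one per shooting area
    image_areas = [panorama[:, i * w // n:(i + 1) * w // n] for i in range(n)]
```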
The audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 901 for processing, or to the radio frequency circuit 904 to realize voice communication. For stereo acquisition or noise reduction purposes, there may be a plurality of microphones disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert an electrical signal into sound waves audible to humans, or into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.
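One conventional way an array microphone can favor sound arriving from a chosen spatial direction, in the spirit of the enhancement and suppression recited in the claims below, is delay-and-sum beamforming. The following is a minimal sketch; the array geometry, sampling rate, and function name are illustrative assumptions, not the specific processing of this application.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer an array toward `direction` (a 3-element unit vector).

    signals:       (num_mics, num_samples) time-domain samples
    mic_positions: (num_mics, 3) microphone coordinates in meters
    fs:            sampling rate in Hz; c: speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    delays = mic_positions @ direction / c      # per-mic time offsets, in seconds
    shifts = np.round(delays * fs).astype(int)  # nearest whole-sample shifts
    shifts -= shifts.min()                      # make every shift non-negative
    out = np.zeros(num_samples)
    for sig, s in zip(signals, shifts):
        out[:num_samples - s] += sig[s:]        # align, then sum coherently
    return out / num_mics
```

Sound from the target direction adds coherently and is reinforced, while sound from other directions adds incoherently and is attenuated.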
The positioning component 908 is used to locate the current geographic location of the terminal 900 to implement navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power supply 909 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established based on the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 901 can control the touch display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used to collect motion data of a game or of the user.
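Purely as a hedged sketch of the landscape/portrait decision just described (the function name, axis convention, and tie-breaking rule are illustrative assumptions, not the terminal's actual logic): gravity dominates whichever axis points most nearly downward, so comparing the magnitudes of two gravity components is enough.

```python
def choose_orientation(gx, gy):
    """Pick a UI orientation from the gravity components (m/s^2) along the
    terminal's x (short edge) and y (long edge) axes."""
    if abs(gy) >= abs(gx):                 # gravity mostly along the long edge
        return "portrait" if gy >= 0 else "portrait_upside_down"
    return "landscape_left" if gx > 0 else "landscape_right"
```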
The gyro sensor 912 can detect the body orientation and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 911 to capture the user's 3D motion on the terminal 900. From the data collected by the gyro sensor 912, the processor 901 can implement functions such as motion sensing (for example, changing the UI according to the user's tilting operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 913 may be disposed on the side bezel of the terminal 900 and/or beneath the touch display screen 905. When the pressure sensor 913 is disposed on the side bezel of the terminal 900, a holding signal of the user on the terminal 900 can be detected, and the processor 901 performs left/right-hand recognition or shortcut operations according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed beneath the touch display screen 905, the processor 901 controls operability controls on the UI according to the user's pressure operation on the touch display screen 905. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used to collect the user's fingerprint, and the processor 901 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 itself identifies the user's identity according to the collected fingerprint. Upon recognizing that the user's identity is trusted, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor logo.
The optical sensor 915 is used to collect the ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the touch display screen 905 according to the ambient light intensity collected by the optical sensor 915: when the ambient light intensity is high, the display brightness of the touch display screen 905 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 905 is decreased. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
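A minimal sketch of one such ambient-light-to-brightness mapping, assuming a logarithmic ramp; the function name and the lux endpoints are illustrative assumptions, not values from this application.

```python
import math

def backlight_level(lux, lo=10.0, hi=10000.0):
    """Map ambient illuminance (lux) to a backlight level in [0, 1].

    Perceived brightness is roughly logarithmic in illuminance, so the
    ramp is taken in log space between the `lo` and `hi` endpoints.
    """
    lux = min(max(lux, lo), hi)                   # clamp to the ramp's range
    return math.log(lux / lo) / math.log(hi / lo)
```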
The proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of the terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 is gradually decreasing, the processor 901 controls the touch display screen 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 is gradually increasing, the processor 901 controls the touch display screen 905 to switch from the screen-off state back to the screen-on state.
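As a hedged sketch of that screen-state control (the thresholds and names are illustrative assumptions), using hysteresis so that small distance fluctuations near the boundary do not toggle the screen:

```python
NEAR_CM, FAR_CM = 3.0, 5.0  # illustrative hysteresis thresholds, in centimeters

def update_screen(distance_cm, screen_on):
    """Return the new screen state given the measured user distance."""
    if screen_on and distance_cm < NEAR_CM:
        return False             # user approaching: switch to screen-off
    if not screen_on and distance_cm > FAR_CM:
        return True              # user receding: switch back to screen-on
    return screen_on             # within the hysteresis band: no change
```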
Those skilled in the art will appreciate that the structure shown in FIG. 9 does not constitute a limitation of the terminal 900; the terminal may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
It should be noted that when the speech enhancement apparatus provided in the above embodiment performs speech enhancement, the division into the above functional modules is merely an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the speech enhancement apparatus and speech enhancement method embodiments provided above belong to the same concept; for the specific implementation process, refer to the method embodiments, and details are not repeated here.
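To make that module division concrete, the following is a minimal sketch of how the two preset correspondences recited in the claims below could be wired together. The dictionaries, keyword sets, and the enhance/suppress helpers are hypothetical placeholders, not the mappings or processing of this application.

```python
# Hypothetical preset information and correspondences (placeholders).
AREA_KEYWORDS = {0: {"podium"}, 1: {"door"}, 2: {"window"}}   # image area -> preset info
AREA_TO_DIRECTION = {0: 90.0, 1: 210.0, 2: 330.0}             # first correspondence (degrees)
DIRECTION_TO_SPACE = {90.0: "space_a", 210.0: "space_b", 330.0: "space_c"}  # second correspondence

def enhance(signal):   # placeholder for voice enhancement processing
    return signal

def suppress(signal):  # placeholder for voice suppression processing
    return signal

def process(voice_content, spaces):
    """spaces: mapping of local-space name -> sound signal from that space."""
    words = set(voice_content.split())
    for area, keywords in AREA_KEYWORDS.items():
        if words & keywords:                        # keyword matches the area's preset info
            direction = AREA_TO_DIRECTION[area]     # target spatial direction
            target = DIRECTION_TO_SPACE[direction]  # target local space
            return {name: enhance(sig) if name == target else suppress(sig)
                    for name, sig in spaces.items()}
    return spaces                                   # no keyword match: signals unchanged
```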
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present application and is not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (6)

1. A speech enhancement method, wherein a video acquisition area comprises M different shooting areas, M being a positive integer greater than 1, the method comprising:
acquiring shot images corresponding to the M shooting areas respectively;
splicing the M shot images to obtain a target image, wherein the target image is a panoramic image and comprises N image areas, N is a positive integer greater than 1, and the N image areas are obtained by dividing the target image according to the number of shooting areas;
when a voice signal is received, analyzing the voice signal to obtain voice content, and when a keyword matched with preset information corresponding to a target image area in the N image areas exists in the voice content, determining the target image area corresponding to the voice signal;
determining a space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the target space direction is used for indicating the space direction needing voice enhancement processing, and the first preset corresponding relation comprises the corresponding relation between the image area and the space direction;
determining a target local space corresponding to the target space direction according to a second preset corresponding relation, wherein the second preset corresponding relation comprises a corresponding relation between the space direction and the local space;
carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
wherein the non-target local space is other space in the video acquisition area except the target local space.
2. A speech enhancement device, wherein a video acquisition area comprises M different shooting areas, M being a positive integer greater than 1, the device comprising:
the acquisition module is used for acquiring shot images corresponding to the M shooting areas respectively; splicing the M shot images to obtain a target image, wherein the target image is a panoramic image and comprises N image areas, N is a positive integer greater than 1, and the N image areas are obtained by dividing the target image according to the number of shooting areas;
the determining module is used for analyzing the voice signal to obtain voice content when the voice signal is received, and determining a target image area corresponding to the voice signal when a key word matched with preset information corresponding to the target image area in the N image areas exists in the voice content; determining a space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the target space direction is used for indicating the space direction needing voice enhancement processing, and the first preset corresponding relation comprises the corresponding relation between the image area and the space direction;
the enhancement module is used for determining a target local space corresponding to the target space direction according to a second preset corresponding relation, wherein the second preset corresponding relation comprises a corresponding relation between the space direction and the local space; carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
wherein the non-target local space is other space in the video acquisition area except the target local space.
3. A camera comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the speech enhancement method of claim 1.
4. A terminal, characterized in that the terminal comprises a processor and a memory, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by the processor to implement the speech enhancement method according to claim 1.
5. A speech enhancement system, wherein the system comprises a camera and a terminal, the camera being connected to the terminal and comprising at least three cameras and at least six microphones, and a video acquisition area comprises M different shooting areas, M being a positive integer greater than 1,
the camera is used for acquiring the shot images corresponding to the M shot areas respectively, splicing the M shot images to obtain a target image, and sending the target image to the terminal;
the terminal is used for receiving the target image, wherein the target image is a panoramic image and comprises N image areas, N is a positive integer greater than 1, and the N image areas are obtained by dividing the target image according to the number of shooting areas;
the terminal is further used for analyzing the voice signal to obtain voice content when the voice signal is received, and determining a target image area corresponding to the voice signal when a key word matched with preset information corresponding to the target image area in the N image areas exists in the voice content; determining a space direction corresponding to the target image area as a target space direction according to a first preset corresponding relation, wherein the target space direction is used for indicating the space direction needing voice enhancement processing, and the first preset corresponding relation comprises the corresponding relation between the image area and the space direction;
the terminal or the camera is configured to determine a target local space corresponding to the target space direction according to a second preset corresponding relationship, where the second preset corresponding relationship includes a corresponding relationship between the space direction and the local space; carrying out voice enhancement processing on the sound signals from the target local space and carrying out voice suppression processing on the sound signals from the non-target local space;
wherein the non-target local space is other space in the video acquisition area except the target local space.
6. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the speech enhancement method of claim 1.
CN201810185895.9A 2018-03-07 2018-03-07 Voice enhancement method and device Active CN110248197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810185895.9A CN110248197B (en) 2018-03-07 2018-03-07 Voice enhancement method and device


Publications (2)

Publication Number Publication Date
CN110248197A CN110248197A (en) 2019-09-17
CN110248197B 2021-10-22

Family

ID=67882419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810185895.9A Active CN110248197B (en) 2018-03-07 2018-03-07 Voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN110248197B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI708191B (en) * 2019-11-28 2020-10-21 睿捷國際股份有限公司 Sound source distribution visualization method and computer program product thereof
CN113450769A (en) * 2020-03-09 2021-09-28 杭州海康威视数字技术股份有限公司 Voice extraction method, device, equipment and storage medium
CN113542466A (en) * 2021-07-07 2021-10-22 Oppo广东移动通信有限公司 Audio processing method, electronic device and storage medium
CN116055869B (en) * 2022-05-30 2023-10-20 荣耀终端有限公司 Video processing method and terminal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4441879B2 (en) * 2005-06-28 2010-03-31 ソニー株式会社 Signal processing apparatus and method, program, and recording medium
CN105474665A (en) * 2014-03-31 2016-04-06 松下知识产权经营株式会社 Sound processing apparatus, sound processing system, and sound processing method
JP6135880B2 (en) * 2014-04-25 2017-05-31 パナソニックIpマネジメント株式会社 Audio processing method, audio processing system, and storage medium
US20160241818A1 (en) * 2015-02-18 2016-08-18 Honeywell International Inc. Automatic alerts for video surveillance systems
CN107230187B (en) * 2016-03-25 2022-05-24 北京三星通信技术研究有限公司 Method and device for processing multimedia information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant