WO2022059869A1 - Device and method for improving the sound quality of a video - Google Patents

Device and method for improving the sound quality of a video

Info

Publication number
WO2022059869A1
WO2022059869A1 (PCT/KR2021/002170)
Authority
WO
WIPO (PCT)
Prior art keywords
sound
image
unit
sound source
sound data
Prior art date
Application number
PCT/KR2021/002170
Other languages
English (en)
Korean (ko)
Inventor
카주크제이쿱
자르네키피오트르
그루지악그루지고르
카프카슬로보미르
Original Assignee
Samsung Electronics Co., Ltd. (삼성전자 주식회사)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2022059869A1

Classifications

    • H ELECTRICITY — H04 ELECTRIC COMMUNICATION TECHNIQUE — H04N PICTORIAL COMMUNICATION, e.g. TELEVISION — H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD] — H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N21/43072 Synchronising the rendering of multiple content streams or additional data on the same device
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G PHYSICS — G10 MUSICAL INSTRUMENTS; ACOUSTICS — G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters

Definitions

  • The present disclosure relates to a device and method for improving the sound quality of a video, and more particularly, to a device and method for improving the overall sound quality of a video by separating the sound into unit sound data for each sound source and individually adjusting the volume of each separated unit sound data.
  • Filming captures the world around you, and every modern mobile device equipped with a camera can record video. As mobile devices such as smartphones have become widespread, the number of people shooting and viewing videos has increased. Although the quality of video recorded on mobile devices has improved over time, most improvements have focused on the recorded visual image or the visual user experience; improvement of sound quality is hardly addressed.
  • Each viewer may watch a different video while moving on public transportation, in the office, or in the bathroom, and video is frequently watched on a mobile device.
  • A headset or earphones are generally used to focus on the video without disturbing the surroundings. Headsets and earphones support stereo sound, in which the sounds reproduced on the left and right differ from each other. Therefore, even sound recorded as mono audio through a single microphone needs to be converted into a stereo or other multi-channel format to improve sound quality.
  • Conventionally, to improve the sound quality of a video, a separate microphone such as a shotgun microphone or a lapel microphone is used, or the video is transferred to a device such as a computer after shooting and subjected to a separate manual post-processing operation such as noise removal.
  • Separate professional microphone equipment is expensive, and it is inconvenient to carry it every time you shoot.
  • A separate post-processing process for improving sound quality requires a video editing program and the professional knowledge to handle it, and it is difficult to edit a video directly on a mobile device with a small screen, such as a smartphone. Therefore, it is not easy for an ordinary user who wants to shoot and distribute video with a smartphone to improve its sound quality.
  • Accordingly, a technique is required for automatically improving, within the mobile device, the sound quality of a video captured through the camera and microphone included in the mobile device, without requiring separate sound equipment or a post-processing operation.
  • An embodiment of the present disclosure can provide a device and method that obtain a sound source image representing at least one sound source from the image of a video, separate the sound of the video into unit sound data according to whether the sound is generated from the same sound source, match each sound source image with the corresponding unit sound data, and adjust the loudness of each unit sound data, thereby adjusting the number of channels of the output sound regardless of the number of channels of the input sound and improving the sound quality of the output video.
  • According to an embodiment, a video is captured through an input unit included in the mobile device, and a processor included in the mobile device automatically performs sound processing on the captured video to improve its sound quality.
  • According to an embodiment of the present disclosure, a method for a device to improve the sound quality of a video includes: acquiring a video; obtaining a sound and an image from the acquired video; obtaining a sound source image representing at least one sound source from the obtained image; obtaining at least one unit sound data corresponding to the at least one sound source from the obtained sound; matching each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model; tracking the movement of the at least one sound source from the sound source image; and individually adjusting the loudness of the unit sound data according to the tracked movement of the sound source.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
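  • As a purely illustrative sketch of what such matching information could look like in code (the class names, frequency bands, and scoring rule below are assumptions for illustration, not taken from the disclosure), a minimal model in Python might map sound source classes to acoustic templates and score each unit sound against them:

      import numpy as np

      # Hypothetical matching model: per-class acoustic templates. The bands
      # below are illustrative assumptions, not values from the patent.
      MATCHING_MODEL = {
          "person": {"centroid_hz": (85.0, 4000.0)},   # rough speech band
          "dog":    {"centroid_hz": (300.0, 2000.0)},  # rough bark band
      }

      def match_score(source_class, unit_sound, sr=16000):
          """Score how well one unit sound fits the template of an image class."""
          spectrum = np.abs(np.fft.rfft(unit_sound))
          freqs = np.fft.rfftfreq(len(unit_sound), d=1.0 / sr)
          centroid = float((freqs * spectrum).sum() / (spectrum.sum() + 1e-12))
          lo, hi = MATCHING_MODEL[source_class]["centroid_hz"]
          return 1.0 if lo <= centroid <= hi else 0.0

      tone = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
      print(match_score("person", tone))  # 1.0: 200 Hz falls in the speech band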
  • According to an embodiment of the present disclosure, the device includes: an input unit for acquiring a video; an output unit for outputting an output video; a memory storing a program including one or more instructions; and at least one processor executing the one or more instructions stored in the memory.
  • The at least one processor acquires a video by controlling the input unit, obtains a sound and an image from the acquired video, obtains a sound source image representing at least one sound source from the obtained image, obtains at least one unit sound data corresponding to the at least one sound source from the obtained sound, matches each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model, tracks the movement of the at least one sound source from the sound source image, and may individually adjust the loudness of the unit sound data according to the tracked movement of the sound source.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • the computer-readable recording medium may store a program for executing at least one of the embodiments of the disclosed method in a computer.
  • FIG. 1 is a schematic diagram of a method for a device to improve sound quality of an image according to an embodiment of the present disclosure.
  • FIG. 2 is a block diagram of a device according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a method of improving the sound quality of an image according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method of improving the sound quality of an image according to an embodiment of the present disclosure.
  • FIG. 5 is a diagram for describing an operation in which a device acquires an additional sound through an auxiliary input unit according to an embodiment of the present disclosure.
  • FIG. 6 is a diagram for explaining an operation in which a device acquires at least one sound source image from an image according to an embodiment of the present disclosure.
  • FIG. 7 is a diagram for describing an operation in which a device acquires at least one unit sound data from a sound according to an embodiment of the present disclosure.
  • FIG. 8 is a diagram for explaining an operation in which a device separates a sound according to a sound source image and matches the separated unit sound data to each sound source image according to an embodiment of the present disclosure.
  • FIG. 9 is a view for explaining a specific embodiment of the operation of the device individually adjusting the volume of unit sound data according to the movement of the tracked sound source according to an embodiment of the present disclosure.
  • FIG. 10A is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.
  • FIG. 10B is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.
  • FIG. 10C is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.
  • FIG. 11 is a diagram illustrating an example in which a device individually adjusts a volume of unit sound data according to an embodiment of the present disclosure.
  • FIG. 12 is a diagram illustrating an example in which a device individually adjusts a volume of unit sound data according to a motion of a tracked sound source according to an embodiment of the present disclosure.
  • FIG. 13 is a diagram illustrating an example in which a device adjusts a volume of unit sound data according to a motion of a tracked sound source and obtains an output sound having multi-channels from the adjusted unit sound data according to an embodiment of the present disclosure.
  • FIG. 14 is a diagram illustrating an example in which a device acquires an additional sound through an auxiliary input unit and acquires an output sound having multiple channels according to an embodiment of the present disclosure.
  • According to an embodiment of the present disclosure, a method for a device to improve the sound quality of a video includes acquiring a video, obtaining a sound and an image from the acquired video, obtaining a sound source image representing at least one sound source from the obtained image, obtaining at least one unit sound data corresponding to the at least one sound source from the obtained sound, matching each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model, tracking the movement of the at least one sound source from the sound source image, and individually adjusting the volume (loudness) of the unit sound data according to the tracked movement of the sound source.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • Acquiring the video may include acquiring the video through an input unit included in the device, and the input unit may include a microphone for acquiring a sound and a camera for acquiring an image.
  • Acquiring the video may include acquiring the video through an input unit included in the device and an auxiliary input unit external to the device.
  • the input unit may include a microphone for acquiring a sound and a camera for acquiring an image.
  • the auxiliary input unit may include an auxiliary microphone for acquiring additional sound.
  • Acquiring the at least one unit sound data corresponding to the at least one sound source from the acquired sound may include dividing the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum. When two or more unit sound data have the same amplitude, frequency, phase, waveform, and spectrum, it may include separating the two or more unit sound data into respective unit sound data using the sound source image; a simplified sketch follows below.
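  • A minimal sketch of this feature-based splitting, assuming two sources that occupy distinct frequency bands (a deliberately simple stand-in for the joint amplitude/frequency/phase/waveform/spectrum analysis the disclosure describes):

      import numpy as np
      from scipy.signal import stft, istft

      sr = 16000
      t = np.arange(sr) / sr
      # Toy mixture: a 220 Hz source and a 1760 Hz source recorded together.
      mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)

      f, frames, Z = stft(mixture, fs=sr, nperseg=512)

      def extract_band(Z, f, lo, hi):
          """Keep only the STFT bins in [lo, hi] Hz and resynthesize."""
          mask = (f >= lo) & (f <= hi)
          _, x = istft(np.where(mask[:, None], Z, 0), fs=sr, nperseg=512)
          return x

      unit_low = extract_band(Z, f, 100, 500)     # first unit sound data
      unit_high = extract_band(Z, f, 1000, 2500)  # second unit sound data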
  • Matching each of the at least one sound source image and the at least one unit sound data by applying the preset sound-image matching model may include additionally using information obtained from the sound source image to match the at least one sound source image and the at least one unit sound data, respectively.
  • the step of tracking the motion of the at least one sound source from the sound source image may include tracking the movement of the corresponding sound source through a state change of the sound source image.
  • Tracking the motion of the at least one sound source from the sound source image may also include tracking the movement of the corresponding sound source through a state change of the sound source image while using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.
  • Individually adjusting the volume of the unit sound data according to the movement of the tracked sound source may include obtaining a volume curve over the total running time of each unit sound data, obtaining a volume correction curve containing the adjustment information to be applied to each unit sound data, and individually adjusting the volume of each unit sound data based on the volume correction curve.
  • the method may further include obtaining an output sound from unit sound data whose volume is individually adjusted, and obtaining an output image from the output sound and the image.
  • Obtaining the output sound from the unit sound data whose volume is individually adjusted may include rendering the unit sound data by classifying it into two or more channels, thereby obtaining an output sound having multiple channels.
  • a device for improving the sound quality of an image may be provided.
  • the device may include an input unit for acquiring an image, an output unit for outputting an output image, a memory storing a program including one or more instructions, and at least one processor executing one or more instructions stored in the memory.
  • The at least one processor acquires a video by controlling the input unit, obtains a sound and an image from the acquired video, obtains a sound source image representing at least one sound source from the obtained image, obtains at least one unit sound data corresponding to the at least one sound source from the acquired sound, matches each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model, and may track the movement of the at least one sound source from the sound source image and individually adjust the volume (loudness) of the unit sound data according to the tracked movement of the sound source.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • the input unit may include a microphone for acquiring a sound and a camera for acquiring an image.
  • the processor may execute one or more instructions to obtain additional sound through an auxiliary microphone external to the device.
  • The processor may execute the one or more instructions to separate the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum, and, when two or more unit sound data have the same amplitude, frequency, phase, waveform, and spectrum, to separate them into respective unit sound data using the sound source image, thereby obtaining the at least one unit sound data corresponding to the at least one sound source from the acquired sound.
  • The processor may execute the one or more instructions to match the at least one sound source image and the at least one unit sound data, respectively, by applying the preset sound-image matching model while additionally using information obtained from the sound source image.
  • the processor may execute one or more instructions to track the motion of the sound source through a change in the state of the sound source image.
  • The processor may execute the one or more instructions to track the movement of the corresponding sound source through a state change of the sound source image, using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.
  • The processor may execute the one or more instructions to obtain a volume curve over the total running time of each unit sound data, obtain a volume correction curve containing the adjustment information to be applied to each unit sound data, and individually adjust the volume of each unit sound data based on the volume correction curve, thereby individually adjusting the volume of the unit sound data according to the movement of the tracked sound source.
  • The processor may further execute the one or more instructions to obtain an output sound from the unit sound data whose volume is individually adjusted, and to obtain an output video from the output sound and the image.
  • the processor may execute one or more instructions to classify and render unit sound data into two or more channels, and obtain an output sound having multi-channels.
  • a computer-readable recording medium in which a program for executing any one method according to the present disclosure in a computer is recorded may be provided.
  • A processor configured to perform A, B, and C may refer to a dedicated processor (e.g., an embedded processor) for performing those operations, or to a general-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in memory.
  • 'video' means audiovisual material including an auditory sound and a visual screen.
  • The visual component of a video may be described as an 'image', 'visual data', or 'picture', and the auditory component of a video may be described as 'audio', 'sound', 'acoustic', or 'sound data'.
  • 'Sound quality' means the quality of sound. Sound quality may vary depending on various acoustic factors. For example, the amount of noise may be a criterion for sound quality, and sound quality may also vary according to the flatness of the frequency response and the consistency of the volume.
  • A 'sound source' means a source from which a sound is generated.
  • a sound source may be a person, an animal, various musical instruments, or any object if a sound is generated therefrom.
  • 'sound corresponding to a specific sound source' means a sound generated from the specific sound source.
  • A sound corresponding to a specific person may mean a voice made by that person.
  • a sound corresponding to a specific animal may mean a cry of the corresponding animal.
  • The 'screen range' means the area on a screen in which the visual image of a video is displayed.
  • the screen range may be an area defined by a border of an image captured from an image at a specific point in time.
  • 'Mono' (monaural, monophonic) audio means audio composed of one channel.
  • Mono audio is recorded through one microphone, and sound heard through one speaker may correspond to it. Even if sound is recorded or reproduced through multiple speakers, it may still be mono audio if it is carried on only one channel. In mono audio, the same sound is played from all connected speakers.
  • 'multi-channel' audio means audio composed of two or more channels.
  • In stereo audio, which is a type of multi-channel audio, the signals of two channels are reproduced together through two speakers (e.g., a headset or earphones); because different sounds are played from the two speakers, a sense of space and a richer sound can be reproduced compared to mono audio.
  • FIG. 1 is a schematic diagram of a method of improving the sound quality of an image by a device 1000 according to an embodiment of the present disclosure.
  • A video input to the device 1000 may include an image 110 and a sound 150.
  • the image 110 may be recorded through an input device such as a camera, and the sound 150 may be recorded through an input device such as a microphone.
  • the device 1000 may acquire sound source images SS1 , SS2 , and SS3 separated into individual sound sources from the recorded image 110 .
  • the sound source image may be divided into a speaker (person) (SS1, SS2) and a background (eg, a person who is not a speaker, desk, chair, paper, and background images) (SS3).
  • the device 1000 may acquire the unit sound data set 160 obtained by dividing the sound 150 by the sound generated by each sound source.
  • the device 1000 may classify the separated unit sound data set 160 as human voices (UA1, UA2) or background sound (noise) (UA3).
  • the device 1000 may match the classified unit sound data set 160 to the separated sound source images SS1, SS2, and SS3 by applying a preset sound-image matching model, respectively.
  • Unit sound data (UA1, UA2) classified as human voices may be matched to the sound source images (SS1, SS2) determined to be people, and the remaining sound (UA3) may be matched to the background image (SS3).
  • For example, the device 1000 may match the separated unit sound data UA1 and UA2 to the respective sound source images SS1 and SS2 through the 'person's face and the voice corresponding to that face' information included in the sound-image matching model.
  • the device 1000 may individually adjust the volume of each of the separated unit sound data UA1 , UA2 , and UA3 .
  • The volume of the unit sound data UA3 corresponding to noise can be reduced, and the volume of the unit sound data (UA1, UA2) corresponding to the speakers' conversation can be adjusted to a level corresponding to the input signal or to a preset level.
  • the device 1000 may resynthesize the output sound 170 from the adjusted unit sound data.
  • the output sound 170 may be synthesized in a stereo format, which is a type of multi-channel sound, for output to an output device such as headphones.
  • the output sound 170 may include a first channel 171 output to the left speaker LC and a second channel 173 output to the right speaker RC.
  • The device 1000 may determine the rendering channel of the unit sound data UA1 and UA2 corresponding to each of the sound source images SS1 and SS2 according to their relative positions within the screen range. For example, since the first sound source image SS1 is located on the left within the screen range, the first unit sound data UA1 corresponding to the first sound source image SS1 may be rendered on the first channel 171 output to the left speaker LC. Likewise, the second unit sound data UA2 corresponding to the second sound source image SS2, located on the right side within the screen range, may be rendered on the second channel 173 output to the right speaker RC. In an embodiment, the device 1000 may adjust the number of channels of the output sound 170 irrespective of the number of channels of the input sound 150 and may improve the sound quality of the video to deliver a sense of space and a rich sound; a sketch of this rendering follows below.
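  • A hedged sketch of this position-based channel rendering (the constant-power pan law is an assumption; the disclosure only specifies that left-positioned sources feed the left channel and right-positioned sources the right channel):

      import numpy as np

      def pan_by_position(unit_sound, x_norm):
          """Render one unit sound to two channels from its horizontal
          position in the frame (0.0 = left edge, 1.0 = right edge)."""
          theta = x_norm * (np.pi / 2)
          left = np.cos(theta) * unit_sound    # first channel 171 (left, LC)
          right = np.sin(theta) * unit_sound   # second channel 173 (right, RC)
          return np.stack([left, right])

      sr = 16000
      voice = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
      ss1_stereo = pan_by_position(voice, 0.2)  # SS1 sits left in the frame
      ss2_stereo = pan_by_position(voice, 0.8)  # SS2 sits right in the frame
      output_sound = ss1_stereo + ss2_stereo    # mono input, stereo output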
  • FIG. 2 is a block diagram of a device 1000 according to an embodiment of the present disclosure.
  • the device 1000 may include an input unit 1100 , a processor 1300 , a memory 1500 , an output unit 1700 , and a motion sensor 1900 . Not all of the components shown in FIG. 2 are essential components of the device 1000 .
  • the device 1000 may be implemented by more components than the components shown in FIG. 2 , or the device may be implemented by fewer components than the components shown in FIG. 2 .
  • the input unit 1100 may acquire an image from the outside.
  • The input unit 1100 may include a filming unit for acquiring a visual image and a recording unit for acquiring an auditory sound.
  • The filming unit may include a camera, and the recording unit may include a microphone (mic).
  • The input unit 1100 may have a single configuration that is not physically separated into a filming unit and a recording unit.
  • the output unit 1700 may output an output image to the outside.
  • the output unit 1700 may include a display 1710 and an audio output unit 1720 .
  • the display 1710 may output a visual image by externally displaying it.
  • the display 1710 may include a panel.
  • The display 1710 may be configured as at least one of, for example, a liquid crystal display, a thin-film-transistor liquid crystal display, an organic light-emitting diode display, a flexible display, a 3D display, and an electrophoretic display.
  • the audio output unit 1720 may reproduce and output an auditory sound to the outside.
  • the audio output unit 1720 may include a speaker.
  • The audio output unit 1720 may include, for example, a single speaker, two or more speakers, a mono speaker, a stereo speaker, a surround speaker, a headset, or earphones.
  • the display 1710 and the audio output unit 1720 of the output unit 1700 may have a single structure that is not physically separated.
  • the memory 1500 may store a program to be executed by the processor 1300 to be described later in order to control the operation of the device 1000 .
  • the memory 1500 may store a program including at least one instruction for controlling the operation of the device 1000 .
  • Instructions and program codes readable by the processor 1300 may be stored in the memory 1500 .
  • the processor 1300 may be implemented to execute instructions or codes of a program stored in the memory 1500 .
  • the memory 1500 may store data input to or output from the device 1000 .
  • The memory 1500 may include at least one type of storage medium among, for example, flash memory, a hard disk, a multimedia card micro type, card-type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk.
  • Programs stored in the memory 1500 may be classified into a plurality of modules according to their functions.
  • The memory 1500 may include an acoustic-image separation module 1510, a sound source image acquisition module 1520, a unit acoustic data acquisition module 1530, a matching module 1540, a sound source motion tracking module 1550, and a volume adjustment module 1560.
  • the memory 1500 may include an acoustic-image matching model 1570 , a deep neural network (DNN) 1580 , and a database 1590 .
  • the processor 1300 may control the overall operation of the device 1000 .
  • the processor 1300 executes programs stored in the memory 1500 , and thus an input unit 1100 , an output unit 1700 including a display 1710 , and an audio output unit 1720 , and a motion sensor 1900 . and overall control of the memory 1500 and the like.
  • the processor 1300 may be composed of hardware components that perform arithmetic, logic, input/output operations and signal processing.
  • The processor 1300 may be configured as at least one of, for example, a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field-programmable gate arrays (FPGAs), but is not limited thereto.
  • The processor 1300 may acquire a video through the input unit 1100 by executing at least one instruction stored in the memory 1500.
  • The video may include an image, which is visual data, and a sound, which is auditory data.
  • The processor 1300 may obtain a sound and an image from the acquired video by executing at least one instruction constituting the acoustic-image separation module 1510 among the programs stored in the memory 1500.
  • For example, a video composed of a single file may be divided into a sound file, which is auditory data, and an image file, which is visual data.
  • The processor 1300 may obtain a sound source image representing at least one sound source from the obtained image by executing at least one instruction constituting the sound source image acquisition module 1520 among the programs stored in the memory 1500.
  • An image may be composed of one continuous screen that is not divided. In such a continuous screen, each object such as a person, an animal, or a thing may be separated. Each of the separated objects may be a sound source that generates a sound.
  • at least one sound source image may be obtained from an image using a deep neural network (DNN) or a database in which image files are accumulated.
  • The processor 1300 may acquire unit sound data, determined according to whether the acquired sound is generated from the same sound source, by executing at least one instruction constituting the unit sound data acquisition module 1530 among the programs stored in the memory 1500.
  • the unit sound data acquisition module 1530 may separate an input sound composed of one channel into a plurality of channels of unit sound data generated from different sound sources.
  • the unit sound data acquisition module 1530 may separate sound into unit sound data using a deep neural network (DNN) or a database in which audio information is accumulated.
  • Information on the separated unit sound data may be transferred to and stored in the database 1590 stored in the memory 1500, and the database 1590 may be updated.
  • a model for separating sound data may be preset and stored in a database.
  • the processor 1300 executes at least one instruction constituting the matching module 1540 among programs stored in the memory 1500, thereby applying a preset sound-image matching model 1570 to at least one sound source image and at least one unit of sound data may be respectively matched.
  • the sound-image matching model 1570 may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • The matching information may include the characteristics of a sound according to a specific image, i.e., sound information according to the information of a sound source image (e.g., a barking sound corresponding to a dog image, a rustling sound corresponding to a tree image, or a specific person's voice information based on a face image when two or more people are talking), and information about a sound source image according to a specific sound (e.g., the mouth shape appearing in the image when a specific sound is produced).
  • the acoustic-image matching model 1570 may be established using a deep neural network (DNN) or a database in which image files and audio information are accumulated.
  • the individual sound source images separated from the image and the individual unit sound data separated from the sound may be matched based on the correspondence information between the sound and the image included in the sound-image matching model 1570 , respectively.
  • Information on each sound source image and the unit sound data matched according to the preset sound-image matching model 1570 may be transferred back to the sound-image matching model 1570 stored in the memory 1500 and stored there, updating the sound-image matching model 1570, or may be transferred to and stored in the database 1590, updating the database 1590.
  • the processor 1300 may track the motion of at least one sound source from the sound source image by executing at least one command constituting the sound source motion tracking module 1550 among programs stored in the memory 1500 .
  • the state or position of a specific sound source image may change over time.
  • The sound source motion tracking module 1550 may analyze the image (screen), including the moving direction and speed of a specific sound source and changes in the shape of its sound source image, and obtain a motion profile for that sound source.
  • the obtained motion profile may be used to individually adjust the volume of each unit acoustic data in a subsequent step.
  • The processor 1300 may individually adjust the volume of the unit sound data according to the movement of the tracked sound source by executing at least one instruction constituting the volume adjustment module 1560 among the programs stored in the memory 1500.
  • the volume of the unit sound data may be adjusted so that the output sound maintains a constant volume in the entire image. For example, in the case of recording a person speaking, as the sound source (the person speaking) moves relatively far from the device 1000 during recording, the volume of the input unit sound data decreases. In this case, based on the volume when the sound source is close to the device 1000 , the volume may be adjusted so that the unit sound data may have a constant volume in the entire image. In this way, based on the motion profile of the sound source obtained in the sound source motion tracking module 1550 previously, the volume of unit sound data corresponding to the sound source may be individually adjusted.
  • the processor 1300 executes at least one command constituting the volume control module 1560 among the programs stored in the memory 1500 to individually adjust the volume of unit sound data according to the type of sound source.
  • The type of sound source may also be considered. For example, when specific unit sound data is classified as noise, the volume of that unit sound data may be reduced, and the volume of other designated types of unit sound data may likewise be reduced.
  • a deep neural network (DNN) 1580 stored in the memory 1500 is a type of artificial neural network, and may have a feature of being composed of several hidden layers between an input layer and an output layer.
  • A deep neural network (DNN) 1580 can model complex nonlinear relationships, like a general artificial neural network. For example, in a deep neural network structure for an object identification model, each object may be expressed as a hierarchical composition of basic image elements, with the additional layers aggregating the features gathered by progressively lower layers. This property of the deep neural network (DNN) 1580 makes it possible to model complex data with fewer units.
  • the deep neural network (DNN) 1580 may be applied to image recognition or speech recognition fields, and may be used for processing to separate images and match the separated images with respective voice information as in the present disclosure.
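  • A minimal sketch of the layered structure described above, with randomly initialized weights standing in for a trained model (the layer sizes and activation are illustrative assumptions):

      import numpy as np

      rng = np.random.default_rng(0)
      layer_sizes = [64, 32, 16, 8]  # input, two hidden layers, output

      weights = [rng.normal(0.0, 0.1, (m, n))
                 for m, n in zip(layer_sizes, layer_sizes[1:])]

      def forward(x):
          """Hidden layers aggregate lower-level features; the output layer
          produces, e.g., object or speaker class scores."""
          for w in weights[:-1]:
              x = np.maximum(x @ w, 0.0)  # ReLU hidden layer
          return x @ weights[-1]

      scores = forward(rng.normal(size=64))  # untrained, for shape only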
  • the database 1590 stored in the memory 1500 may be configured as a set of a vast amount of data.
  • the database 1590 may include sound information corresponding to a specific sound source image and image information corresponding to a specific sound.
  • the database 1590 may be used to set the sound-image matching model 1570 by acquiring matching information indicating the correspondence between the sound and the image. Also, the database 1590 may be used to match unit sound data and sound source images, or to adjust the volume of each unit sound data.
  • the processor 1300 may obtain an output sound from the unit sound data whose volume is individually adjusted, and an output image from the output sound and the image, by executing at least one command stored in the memory 1500 .
  • final output sound data may be obtained by resynthesizing unit sound data whose volume is individually adjusted.
  • The processor 1300 may classify and render the unit sound data into two or more channels in order to obtain an output sound having multiple channels, such as a stereo format. For example, unit sound data corresponding to sound source images placed on the left side of the display screen may be rendered to a first channel, and unit sound data corresponding to sound source images placed on the right side may be rendered to a second channel.
  • an output sound in which the first channel is reproduced from the left speaker and the second channel is reproduced from the right speaker may be acquired.
  • the output sound may be not only in a stereo format, but also in a surround format, an Ambisonic format, or other multi-channel format.
  • the device 1000 may further include a motion sensor 1900 .
  • the motion sensor 1900 may include an accelerometer 1910 , a gyroscope 1920 , and a magnetometer 1930 .
  • the motion sensor 1900 may detect a motion of the device 1000 .
  • Using the detected motion of the device 1000, a motion profile of the sound source may be additionally obtained from the relative change of the sound source image on the screen. The additionally obtained sound source motion profile may be used to adjust the volume of the unit sound data matched to the corresponding sound source image.
  • In this way, the video acquired from the input unit 1100 included in the device 1000 may be separated into an image, which is visual data, and a sound, which is auditory data; a sound source image representing at least one sound source may be obtained from the image of the video; the sound of the video may be separated into unit sound data according to whether it is generated from the same sound source; and by matching each sound source image with its unit sound data and adjusting the loudness of each unit sound data, the sound quality of the output video can be improved.
  • The captured video can be immediately post-processed inside the device 1000 through the processor 1300 included in the device 1000. In this case, no separate audio equipment is required to improve the sound quality of the captured video, and the mobile device automatically performs sound post-processing even if the user has no professional video post-processing skills, so a video with improved sound quality can be obtained.
  • The processor 1300 may render the separated individual unit sound data into two or more different channels by executing the one or more instructions stored in the memory 1500. Accordingly, even when mono audio is recorded through a single microphone, the output video may have multi-channel stereo, surround, or ambisonic sound. In this way, the number of channels of the output sound can be adjusted irrespective of the number of channels of the input sound, and high-quality sound for a more immersive video can be obtained.
  • FIG. 3 is a flowchart of a method of improving the sound quality of an image according to an embodiment of the present disclosure.
  • an image may be acquired.
  • An image may mean an audiovisual expression drawn on a two-dimensional plane.
  • Here, the video means a moving picture.
  • the image may be acquired through an input unit including a microphone for acquiring a sound and a camera for acquiring an image.
  • A sound may be obtained from the video.
  • the sound may include a human voice, an animal sound, a sound generated from an object, noise, and the like.
  • the sound may be single-channel mono audio recorded from a single microphone or multi-channel audio recorded from a plurality of microphones.
  • An image may be obtained from the video.
  • the image may be visual data recorded from a camera.
  • the image may include sound source images of various sound sources.
  • In step S330, at least one unit sound data corresponding to at least one sound source may be obtained from the sound. For example, at least one unit sound data may be obtained, determined according to whether it is generated from the same sound source.
  • a sound composed of mono audio of a single channel may be divided into a plurality of unit sound data generated from different sound sources.
  • an image may be used when dividing a sound into a plurality of unit sound data.
  • a sound source image representing at least one sound source may be obtained from the image.
  • objects such as a person, an animal, an object, a background, etc., each of which can be a sound source for generating a sound may be separated.
  • In step S350, at least one sound source image and at least one unit sound data may each be matched by applying a preset sound-image matching model.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • the acoustic-image matching model may be preset through a deep neural network (DNN).
  • the sound and image matching information may include sound information according to information on a sound source image and information on a sound source image according to a specific sound.
  • the sound source images separated from the image and the unit sound data separated from the sound may each be matched one-to-one, many-to-one, or many-to-many based on an acoustic-image matching model.
  • In this matching, the movement of the sound source image and the change of the sound source image may be considered; an assignment sketch for the one-to-one case follows below.
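  • A hedged sketch of the one-to-one case as an assignment problem (the score matrix below is made up for illustration; in practice it would come from the sound-image matching model):

      import numpy as np
      from scipy.optimize import linear_sum_assignment

      # Rows: sound source images SS1..SS3; columns: unit sound data UA1..UA3.
      scores = np.array([
          [0.9, 0.1, 0.2],
          [0.2, 0.8, 0.1],
          [0.1, 0.3, 0.7],
      ])
      rows, cols = linear_sum_assignment(-scores)  # maximize total match score
      pairs = list(zip(rows, cols))                # [(0, 0), (1, 1), (2, 2)]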
  • the movement of at least one sound source may be tracked from the sound source image.
  • the motion may be tracked for each sound source image.
  • the motion profile of the sound source may be used to individually adjust the volume of each unit sound data in a subsequent step.
  • the movement of the sound source may be calculated and tracked through a change in the screen of the sound source image.
  • The motion of the sound source may also be calculated and tracked through the relative change of the sound source image on the screen, using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.
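  • A minimal sketch of this sensor-aided tracking under stated assumptions: the on-screen track of a sound source image mixes true source motion with camera motion, and the displacement predicted from the device's motion sensors (assumed already converted to pixels here) is subtracted out:

      import numpy as np

      # Per-frame pixel positions (x, y) of one sound source image.
      raw_track = np.array([[0, 0], [5, 1], [11, 2], [18, 3]], float)
      # Per-frame displacement of the device, from accelerometer/gyroscope/
      # magnetometer readings, assumed already projected to pixels.
      camera_shift = np.array([[0, 0], [3, 0], [6, 1], [9, 1]], float)

      source_motion = np.diff(raw_track - camera_shift, axis=0)
      speed = np.linalg.norm(source_motion, axis=1)  # feeds the motion profile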
  • In step S370, the volume (loudness) of the unit sound data may be individually adjusted according to the movement of the tracked sound source.
  • the volume of each unit sound data separated from the sound may be decreased or increased, or may be adjusted to have a constant volume in the entire image.
  • an output sound may be obtained from unit sound data whose volume is individually adjusted.
  • final output sound data may be obtained by resynthesizing unit sound data whose volume is individually adjusted.
  • unit sound data may be classified into two or more channels and rendered, and output sound having a multi-channel such as a stereo format may be obtained.
  • The output sound may be resynthesized with the volume of unit sound data corresponding to noise reduced, or resynthesized to have multiple channels for a richer sound, so the sound quality may be improved.
  • an output image may be obtained from the output sound and image.
  • The output video has the same image (screen) but includes the adjusted unit sound data, so its sound quality can be improved.
  • FIG. 4 is a flowchart of a method of improving the sound quality of an image according to an embodiment of the present disclosure.
  • In step S400, a video including a sound and an image may be obtained, and in steps S410 and S420, the sound and the image may be obtained by separating them from the video.
  • In step S430, a sound source image indicating at least one sound source may be obtained from the image. For example, from an image composed of continuous visual data, objects such as a person, an animal, a thing, or a background, each of which can be a sound source generating a sound, may be separated.
  • In step S440, at least one sound source image and a part of the sound may be matched by applying a preset sound-image matching model.
  • the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.
  • In step S450, at least one unit sound data corresponding to at least one sound source may be obtained from the sound and the separated sound source images. For example, at least one unit sound data may be obtained, determined according to whether it is generated from the same sound source.
  • When the sound is divided into unit sound data, the sound source image may be used: by referring to the previously separated sound source images, it is possible to determine in advance which sound sources exist and to separate the sound by sound source first. For example, the portion of the sound matched to each sound source image may be separated into its own unit sound data.
  • the operation of preferentially separating the sound source image from the image and separating the unit sound data from the sound using the matched sound source image may be useful when there is a sound source that does not appear in the image of the video. For example, after separating sounds corresponding to each separated sound source image into respective unit sound data, unit sound data corresponding to a sound source not appearing in the image of the video may be obtained from the remaining sound data.
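  • A toy sketch of this residual step (the signals are synthesized here for illustration): once the unit sounds matched to on-screen sound source images are separated, whatever remains of the mixture is kept as unit sound data for a source that never appears in the image:

      import numpy as np

      sr = 16000
      t = np.arange(sr) / sr
      on_screen = [np.sin(2 * np.pi * 220 * t),        # matched to SS1
                   np.sin(2 * np.pi * 330 * t)]        # matched to SS2
      off_screen = 0.2 * np.sin(2 * np.pi * 50 * t)    # never visible on screen
      mixture = sum(on_screen) + off_screen

      residual = mixture - sum(on_screen)  # unit sound of the unseen source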
  • In step S460, the movement of at least one sound source may be tracked from the sound source image of that sound source.
  • the motion of the sound source may be tracked for each sound source image.
  • the motion profile of the sound source may be used to individually adjust the volume of each unit sound data in a subsequent step.
  • In step S470, the loudness of the unit sound data may be individually adjusted according to the tracked movement of the sound source, and in step S480, an output sound may be obtained from the unit sound data whose volume is individually adjusted.
  • the sound quality of the output sound may be improved compared to the initially input sound.
  • FIG. 5 is a diagram for explaining an operation in which the device 1000 acquires an additional sound through the auxiliary input unit 2100 according to an embodiment of the present disclosure.
  • the device 1000 may itself include an input unit including a microphone for acquiring a sound and a camera for acquiring an image.
  • the device 1000 may acquire an additional sound through the auxiliary input unit 2100 external to the device 1000 .
  • the auxiliary input unit 2100 may include an auxiliary microphone such as a lapel microphone.
  • The sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 may be sounds generated from the same sound source SS; in another embodiment, they may be sounds generated from different sound sources and constitute a multi-channel sound. When the two sounds are generated from the same sound source SS, the sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 can differ only in volume or signal-to-noise ratio.
  • the sound input through the auxiliary input unit 2100 may be transmitted to the device 1000 and used for post-processing of an image together with the sound acquired by the device 1000 .
  • the sound input through the auxiliary input unit 2100 may be used to remove acoustic noise of an image acquired by the device 1000 .
  • the sound acquired through the auxiliary input unit 2100 may support the sound acquired by the device 1000 itself, and does not completely replace the sound acquired by the device 1000 .
  • When the sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 constitute multi-channel sound as sounds of different channels, each sound may be post-processed together or independently to obtain an output sound in a new mono-channel or multi-channel format; a sketch of one preparatory step follows below.
  • FIG. 6 is a diagram for explaining an operation in which the device 1000 acquires at least one sound source image from the image 610 according to an embodiment of the present disclosure.
  • the device 1000 may obtain a sound source image representing at least one sound source from the image 610 .
  • an image 610 of an acquired image may be composed of continuous visual data. From the continuous visual data image, it is possible to separate objects, such as people, animals, objects, and backgrounds, which can each be a sound source for generating a sound.
  • The image may be analyzed through deep learning, an artificial intelligence technique based on deep neural network (DNN) technology, in which case high accuracy and recognition of various objects are possible.
  • Image recognition technology of artificial intelligence (AI) classifies images into several patterns and learns pattern-type data to determine what an image is when a new image is given.
  • Through a deep neural network (DNN) or artificial intelligence (AI), the device 1000 may separate, in the image 610, the human images (H1, H2, H3, H4, H5, H6) and the dog images (D1, D2).
  • the separated human images H1, H2, H3, H4, H5, and H6 and dog images D1 and D2 may be sound source images, respectively.
  • the device 1000 may separate and obtain at least one sound source image from the image 610 .
  • FIG. 7 is a diagram for describing an operation in which the device 1000 acquires at least one unit sound data set 760 from the sound 750 according to an embodiment of the present disclosure.
  • The unit sound data 760 may be determined according to whether it is generated from the same sound source. Sound has three components: intensity, timbre, and pitch, which correspond respectively to the amplitude, waveform, and frequency of the sound wave. The larger the amplitude of the wave, the louder the sound; the higher the frequency of the wave, the higher the pitch. The timbre of a sound is determined by its waveform, which is why a piano, a human voice, and a violin sound different even on the same note.
  • an envelope may be considered when distinguishing sounds.
  • The envelope is the change of a sound over time: the time for a note to reach its peak (attack), the time for the sound to settle (decay), the duration for which the sound is held (sustain), and the time until the sound disappears (release).
  • the envelope can vary depending on how the sound source generates sound. In an embodiment, whether the sound is generated from the same sound source may be determined according to three elements and an envelope of the sound.
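  • A short sketch of these cues (the Hilbert-transform envelope is one common estimate; the disclosure does not fix a method): amplitude, frequency, and the time to the envelope peak (attack) are extracted from a synthetic note:

      import numpy as np
      from scipy.signal import hilbert

      sr = 16000
      t = np.arange(sr) / sr
      amp = np.minimum(t * 20, 1.0) * np.exp(-3 * t)  # fast attack, slow decay
      note = np.sin(2 * np.pi * 440 * t) * amp        # a 440 Hz test tone

      envelope = np.abs(hilbert(note))           # instantaneous amplitude
      peak_amplitude = envelope.max()            # loudness cue
      attack_samples = int(np.argmax(envelope))  # samples until the peak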
  • the operation of separating and obtaining the unit sound data set 760 from the sound 750 may comprise separating the sound 750 into at least one unit sound data 761, 762, 763, and 764 according to amplitude, frequency, phase, waveform, and spectrum.
  • for example, when the sound 750 is a mixture of four instruments, the sounds of the four instruments may be divided so that four unit sound data 761, 762, 763, and 764 are acquired.
  • the sound source image may be used to separate each unit sound data.
  • for example, unit sound data corresponding to each split screen may be separated by referring to the shape of the person's mouth on each split screen.
  • that is, image data may additionally be used for the sound separation.
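  • For illustration, separating a mixture into unit sound data can be sketched with non-negative matrix factorization on the magnitude spectrogram (a generic stand-in; the present disclosure does not mandate NMF, and the four-component setting mirrors the four-instrument example above):

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

def separate_unit_sounds(x, sr, n_sources=4):
    """Split a mono mixture into n_sources unit sound data via spectrogram NMF."""
    f, t, Z = stft(x, fs=sr, nperseg=1024)
    mag, phase = np.abs(Z), np.angle(Z)
    model = NMF(n_components=n_sources, init="nndsvd", max_iter=400)
    W = model.fit_transform(mag)            # spectral templates, (freq, k)
    H = model.components_                   # activations, (k, time)
    recon = W @ H + 1e-9
    sources = []
    for k in range(n_sources):
        mask = np.outer(W[:, k], H[k]) / recon          # soft mask per component
        _, xk = istft(mask * mag * np.exp(1j * phase), fs=sr, nperseg=1024)
        sources.append(xk)
    return sources
```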
  • FIG. 8 is a diagram for explaining an operation in which the device 1000 according to an embodiment of the present disclosure separates the sound 850 according to the sound source images 821 and 822 and matches the separated unit sound data 861 and 862 to the respective sound source images 821 and 822.
  • a video may include an image 810, containing the sound source images 821 and 822 separated into individual sound sources, and a sound 850.
  • the sound 850 may include a synthesis of sounds A1 and A2 generated from respective sound source images 821 and 822 .
  • the sound 850 may be divided into unit sound data 861 and 862 by applying a sound-image matching model including voice information corresponding to each sound source image 821 and 822 .
  • the separated unit sound data 861 and 862 may be matched with respective sound source images 821 and 822 according to voice information.
  • the operation of matching each sound source image and unit sound data may additionally use information obtained from the sound source image. For example, when the same person is speaking on the split screen at the same time, unit sound data may be matched to the split screen by referring to the shape of the person's mouth on each split screen. For example, when two or more instruments of the same type exist, unit sound data may be matched for each instrument with reference to a hand shape of a person playing the instrument.
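  • An illustrative heuristic for this matching (a sketch only, not the sound-image matching model itself): correlate the visual activity of each sound source image, e.g., mouth openness per frame, with the loudness envelope of each unit sound and take the best match; both arrays are assumed inputs sampled on a common frame grid.

```python
import numpy as np

def match_sources_to_sounds(visual_activity, sound_envelopes):
    """Greedy matching of sound source images to unit sound data.

    visual_activity: (n_images, n_frames), e.g., mouth openness per split screen
    sound_envelopes: (n_sounds, n_frames), loudness envelope per unit sound
    Returns, for each sound source image, the index of the best-matching unit sound.
    """
    matches = []
    for v in visual_activity:
        corr = [np.corrcoef(v, e)[0, 1] for e in sound_envelopes]
        matches.append(int(np.argmax(corr)))
    return matches
```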
  • the unit sound data 862 corresponding to the sound source image 822 may be muted.
  • the unit sound data 861 corresponding to the sound source image 821 may be muted.
  • FIG. 9 is a view for explaining a specific embodiment of the operation (S370) in which the device 1000 according to an embodiment of the present disclosure individually adjusts the volume of unit sound data according to the motion of the tracked sound source.
  • in step S910, a volume curve over the total execution time of each unit sound data may be obtained.
  • the level of the sensed volume for each unit sound data may be calculated over time.
  • next, a volume correction curve, including the adjustment to be performed for each unit sound data, may be obtained.
  • the volume correction curve may include information on whether to decrease or increase the volume of the unit sound data at a specific time within the entire execution time of the image. For example, when it is desired to keep the volume of a sound constant within the entire execution time of an image, the volume correction curve may be calculated as a difference between the volume curve and a preset output volume value.
  • in step S930, the volume of each unit sound data may be individually adjusted over time based on the volume correction curve.
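  • A minimal sketch of the volume curve and the volume correction curve under these assumptions (frame-wise RMS as the sensed volume level; the preset output volume target_db is hypothetical):

```python
import numpy as np

def volume_curve(x, sr, frame_ms=50.0):
    """Per-frame volume (RMS in dB) over the execution time of one unit sound."""
    n = max(1, int(sr * frame_ms / 1000))
    frames = x[: len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    return 20 * np.log10(rms)

def correction_curve(vol_db, target_db=-20.0):
    """Gain (dB) per frame: the difference between the preset output volume and the volume curve."""
    return target_db - vol_db
```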
  • FIGS. 10A, 10B, and 10C are diagrams illustrating an example in which the device 1000 acquires multi-channel output sound according to an embodiment of the present disclosure.
  • the device 1000 may capture an image including two sound sources SS101 and SS102.
  • the recorded input sound is mono audio; when it is reproduced in that state without post-processing, the input sounds IA101 and IA102 generated from the two sound sources may be played simultaneously through both the left channel LC and the right channel RC.
  • the two sound sources SS101 and SS102 may be recognized as being in the same place.
  • the user cannot recognize the directions of the two sound sources SS101 and SS102.
  • the device 1000 may separate the sound into unit sound data corresponding to each of the sound sources SS101 and SS102.
  • the separated unit sound data may be rendered as a left channel (LC) or a right channel (RC) according to the position on the screen of each sound source image.
  • unit sound data corresponding to the sound source SS101 located on the left side of the screen is output through the left channel (LC), and unit sound data corresponding to the sound source SS102 located on the right side of the screen is output through the right channel (RC).
  • the output sound may be implemented as multi-channel audio having two channels LC and RC.
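  • As an illustration of this rendering, a constant-power panning sketch (the pan law is an assumption; the disclosure only specifies that the channel is chosen according to the on-screen position of the sound source image):

```python
import numpy as np

def pan_by_screen_position(x, x_norm):
    """Render mono unit sound data to stereo from its horizontal screen position.

    x_norm: 0.0 = left edge of the screen, 1.0 = right edge.
    A constant-power pan law keeps the perceived loudness even across positions.
    """
    theta = x_norm * np.pi / 2
    return np.stack([np.cos(theta) * x,    # left channel (LC)
                     np.sin(theta) * x])   # right channel (RC)
```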
  • FIG. 11 is a diagram illustrating an example in which the device 1000 individually adjusts the volume of unit sound data according to an embodiment of the present disclosure.
  • a person holding the device 1000 and taking a picture may be the sound source SS111 that directly generates a sound.
  • a person holding the device 1000 and taking a picture may generate a sound even though the person does not appear on the screen.
  • even in this case, the sound-image matching model may be used.
  • the unit sound data A2 corresponding to the other sound source SS112 displayed on the screen is separated first, and the remaining sound data can be determined as the unit sound data A1 corresponding to the sound source SS111.
  • the unit sound data A1 generated by the sound source SS111 located close to the device 1000 may have a louder volume than the unit sound data A2 generated by the sound source SS112 located far from the device 1000.
  • to improve the sound quality of the video, the device 1000 may adjust the volumes of A1 and A2 to the same level by decreasing the volume of the unit sound data A1 and increasing the volume of the unit sound data A2.
  • when the volumes of the unit sound data A1 and A2 are adjusted to the same level, the overall sound volume of the video may be kept constant, so the sound quality of the video may be improved.
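  • A sketch of this per-source level alignment (assuming A1 and A2 are available as separated mono arrays; the target RMS value is hypothetical):

```python
import numpy as np

def equalize_levels(a1, a2, target_rms=0.1):
    """Scale the near and far unit sound data so both reach the same RMS level."""
    def gain(x):
        return target_rms / (np.sqrt((x ** 2).mean()) + 1e-12)
    return a1 * gain(a1), a2 * gain(a2)
```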
  • FIG. 12 is a diagram illustrating an example in which the device 1000 individually adjusts the volume of unit sound data according to the motion of the tracked sound source according to an embodiment of the present disclosure.
  • the subject (sound source) SS120 being photographed by the device 1000 may be moving while generating a sound.
  • the subject may have an initial position SS120i and a final position SS120f.
  • the subject may move in a direction away from the device 1000 .
  • the initial position SS120i of the subject may be relatively close to the device 1000
  • the final position SS120f of the subject may be relatively far from the device 1000 .
  • the volume of the initial input sound Ai generated at the initial position SS120i of the subject may be high, and the volume may decrease as the sound source moves away from the device 1000.
  • the volume of the final input sound Af generated at the final location SS120f of the subject may be relatively low.
  • to improve the sound quality of the video, the device 1000 may obtain a volume correction curve including information on adjusting the volume over time, such as decreasing the volume of the initial input sound Ai and increasing the volume of the final input sound Af.
  • the device 1000 may adjust the volume of the sound by using the obtained volume correction curve, and may maintain the volume of the output sound at the same level within the entire execution time of the image.
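  • A sketch of applying such a volume correction curve over time (assuming a frame-level gain curve as in the earlier sketch; interpolating to per-sample gains avoids audible level steps):

```python
import numpy as np

def apply_correction(x, gain_db, sr, frame_ms=50.0):
    """Apply a frame-level volume correction curve (dB) to a signal."""
    n = int(sr * frame_ms / 1000)
    frame_centers = (np.arange(len(gain_db)) + 0.5) * n     # in samples
    per_sample_db = np.interp(np.arange(len(x)), frame_centers, gain_db)
    return x * 10 ** (per_sample_db / 20)
```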
  • FIG. 13 is a diagram illustrating an example in which the device 1000 according to an embodiment of the present disclosure adjusts the volume of unit sound data according to the motion of the tracked sound source and obtains a multi-channel output sound from the adjusted unit sound data.
  • the subject (sound source) SS130 being photographed by the device 1000 may move relative to the device 1000 while generating a sound.
  • the subject may initially be located far from the device 1000 on the right side, and may move toward the left side, closer to the device 1000, toward the final time Tf.
  • the initial position SS130i of the subject may be relatively far from the device 1000
  • the final position SS130f of the subject may be relatively close to the device 1000 .
  • the volume of the initial input sound Ai generated from the initial position SS130i may be low, and the volume increases as the sound source SS130 approaches the device 1000. Referring to FIG. 13(c), at the final time Tf, the volume of the final input sound Af generated from the final position SS130f may be relatively high.
  • to improve the sound quality of the video, the device 1000 may obtain a volume correction curve including information on adjusting the volume over time, such as increasing the volume of the initial input sound Ai and decreasing the volume of the final input sound Af.
  • the device 1000 may adjust the volume of the sound by using the obtained volume correction curve, and may maintain the volume of the output sound at the same level within the entire execution time of the image.
  • the device 1000 may render the output sound as multi-channel audio according to the location of the sound source.
  • near the initial time Ti, when the sound source SS130i is located on the right side of the screen, the output may be rendered by increasing the volume of the right channel RCi. Referring to FIG. 13B, at the initial time Ti, the volume of the right channel RCi may be adjusted to be high and the volume of the left channel LCi to be low.
  • near the final time Tf, when the sound source SS130f is located on the left side of the screen, the output may be rendered by reducing the volume of the right channel RCf.
  • in the output sound, the volume of the right channel RCf may be adjusted to be low and the volume of the left channel LCf to be high.
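  • Combining the two ideas above, a sketch of rendering a tracked, moving source to stereo (the tracked positions track_t and track_x are assumed inputs from the motion tracking; the constant-power pan law is again an assumption):

```python
import numpy as np

def pan_tracked_source(x, track_t, track_x, sr):
    """Stereo-render a moving source from tracked horizontal screen positions.

    track_t: times (s) of the tracked positions; track_x: positions in [0, 1],
    where 0 is the left edge of the screen and 1 is the right edge.
    """
    t = np.arange(len(x)) / sr
    pos = np.interp(t, track_t, track_x)   # per-sample screen position
    theta = pos * np.pi / 2                # constant-power pan law
    return np.stack([np.cos(theta) * x,    # left channel
                     np.sin(theta) * x])   # right channel
```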
  • FIG. 14 is a diagram illustrating an example in which the device 1000 acquires an additional sound through the auxiliary input unit 2200 and obtains a multi-channel output sound according to an embodiment of the present disclosure.
  • the device 1000 may acquire a sound including the sound A1 directly acquired through the input unit and the sound A2 acquired through the auxiliary input unit 2200 external to the device 1000 .
  • the auxiliary input unit 2200 may be, for example, a wearable device including a microphone.
  • the sound A1 directly acquired from the input unit of the device 1000 may have a low volume and a low signal-to-noise ratio.
  • since the auxiliary input unit 2200 is always located close to the sound source SS140, the sound A2 obtained through the auxiliary input unit 2200 may have a loud, clear volume and a high signal-to-noise ratio.
  • Signal-to-Noise Ratio is the ratio of signal strength to noise strength.
  • in terms of the signal-to-noise ratio, the signal may mean valid acoustic data; a higher signal-to-noise ratio means less noise.
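  • As a small worked example of this definition (treating the valid acoustic data as the signal and the residual as noise; both arrays are assumed inputs):

```python
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB; a higher value means less noise."""
    noise = noisy - clean
    p_signal = (clean ** 2).mean()
    p_noise = (noise ** 2).mean() + 1e-12
    return 10 * np.log10(p_signal / p_noise)
```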
  • the device 1000 may use the sound A2 acquired from the auxiliary input unit 2200 to reduce the noise of the sound and to adjust the volume of the output sound to a preset level.
  • the device 1000 may render the output sound as multi-channel audio according to the location of the sound source SS140.
  • the sound source SS140 may be located on the right side of the screen.
  • the output sound may be rendered by adjusting the volume of the left channel LC to be low and the volume of the right channel RC to be high.
  • the sound can be processed regardless of the number of channels of the input sound.
  • the sound quality may be improved by matching the separated sound source images with the separated unit sound data, respectively, and adjusting the loudness of each unit sound data.
  • according to an embodiment of the present disclosure, an image is captured through an input unit included in the mobile device, and a processor included in the mobile device automatically performs sound processing on the captured image, so that no separate sound equipment is required and the user does not need to manually perform post-processing operations.
  • Computer-readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, and removable and non-removable media. Computer-readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically includes computer-readable instructions, data structures, program modules, or other data in a modulated data signal.
  • the computer-readable storage medium may be provided in the form of a non-transitory storage medium.
  • 'non-transitory storage medium' only means that the storage medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); this term does not distinguish between a case in which data is semi-permanently stored in the storage medium and a case in which data is temporarily stored therein.
  • the 'non-transitory storage medium' may include a buffer in which data is temporarily stored.
  • the method according to various embodiments disclosed in this document may be included and provided in a computer program product.
  • Computer program products may be traded between sellers and buyers as commodities.
  • the computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read-only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) directly or online between two user devices (e.g., smartphones) through an application store (e.g., Play Store™).
  • in the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be temporarily stored or temporarily created in a machine-readable storage medium such as a memory of a manufacturer's server, a server of an application store, or a relay server.
  • a 'unit' may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.
  • the processor may consist of one or a plurality of processors.
  • the one or more processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-dedicated processor such as a GPU or a vision processing unit (VPU), or an artificial-intelligence-dedicated processor such as an NPU.
  • the one or more processors control the processing of input data according to a predefined operation rule or an artificial intelligence model stored in the memory.
  • the AI-only processor may be designed with a hardware structure specialized for processing a specific AI model.
  • a predefined action rule or artificial intelligence model is characterized in that it is created through learning.
  • being made through learning means that a basic artificial intelligence model is trained using a plurality of training data by a learning algorithm, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created.
  • Such learning may be performed in the device itself on which artificial intelligence according to the present disclosure is performed, or may be performed through a separate server and/or system.
  • Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • the artificial intelligence model may be composed of a plurality of neural network layers.
  • Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between the operation result of a previous layer and the plurality of weights.
  • the plurality of weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, a plurality of weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the learning process is reduced or minimized.
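  • A minimal sketch of such a weight update (generic gradient descent in PyTorch; the model, data, and loss here are placeholders, not the disclosure's artificial intelligence model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 16), torch.randn(8, 1)
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # loss/cost value obtained from the model
    loss.backward()               # gradients w.r.t. the plurality of weights
    opt.step()                    # weights updated so the loss is reduced
```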
  • the artificial neural network may include a deep neural network (DNN), for example, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), or Deep Q-Networks, but is not limited thereto.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • User Interface Of Digital Computer (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a device and method for improving the sound quality of a video. A method by which a device improves the sound quality of a video may comprise the steps of: acquiring a video; acquiring a sound and images from the acquired video; acquiring a sound source image indicating at least one sound source from the acquired images; acquiring at least one piece of unit sound data corresponding to the at least one sound source from the acquired sound; matching the at least one sound source image with the at least one piece of unit sound data by applying a predefined sound-image matching model; tracking the motion of the at least one sound source from the sound source image; and individually adjusting the loudness of the unit sound data according to the tracked motion of the sound source.
PCT/KR2021/002170 2020-09-15 2021-02-22 Dispositif et procédé pour améliorer la qualité sonore d'une vidéo WO2022059869A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0118500 2020-09-15
KR1020200118500A KR20220036210A (ko) 2020-09-15 2020-09-15 영상의 음질을 향상시키는 디바이스 및 방법

Publications (1)

Publication Number Publication Date
WO2022059869A1 true WO2022059869A1 (fr) 2022-03-24

Family

ID=80776906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/002170 WO2022059869A1 (fr) 2020-09-15 2021-02-22 Dispositif et procédé pour améliorer la qualité sonore d'une vidéo

Country Status (2)

Country Link
KR (1) KR20220036210A (fr)
WO (1) WO2022059869A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0772875A (ja) * 1993-09-02 1995-03-17 Sega Enterp Ltd 画像及び音声処理装置
WO2006120829A1 (fr) * 2005-05-13 2006-11-16 Matsushita Electric Industrial Co., Ltd. Dispositif de separation de son melange
KR101561843B1 (ko) * 2014-05-13 2015-10-20 (주) 로임시스템 집음영역별 반향 제거를 위한 오디오 시스템
KR20180050652A (ko) * 2015-07-24 2018-05-15 사운드오브젝트 테크놀로지스 에스.에이 음향 신호를 사운드 객체들로 분해하는 방법 및 시스템, 사운드 객체 및 그 사용
US20200288256A1 (en) * 2019-03-08 2020-09-10 Lg Electronics Inc. Method and apparatus for sound object following

Also Published As

Publication number Publication date
KR20220036210A (ko) 2022-03-22

Similar Documents

Publication Publication Date Title
WO2011115430A2 (fr) Procédé et appareil de reproduction sonore en trois dimensions
WO2016117836A1 (fr) Appareil et procédé de correction de contenu
WO2014092509A1 (fr) Appareil à lunettes et procédé de commande d'appareil à lunettes, appareil audio et procédé pour fournir un signal audio, et appareil d'affichage
WO2019139301A1 (fr) Dispositif électronique et procédé d'expression de sous-titres de celui-ci
WO2018056624A1 (fr) Dispositif électronique et procédé de commande associé
WO2020231230A1 (fr) Procédé et appareil pour effectuer une reconnaissance de parole avec réveil sur la voix
WO2013019022A2 (fr) Procédé et appareil conçus pour le traitement d'un signal audio
WO2020017798A1 (fr) Procédé et système de synthèse musicale à l'aide de motifs/textes dessinés à la main sur des surfaces numériques et non numériques
WO2010087630A2 (fr) Procédé et appareil pour décoder un signal audio
WO2019112342A1 (fr) Appareil de reconnaissance vocale et son procédé de fonctionnement
WO2019125029A1 (fr) Dispositif électronique permettant d'afficher un objet dans le cadre de la réalité augmentée et son procédé de fonctionnement
WO2019124963A1 (fr) Dispositif et procédé de reconnaissance vocale
WO2021060680A1 (fr) Procédés et systèmes d'enregistrement de signal audio mélangé et de reproduction de contenu audio directionnel
WO2016089047A1 (fr) Procédé et dispositif de distribution de contenu
WO2014061931A1 (fr) Dispositif et procédé de lecture de son
WO2018012727A1 (fr) Appareil d'affichage et support d'enregistrement
WO2019190142A1 (fr) Procédé et dispositif de traitement d'image
WO2021060575A1 (fr) Serveur à intelligence artificielle et procédé de fonctionnement associé
EP3707678A1 (fr) Procédé et dispositif de traitement d'image
WO2022059869A1 (fr) Dispositif et procédé pour améliorer la qualité sonore d'une vidéo
WO2021010562A1 (fr) Appareil électronique et procédé de commande associé
WO2020096406A1 (fr) Procédé de génération de son et dispositifs réalisant ledit procédé
WO2020101174A1 (fr) Procédé et appareil pour produire un modèle de lecture sur les lèvres personnalisé
WO2022010177A1 (fr) Dispositif et procédé de génération de résumé vidéo
WO2022177211A1 (fr) Procédé et dispositif d'évaluation d'une qualité vidéo sur la base de la présence ou de l'absence d'une trame audio

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21869493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21869493

Country of ref document: EP

Kind code of ref document: A1