WO2022059869A1

WO2022059869A1 - Device and method for enhancing sound quality of video

Info

Publication number: WO2022059869A1
Application number: PCT/KR2021/002170
Authority: WO
Inventors: 카주크제이쿱; 자르네키피오트르; 그루지악그루지고르; 카프카슬로보미르
Original assignee: 삼성전자 주식회사
Priority date: 2020-09-15
Filing date: 2021-02-22
Publication date: 2022-03-24
Also published as: KR20220036210A

Abstract

A device and a method for enhancing the sound quality of a video are provided. A method by which a device enhances the sound quality of a video may comprise the steps of: acquiring a video; acquiring sound and images from the acquired video; acquiring a sound source image indicating at least one sound source from the acquired images; acquiring at least one piece of unit sound data corresponding to the at least one sound source from the acquired sound; matching at least one sound source image with the at least one piece of unit sound data by applying a preset sound-image matching model; tracking the motion of the at least one sound source from the sound source image; and adjusting the individual loudness of unit sound data according to the tracked motion of the sound source.

Description

Device and method for improving the sound quality of a video

The present disclosure relates to a device and method for improving the sound quality of a video and, more particularly, to a device and method for improving the overall sound quality of a video by separating the sound by sound source and individually adjusting the volume of the separated unit sound data.

Filming captures the world around us. Every modern mobile device equipped with a camera can capture video. As mobile devices such as smartphones become widespread, the number of individuals shooting and viewing videos is increasing. Although the quality of video recorded with mobile devices has improved over time, most of that work focuses on improving the quality of the recorded visual images or the visual user experience. The improvement of sound quality, on the other hand, is hardly addressed.

In addition, with the spread of mobile devices, rather than watching the same video together on a TV at home, each viewer often watches a different video on a mobile device while on public transportation, in the office, or in the bathroom. When watching a video on a personal mobile device, a headset or earphones are generally used to focus on the video and avoid disturbing others. Headsets and earphones support stereo sound, in which the sounds reproduced on the left and right differ from each other. Therefore, even sound recorded as mono audio through a single microphone needs to be converted to a stereo or other multi-channel format to improve sound quality.

To improve the sound quality of a video on a typical mobile device, a separate microphone such as a shotgun microphone or a lapel microphone is used in addition to the built-in microphone, or the video is transferred to a device such as a computer after shooting and put through a separate manual post-processing operation such as noise removal. Separate professional microphone equipment is expensive, and it is inconvenient to carry it along for every shoot. A separate post-processing step for sound quality improvement requires a video editing program and the professional knowledge to operate it, and it is difficult to edit a video directly on a mobile device with a small screen, such as a smartphone. Therefore, it is not easy for a general user who wants to shoot and distribute a video with a smartphone to improve the sound quality of that video.

Accordingly, there is a need for a technique that automatically improves, within the mobile device itself, the sound quality of a video captured through the camera and microphone of a mobile device such as a smartphone, without requiring separate sound equipment or a post-processing operation.

An embodiment of the present disclosure may provide a device and method that obtain a sound source image representing at least one sound source from the images of a video, separate the sound of the video into unit sound data according to whether the sound is generated from the same sound source, match each sound source image with each unit sound data, and individually adjust the loudness of each unit sound data, thereby adjusting the number of channels of the output sound regardless of the number of channels of the input sound and improving the sound quality of the output video.

In addition, an embodiment of the present disclosure may provide a device and method in which a video is captured through an input unit included in a mobile device and a processor included in the mobile device automatically performs sound processing on the captured video to improve its sound quality, so that no separate equipment is required and the user does not need to perform post-processing manually.

As a technical means for achieving the above-described technical problem, a disclosed method by which a device improves the sound quality of a video includes: acquiring a video; acquiring a sound and an image from the acquired video; obtaining a sound source image representing at least one sound source from the acquired image; obtaining at least one unit sound data corresponding to the at least one sound source from the acquired sound; matching each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model; tracking the movement of the at least one sound source from the sound source image; and individually adjusting the loudness of the unit sound data according to the tracked movement of the sound source. The sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.

As a technical means for achieving the above-described technical problem, a disclosed device includes: an input unit for acquiring a video; an output unit for outputting an output video; a memory storing a program including one or more instructions; and at least one processor executing the one or more instructions stored in the memory. The at least one processor may acquire a video by controlling the input unit, acquire a sound and an image from the acquired video, obtain a sound source image representing at least one sound source from the acquired image, obtain at least one unit sound data corresponding to the at least one sound source from the acquired sound, match each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model, track the movement of the at least one sound source from the sound source image, and individually adjust the loudness of the unit sound data according to the tracked movement of the sound source. The sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.

As a technical means for achieving the above-described technical problem, the computer-readable recording medium may store a program for executing at least one of the embodiments of the disclosed method in a computer.

FIG. 1 is a schematic diagram of a method by which a device improves the sound quality of a video according to an embodiment of the present disclosure.

FIG. 2 is a block diagram of a device according to an embodiment of the present disclosure.

FIG. 3 is a flowchart of a method of improving the sound quality of a video according to an embodiment of the present disclosure.

FIG. 4 is a flowchart of a method of improving the sound quality of a video according to an embodiment of the present disclosure.

FIG. 5 is a diagram for describing an operation in which a device acquires an additional sound through an auxiliary input unit according to an embodiment of the present disclosure.

FIG. 6 is a diagram for explaining an operation in which a device acquires at least one sound source image from an image according to an embodiment of the present disclosure.

FIG. 7 is a diagram for describing an operation in which a device acquires at least one unit sound data from a sound according to an embodiment of the present disclosure.

FIG. 8 is a diagram for explaining an operation in which a device separates a sound according to a sound source image and matches the separated unit sound data to each sound source image according to an embodiment of the present disclosure.

FIG. 9 is a view for explaining a specific embodiment of the operation of the device individually adjusting the volume of unit sound data according to the movement of the tracked sound source according to an embodiment of the present disclosure.

FIG. 10A is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.

FIG. 10B is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.

FIG. 10C is a diagram illustrating an example in which a device acquires multi-channel output sound according to an embodiment of the present disclosure.

FIG. 11 is a diagram illustrating an example in which a device individually adjusts a volume of unit sound data according to an embodiment of the present disclosure.

FIG. 12 is a diagram illustrating an example in which a device individually adjusts a volume of unit sound data according to a motion of a tracked sound source according to an embodiment of the present disclosure.

FIG. 13 is a diagram illustrating an example in which a device adjusts a volume of unit sound data according to a motion of a tracked sound source and obtains an output sound having multi-channels from the adjusted unit sound data according to an embodiment of the present disclosure.

FIG. 14 is a diagram illustrating an example in which a device acquires an additional sound through an auxiliary input unit and acquires an output sound having multi-channels according to an embodiment of the present disclosure.

In one embodiment of the present disclosure, a method by which a device improves the sound quality of a video may be provided. The method may include acquiring a video, acquiring a sound and an image from the acquired video, acquiring a sound source image representing at least one sound source from the acquired image, acquiring at least one unit sound data corresponding to the at least one sound source from the acquired sound, matching each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model, tracking the movement of the at least one sound source from the sound source image, and individually adjusting the volume (loudness) of the unit sound data according to the tracked movement of the sound source. The sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.

In an embodiment, acquiring the video may include acquiring the video through an input unit included in the device, and the input unit may include a microphone for acquiring a sound and a camera for acquiring an image.

In an embodiment, acquiring the video may include acquiring the video through an input unit included in the device and an auxiliary input unit external to the device. The input unit may include a microphone for acquiring a sound and a camera for acquiring an image. The auxiliary input unit may include an auxiliary microphone for acquiring additional sound.

In one embodiment, acquiring at least one unit sound data corresponding to the at least one sound source from the acquired sound may include separating the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum. When two or more unit sound data having the same amplitude, frequency, phase, waveform, and spectrum exist, the acquiring may include separating them into respective unit sound data using the sound source image.

In one embodiment, matching each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model may include additionally using information obtained from the sound source image to match the at least one sound source image with the at least one unit sound data.

In an embodiment, the step of tracking the motion of the at least one sound source from the sound source image may include tracking the movement of the corresponding sound source through a state change of the sound source image.

In one embodiment, tracking the motion of the at least one sound source from the sound source image may include tracking the movement of the sound source through the state change of the sound source image, using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.
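As a rough illustration of how device motion information might be combined with the state change of a sound source image, the sketch below subtracts the apparent image shift caused by a camera rotation from the observed shift of a sound source's bounding-box centre. The pinhole-camera approximation (valid near the image centre), the sign convention, the function names, and the focal length expressed in pixels are all assumptions made for illustration; the disclosure does not specify how the sensor data is used.

```python
import math


def camera_induced_shift(yaw_rad, focal_length_px):
    """Approximate horizontal pixel shift of a static scene point near the
    image centre when the camera yaws by yaw_rad (assumed pinhole model;
    positive yaw = camera turns right, so the scene appears to move left)."""
    return -focal_length_px * math.tan(yaw_rad)


def source_motion(prev_center_x, curr_center_x, yaw_rad, focal_length_px):
    """Horizontal motion of the sound source itself, in pixels, after
    removing the part of the observed shift explained by device rotation."""
    observed = curr_center_x - prev_center_x
    return observed - camera_induced_shift(yaw_rad, focal_length_px)


# Hypothetical values: the source appears to move 50 px left, but the
# gyroscope reports a yaw that explains the whole shift.
yaw = math.atan(0.05)
print(source_motion(960.0, 910.0, yaw, 1000.0))
```

With these values the compensated motion is close to zero, which would indicate that the sound source stayed still and only the device moved.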

In one embodiment, individually adjusting the volume of the unit sound data according to the tracked movement of the sound source may include, for each unit sound data, obtaining a volume curve over the total running time of the unit sound data, obtaining a volume correction curve including the adjustment information to be applied to the unit sound data, and individually adjusting the volume of the unit sound data based on the volume correction curve.
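One way to picture the volume curve and the volume correction curve is sketched below. It assumes the volume curve is a per-frame RMS level and the correction curve is a per-frame gain that moves each frame toward a target level; the frame length, target level, gain cap, and function names are hypothetical. The disclosure does not state how either curve is computed, and this sketch does not model the dependence on tracked motion.

```python
import math


def volume_curve(samples, frame_len):
    """RMS level of each non-overlapping frame over the whole running time."""
    curve = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        curve.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return curve


def correction_curve(curve, target_rms, max_gain=4.0, floor=1e-6):
    """Per-frame gain toward target_rms, capped so silence is not blown up."""
    return [min(max_gain, target_rms / max(level, floor)) for level in curve]


def apply_correction(samples, gains, frame_len):
    """Scale every sample by the gain of the frame it belongs to."""
    return [s * gains[i // frame_len] for i, s in enumerate(samples)]


# Hypothetical unit sound data: a quiet frame followed by a loud frame.
unit = [0.1] * 4 + [0.4] * 4
gains = correction_curve(volume_curve(unit, 4), target_rms=0.2)
print(gains)
print(apply_correction(unit, gains, 4))
```

Applying a step-wise gain per frame can produce audible jumps at frame boundaries; a practical implementation would likely smooth the correction curve, which is omitted here for brevity.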

In an embodiment, the method may further include obtaining an output sound from the unit sound data whose volume is individually adjusted, and obtaining an output video from the output sound and the image.

In one embodiment, obtaining an output sound from the unit sound data whose volume is individually adjusted may include rendering the unit sound data by assigning it to two or more channels, and obtaining an output sound having multiple channels.

In an embodiment of the present disclosure, a device for improving the sound quality of a video may be provided. The device may include an input unit for acquiring a video, an output unit for outputting an output video, a memory storing a program including one or more instructions, and at least one processor executing the one or more instructions stored in the memory. The at least one processor may acquire a video by controlling the input unit, acquire a sound and an image from the acquired video, acquire a sound source image representing at least one sound source from the acquired image, acquire at least one unit sound data corresponding to the at least one sound source from the acquired sound, match each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model, track the movement of the at least one sound source from the sound source image, and individually adjust the volume (loudness) of the unit sound data according to the tracked movement of the sound source. The sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.

In an embodiment, the input unit may include a microphone for acquiring a sound and a camera for acquiring an image.

In one embodiment, the processor may execute one or more instructions to obtain additional sound through an auxiliary microphone external to the device.

In one embodiment, the processor may execute one or more instructions to acquire at least one unit sound data corresponding to the at least one sound source from the acquired sound by separating the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum and, when two or more unit sound data having the same amplitude, frequency, phase, waveform, and spectrum exist, separating them into respective unit sound data using the sound source image.

In one embodiment, the processor may execute one or more instructions to match each of the at least one sound source image with the at least one unit sound data by applying a preset sound-image matching model while additionally using information obtained from the sound source image.

In an embodiment, the processor may execute one or more instructions to track the motion of the sound source through a change in the state of the sound source image.

In one embodiment, the processor may execute one or more instructions to track the movement of the sound source through the state change of the sound source image, using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.

In one embodiment, the processor may execute one or more instructions to individually adjust the volume of the unit sound data according to the tracked movement of the sound source by obtaining, for each unit sound data, a volume curve over the total running time of the unit sound data, obtaining a volume correction curve including the adjustment information to be applied to the unit sound data, and individually adjusting the volume of the unit sound data based on the volume correction curve.

In an embodiment, the processor may further execute one or more instructions to obtain an output sound from the unit sound data whose volume is individually adjusted, and to obtain an output video from the output sound and the image.

In an embodiment, the processor may execute one or more instructions to classify and render unit sound data into two or more channels, and obtain an output sound having multi-channels.

In one embodiment of the present disclosure, a computer-readable recording medium in which a program for executing any one method according to the present disclosure in a computer is recorded may be provided.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can easily implement them. However, the present disclosure may be implemented in several different forms and is not limited to the embodiments described herein. And in order to clearly explain the present disclosure in the drawings, parts irrelevant to the description are omitted, and similar reference numerals are attached to similar parts throughout the specification.

The terms used in the embodiments of the present disclosure have been selected, as far as possible, from general terms that are currently in wide use while considering the functions of the present disclosure, but they may vary depending on the intention of a person skilled in the art, precedent, the emergence of new technology, and the like. In addition, in specific cases, some terms have been arbitrarily selected by the applicant, and in such cases their meaning will be described in detail in the description of the corresponding embodiment. Therefore, the terms used in this specification should be defined based on the meaning of the term and the overall contents of the present disclosure, rather than on the simple name of the term.

The singular expression may include the plural expression unless the context clearly dictates otherwise. Terms used herein, including technical or scientific terms, may have the same meanings as commonly understood by one of ordinary skill in the art described herein.

In the present disclosure, when a part "includes" a certain component, it means that other components may be further included, rather than excluding other components, unless otherwise stated. In addition, terms such as “…unit” and “…module” described in this specification mean a unit that processes at least one function or operation, which may be implemented as hardware or software, or as a combination of hardware and software.

Throughout the specification, when a part is "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" with another element interposed therebetween. Also, when a part "includes" a certain component, it means that other components may be further included, rather than excluding other components, unless otherwise stated.

The expression “configured to” as used herein may, depending on the context, be used interchangeably with, for example, “suitable for”, “having the capacity to”, “designed to”, “adapted to”, “made to”, or “capable of”. The term “configured to” does not necessarily mean only “specifically designed to” in hardware. Instead, in some circumstances, the expression “a system configured to” may mean that the system is “capable of” operating together with other devices or components. For example, the phrase “a processor configured to perform A, B, and C” may refer to a dedicated processor (e.g., an embedded processor) for performing those operations, or to a general-purpose processor (e.g., a CPU or an application processor) capable of performing those operations by executing one or more software programs stored in memory.

In the present disclosure, 'video' means audiovisual material including an auditory sound and a visual screen. The visual component of a video may be described as 'image', 'visual data', or 'picture', and the auditory component of a video may be described as 'audio', 'sound', 'acoustic', or 'sound data'.

In the present disclosure, 'sound quality' means the quality of sound. Sound quality may vary depending on various acoustic factors. For example, the amount of noise may be a criterion for sound quality, and the sound quality may vary according to the flatness of the frequency and the flatness of the volume.

In the present disclosure, a 'sound source' means a source of a sound from which a sound is generated. For example, a sound source may be a person, an animal, various musical instruments, or any object if a sound is generated therefrom.

In the present disclosure, 'sound corresponding to a specific sound source' means a sound generated from the specific sound source. For example, a sound corresponding to a specific person may mean a voice made by the person, and a sound corresponding to a specific animal may mean a cry of the corresponding animal.

In the present disclosure, 'screen range' means the area in which the visual screen of a video is displayed. For example, the screen range may be the area defined by the border of an image captured from a video at a specific point in time.

In the present disclosure, 'mono' (monaural, monophonic) audio means audio composed of one channel. Sound recorded through one microphone or heard through one speaker may be mono audio. Even if a sound is recorded through multiple microphones or reproduced through multiple speakers, it may be mono audio if it is carried on only one channel. In mono audio, the same sound is played from all connected speakers.

In the present disclosure, 'multi-channel' audio means audio composed of two or more channels. For example, in stereo audio, which is a type of multi-channel audio, the signals of the two channels are combined and reproduced as one when listening through a single speaker, but when listening through two speakers (e.g., a headset or earphones), different sounds are played from the two speakers, so that a sense of space and a richer sound can be reproduced compared to mono audio.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a method by which a device 1000 improves the sound quality of a video according to an embodiment of the present disclosure.

Referring to FIG. 1, a video input to the device 1000 may include an image 110 and a sound 150. In an embodiment, the image 110 may be recorded through an input device such as a camera, and the sound 150 may be recorded through an input device such as a microphone. The device 1000 may acquire sound source images SS1, SS2, and SS3 separated into individual sound sources from the recorded image 110. In an embodiment, the sound source images may be divided into speakers (persons) (SS1, SS2) and a background (e.g., a person who is not a speaker, a desk, a chair, paper, and other background imagery) (SS3).

The device 1000 may acquire the unit sound data set 160 obtained by dividing the sound 150 by the sound generated by each sound source. In an embodiment, the device 1000 may classify the separated unit sound data set 160 as human voices (UA1, UA2) or background sound (noise) (UA3).

The device 1000 may match the classified unit sound data set 160 to the separated sound source images SS1, SS2, and SS3, respectively, by applying a preset sound-image matching model. For example, the unit sound data (UA1, UA2) classified as human voices may be matched to the sound source images (SS1, SS2) determined to be persons, and the other sound (UA3) may be matched to the background image (SS3). When two or more people are talking, the device 1000 may match the separated unit sound data UA1 and UA2 to the respective sound source images (SS1, SS2) using the information on 'a person's face and the voice corresponding to that face' included in the sound-image matching model.
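The matching step can be pictured as an assignment problem. The sketch below assumes that the sound-image matching model yields a compatibility score for every pair of sound source image and unit sound data, and it picks the one-to-one pairing with the highest total score by exhaustive search. The score values, the function name, and the use of exhaustive search are illustrative assumptions; the disclosure does not describe how the model's matching information is applied.

```python
from itertools import permutations


def match_sources(scores):
    """scores[i][j] is a hypothetical compatibility score between sound
    source image i and unit sound data j (square matrix assumed).
    Returns a list m where m[i] is the unit sound data index matched to
    image i, chosen to maximise the total score over all pairings."""
    n = len(scores)
    best_total, best_perm = float("-inf"), ()
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm)


# Rows: SS1, SS2, SS3; columns: UA1, UA2, UA3 (made-up scores).
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
for image_idx, unit_idx in enumerate(match_sources(scores)):
    print(f"SS{image_idx + 1} <-> UA{unit_idx + 1}")
```

Exhaustive search is only practical for a handful of sound sources because the number of pairings grows factorially; it is used here because it keeps the example dependency-free and makes the objective explicit.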

The device 1000 may individually adjust the volume of each of the separated unit sound data UA1, UA2, and UA3. For example, the volume of the unit sound data UA3 corresponding to noise may be reduced, and the volume of the unit sound data (UA1, UA2) corresponding to the speakers' conversation may be adjusted to a level corresponding to the input signal or to a preset level.

The device 1000 may resynthesize the output sound 170 from the adjusted unit sound data. In an embodiment, the output sound 170 may be synthesized in a stereo format, which is a type of multi-channel sound, for output to an output device such as headphones. For example, the output sound 170 may include a first channel 171 output to the left speaker LC and a second channel 173 output to the right speaker RC.

The device 1000 may determine the rendering channel of the unit sound data UA1 and UA2 corresponding to each of the sound source images SS1 and SS2 according to the relative positions of the sound source images SS1 and SS2 within the screen range. For example, since the first sound source image SS1 is located on the left within the screen range, the first unit sound data UA1 corresponding to the first sound source image SS1 may be rendered on the first channel 171 output to the left speaker LC. In addition, the second unit sound data UA2 corresponding to the second sound source image SS2 located on the right side within the screen range may be rendered on the second channel 173 output to the right speaker RC. In an embodiment, the device 1000 may adjust the number of channels of the output sound 170 irrespective of the number of channels of the input sound 150, and may improve the sound quality of the video so as to output a sense of space and a rich sound.
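The position-dependent rendering described above can be sketched as follows. The text only says that a source on the left is rendered on the left channel and a source on the right on the right channel; the sketch generalises this with constant-power panning, a commonly used technique that is swapped in here as an assumption and is not taken from the disclosure. The function names and the frame width are likewise hypothetical.

```python
import math


def pan_gains(center_x, frame_width):
    """Constant-power (left, right) gains from the horizontal position of a
    sound source image; 0 is the left edge, frame_width the right edge."""
    pos = min(max(center_x / frame_width, 0.0), 1.0)
    angle = pos * math.pi / 2
    return math.cos(angle), math.sin(angle)


def render_stereo(units, frame_width):
    """units: list of (samples, center_x) pairs, one per unit sound data.
    Returns (left, right) sample lists mixed from all units."""
    length = max((len(samples) for samples, _ in units), default=0)
    left = [0.0] * length
    right = [0.0] * length
    for samples, center_x in units:
        gain_l, gain_r = pan_gains(center_x, frame_width)
        for i, s in enumerate(samples):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right


# Hypothetical mono units: UA1 at the far left, UA2 at the far right.
ua1 = ([1.0, 1.0], 0.0)
ua2 = ([0.5], 1920.0)
left, right = render_stereo([ua1, ua2], frame_width=1920.0)
print(left, right)
```

A source at the far left ends up almost entirely in the left channel and a source at the far right almost entirely in the right channel, which matches the example in the text, while intermediate positions are split between both channels at constant total power.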

FIG. 2 is a block diagram of a device 1000 according to an embodiment of the present disclosure.

Referring to FIG. 2, the device 1000 may include an input unit 1100, a processor 1300, a memory 1500, an output unit 1700, and a motion sensor 1900. Not all of the components shown in FIG. 2 are essential components of the device 1000. The device 1000 may be implemented with more components than those shown in FIG. 2, or with fewer components than those shown in FIG. 2.

The input unit 1100 may acquire a video from the outside. In an embodiment, the input unit 1100 may include a video recording unit for acquiring a visual image and an audio recording unit for acquiring an auditory sound. For example, the video recording unit may include a camera, and the audio recording unit may include a microphone (mic). In an embodiment, the input unit 1100 may have a single configuration that is not physically separated into a video recording unit and an audio recording unit.

The output unit 1700 may output an output video to the outside. The output unit 1700 may include a display 1710 and an audio output unit 1720.

The display 1710 may output a visual image by displaying it externally. In one embodiment, the display 1710 may include a panel. The display 1710 may be configured as, for example, at least one of a liquid crystal display, a thin-film-transistor liquid crystal display, an organic light-emitting diode display, a flexible display, a 3D display, and an electrophoretic display.

The audio output unit 1720 may reproduce and output an auditory sound to the outside. In an embodiment, the audio output unit 1720 may include a speaker. The audio output unit 1720 may include, for example, a single speaker, two or more speakers, a mono speaker, a stereo speaker, a surround speaker, a headset, or an earphone.

In an embodiment, the display 1710 and the audio output unit 1720 of the output unit 1700 may have a single structure that is not physically separated.

The memory 1500 may store a program to be executed by the processor 1300, described later, in order to control the operation of the device 1000. The memory 1500 may store a program including at least one instruction for controlling the operation of the device 1000. Instructions and program codes readable by the processor 1300 may be stored in the memory 1500. In one embodiment, the processor 1300 may be implemented to execute instructions or codes of a program stored in the memory 1500. The memory 1500 may store data input to or output from the device 1000.

The memory 1500 may include, for example, at least one type of storage medium among a flash memory, a hard disk, a multimedia card micro type memory, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), a magnetic memory, a magnetic disk, and an optical disk.

Programs stored in the memory 1500 may be classified into a plurality of modules according to their functions. For example, the memory 1500 may include a sound-image separation module 1510, a sound source image acquisition module 1520, a unit sound data acquisition module 1530, a matching module 1540, a sound source motion tracking module 1550, and a volume adjustment module 1560. In addition, the memory 1500 may include a sound-image matching model 1570, a deep neural network (DNN) 1580, and a database 1590.

The processor 1300 may control the overall operation of the device 1000. For example, by executing programs stored in the memory 1500, the processor 1300 may control overall the input unit 1100, the output unit 1700 including the display 1710 and the audio output unit 1720, the motion sensor 1900, the memory 1500, and the like.

The processor 1300 may be composed of hardware components that perform arithmetic, logic, and input/output operations and signal processing. The processor 1300 may be configured as, for example, at least one of a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), and field-programmable gate arrays (FPGAs), but is not limited thereto.

The processor 1300 may acquire a video through the input unit 1100 by executing at least one instruction stored in the memory 1500. The video may include an image, which is visual data, and a sound, which is auditory data.

The processor 1300 may obtain a sound and an image from the acquired video by executing at least one instruction constituting the sound-image separation module 1510 among the programs stored in the memory 1500. In an embodiment, a video composed of a single file may be divided into a sound file that is auditory data and an image file that is visual data.

The processor 1300 may obtain a sound source image representing at least one sound source from the obtained image by executing at least one instruction constituting the sound source image acquisition module 1520 among the programs stored in the memory 1500. An image may be composed of one continuous screen that is not divided. From such a continuous screen, individual objects such as a person, an animal, or a thing may be separated. Each of the separated objects may be a sound source that generates a sound. In an embodiment, at least one sound source image may be obtained from the image using a deep neural network (DNN) or a database in which image files are accumulated.

The processor 1300 may acquire, from the acquired sound, unit sound data determined according to whether or not it is generated from the same sound source, by executing at least one instruction constituting the unit sound data acquisition module 1530 among the programs stored in the memory 1500. The unit sound data acquisition module 1530 may separate an input sound composed of one channel into unit sound data of a plurality of channels generated from different sound sources. In an embodiment, the unit sound data acquisition module 1530 may separate the sound into unit sound data using a deep neural network (DNN) or a database in which audio information is accumulated. Information on the separated unit sound data may be transferred to and stored in the database 1590 stored in the memory 1500, and the database 1590 may be updated. In an embodiment, a model for separating sound data may be preset and stored in the database.

The processor 1300 may match at least one sound source image with at least one unit sound data by applying the preset sound-image matching model 1570, by executing at least one instruction constituting the matching module 1540 among the programs stored in the memory 1500.

The sound-image matching model 1570 may include matching information between an image of a specific sound source and a sound generated by the specific sound source. The matching information may include sound information according to information on a sound source image, that is, characteristics of a sound according to a specific image (e.g., a barking sound corresponding to a dog image, a rustling sound corresponding to a tree image, or voice information of a specific person based on a face image when two or more people are talking), and information on a sound source image according to a specific sound (e.g., the mouth shape in the image when a specific sound is generated).

In an embodiment, the acoustic-image matching model 1570 may be established using a deep neural network (DNN) or a database in which image files and audio information are accumulated.

In an embodiment, the individual sound source images separated from the image and the individual unit sound data separated from the sound may be matched with each other based on the correspondence information between sound and image included in the sound-image matching model 1570. Information on each sound source image and unit sound data matched according to the preset sound-image matching model 1570 may be fed back to and stored in the sound-image matching model 1570 stored in the memory 1500 to update the sound-image matching model 1570, or may be transferred to and stored in the database 1590 to update the database 1590.

The processor 1300 may track the motion of at least one sound source from the sound source image by executing at least one instruction constituting the sound source motion tracking module 1550 among the programs stored in the memory 1500. In a video, the state or position of a specific sound source image may change over time. In an embodiment, the sound source motion tracking module 1550 may analyze the image (screen) to determine the moving direction, speed, change in shape of the sound source image, and the like of a specific sound source, and may thereby obtain a motion profile for that sound source. In one embodiment, the obtained motion profile may be used to individually adjust the volume of each unit sound data in a subsequent step.

The processor 1300 may individually adjust the volume of the unit sound data according to the tracked movement of the sound source by executing at least one instruction constituting the volume adjustment module 1560 among the programs stored in the memory 1500. In an embodiment, the volume of the unit sound data may be adjusted so that the output sound maintains a constant volume throughout the video. For example, when recording a person speaking, as the sound source (the person speaking) moves farther from the device 1000 during recording, the volume of the input unit sound data decreases. In this case, taking the volume when the sound source is close to the device 1000 as a reference, the volume may be adjusted so that the unit sound data has a constant volume throughout the video. In this way, based on the motion profile of the sound source obtained earlier by the sound source motion tracking module 1550, the volume of the unit sound data corresponding to the sound source may be individually adjusted.
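As a rough illustration of adjusting volume from a motion profile, the following Python sketch derives a per-frame gain from the apparent size of a sound source image. The function name `compensation_gains`, the use of bounding-box height as a distance cue, and the 1/distance level model are assumptions made for this example and are not specified in the present disclosure.

```python
def compensation_gains(box_heights, reference_index=0):
    """Return one linear gain per frame that offsets distance-related level loss.

    Assumptions (hypothetical, not from the disclosure):
    - the apparent height of the sound source image is inversely
      proportional to the distance between the source and the camera;
    - sound pressure falls off as 1/distance.
    """
    reference = box_heights[reference_index]
    if reference <= 0:
        raise ValueError("reference box height must be positive")
    gains = []
    for height in box_heights:
        if height <= 0:
            raise ValueError("box height must be positive")
        # Distance relative to the reference frame (1.0 at the reference).
        relative_distance = reference / height
        # Amplitude is assumed to scale with 1/distance, so the gain
        # that restores the reference level scales with distance.
        gains.append(relative_distance)
    return gains


# A source whose image shrinks to half and then a quarter of its height.
print(compensation_gains([200, 100, 50]))
```

Multiplying the samples of the corresponding unit sound data by these gains frame by frame would keep that source at roughly the level it had in the reference frame.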

In one embodiment, the processor 1300 may individually adjust the volume of the unit sound data according to the type of sound source by executing at least one instruction constituting the volume adjustment module 1560 among the programs stored in the memory 1500. For example, when specific unit sound data is classified as noise, the volume of that unit sound data may be reduced. In an embodiment, when only a specific type of sound is to be output, only the unit sound data of that type may be passed through, and the volume of the other types of unit sound data may be reduced in the same way as noise.
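The type-based adjustment can be sketched as a simple gain table. In this minimal Python example, the type labels (`"noise"`, `"speech"`, `"music"`), the function name `gains_by_type`, and the default attenuation of -30 dB are all hypothetical choices for illustration.

```python
def gains_by_type(unit_types, keep=None, attenuation_db=-30.0):
    """Return a linear gain for each unit sound data based on its type label.

    unit_types: list of type labels, one per unit sound data (hypothetical labels).
    keep: optional set of labels to pass through; every other type is
          attenuated in the same way as noise.
    """
    reduced = 10 ** (attenuation_db / 20.0)  # convert dB to a linear factor
    gains = []
    for label in unit_types:
        if label == "noise":
            gains.append(reduced)
        elif keep is not None and label not in keep:
            gains.append(reduced)
        else:
            gains.append(1.0)
    return gains


# Attenuate only noise.
print(gains_by_type(["speech", "noise", "music"]))
# Output speech only: music is attenuated like noise.
print(gains_by_type(["speech", "noise", "music"], keep={"speech"}))
```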

A deep neural network (DNN) 1580 stored in the memory 1500 is a type of artificial neural network, and may have a feature of being composed of several hidden layers between an input layer and an output layer. A deep neural network (DNN) 1580 can model complex nonlinear relationships like a general artificial neural network. For example, in a deep neural network structure for an object identification model, each object may be expressed as a hierarchical configuration of image basic elements. In this case, the additional layers may aggregate the characteristics of the gradually gathered lower layers. This feature of the deep neural network (DNN) 1580 makes it possible to model complex data with fewer units. The deep neural network (DNN) 1580 may be applied to image recognition or speech recognition fields, and may be used for processing to separate images and match the separated images with respective voice information as in the present disclosure.

The database 1590 stored in the memory 1500 may be configured as a set of a vast amount of data. In an embodiment, the database 1590 may include sound information corresponding to a specific sound source image and image information corresponding to a specific sound. In an embodiment, the database 1590 may be used to set the sound-image matching model 1570 by acquiring matching information indicating the correspondence between the sound and the image. Also, the database 1590 may be used to match unit sound data and sound source images, or to adjust the volume of each unit sound data.

The processor 1300 may obtain an output sound from the unit sound data whose volume is individually adjusted, and an output video from the output sound and the image, by executing at least one instruction stored in the memory 1500. In an embodiment, final output sound data may be obtained by resynthesizing the unit sound data whose volume is individually adjusted. In an embodiment, the processor 1300 may classify and render the unit sound data into two or more channels in order to obtain a multi-channel output sound, such as a stereo format. For example, unit sound data corresponding to sound source images located on the left side of the display screen may be rendered as a first channel, and unit sound data corresponding to sound source images located on the right side of the display screen may be rendered as a second channel. Thereafter, an output sound in which the first channel is reproduced from the left speaker and the second channel is reproduced from the right speaker may be acquired. In an embodiment, the output sound may be not only in a stereo format, but also in a surround format, an Ambisonic format, or another multi-channel format.
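The left/right rendering described above can be sketched in a few lines of Python. The function name `render_stereo`, the representation of unit sound data as lists of samples, and the hard split at the horizontal center of the frame are assumptions for this example; an actual renderer might pan gradually instead.

```python
def render_stereo(units, frame_width):
    """Mix unit sound data into a left and a right channel by screen position.

    units: list of (samples, center_x) pairs, where samples is a list of
           floats and center_x is the horizontal center of the matched
           sound source image in pixels (hypothetical representation).
    """
    length = max(len(samples) for samples, _ in units)
    left = [0.0] * length
    right = [0.0] * length
    for samples, center_x in units:
        # Sources on the left half go to the first (left) channel,
        # sources on the right half go to the second (right) channel.
        target = left if center_x < frame_width / 2 else right
        for i, value in enumerate(samples):
            target[i] += value
    return left, right


left, right = render_stereo([([1.0, 2.0], 100), ([0.5, 0.5, 0.5], 500)], 640)
print(left, right)
```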

In an embodiment, the device 1000 may further include a motion sensor 1900 . The motion sensor 1900 may include an accelerometer 1910 , a gyroscope 1920 , and a magnetometer 1930 . The motion sensor 1900 may detect a motion of the device 1000 . When there is a movement of the device 1000 itself including the input unit 1100 for acquiring an image, even if there is no actual movement of an object, it may be recognized that there is a movement of the sound source image in the acquired image. In an embodiment, based on the motion information of the device 1000 obtained from the motion sensor 1900, a motion profile of the sound source may be additionally obtained through a relative change on the screen of the sound source image. The obtained additional sound source motion profile may be used to adjust the volume of unit sound data matched to the corresponding sound source image.
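One way to picture the use of device motion information is to subtract the on-screen shift caused by the camera itself from the shift observed for a sound source image. The sketch below assumes a pinhole camera model, a pure horizontal rotation (yaw) of the device, and a focal length expressed in pixels; the function name `source_shift_px` and the sign convention are hypothetical and are not taken from the present disclosure.

```python
import math


def source_shift_px(observed_shift_px, yaw_rad, focal_px):
    """Estimate the horizontal shift caused by the sound source itself.

    Assumptions (hypothetical): pinhole camera, small scene depth variation,
    and a positive yaw meaning the device turns to the right, which moves a
    static object to the left on the screen.
    """
    # Shift that a static object would show because of the device rotation.
    induced_shift_px = -focal_px * math.tan(yaw_rad)
    # Whatever remains is attributed to the movement of the source.
    return observed_shift_px - induced_shift_px


# A static object seen while the device pans right by 0.1 rad:
observed = -1000.0 * math.tan(0.1)
print(source_shift_px(observed, 0.1, 1000.0))
```

With this correction, a sound source that only appears to move because the device moved would not trigger a volume adjustment.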

Conventionally, in order to obtain an image including high-quality sound, it is necessary to use a professional sound recording device or to perform an image post-processing after recording with a general sound device. With the development of the Internet and social networks, more and more creators personally shoot, edit, and distribute videos. Rather than using professional equipment, such individual creators often use the camera and microphone included in mobile devices such as smart phones to shoot videos. In image capturing by a mobile device, there was a lot of improvement in the image processing area, which is visual data, but there was no significant improvement in the audio processing area, which is auditory data. Improving sound quality is important for viewing more realistic images.

In the device 1000 according to an embodiment of the present disclosure, the processor 1300 executes one or more instructions stored in the memory 1500 to separate the video acquired from the input unit 1100 included in the device 1000 into an image that is visual data and a sound that is auditory data, obtain a sound source image representing at least one sound source from the image, separate the sound into unit sound data according to whether or not it is generated from the same sound source, match the sound source images with the unit sound data, and adjust the loudness of each unit sound data, whereby the sound quality of the output video can be improved.

Therefore, even when recording is performed using the input unit 1100, such as a microphone included by default in a mobile device such as a smart phone, the captured video can be post-processed immediately inside the device 1000 through the processor 1300 included in the device 1000. In this case, separate audio equipment is not required to improve the sound quality of the captured video, and the mobile device automatically post-processes the sound even if the user has no professional video post-processing skills, so that a video with improved sound quality can be obtained.

In addition, in the device 1000 according to an embodiment of the present disclosure, the processor 1300 may render the separated individual unit sound data into two or more different channels by executing one or more instructions stored in the memory 1500. Accordingly, even when mono audio is recorded through a single microphone, the output video may have multi-channel stereo sound, surround sound, or Ambisonic sound. In this way, the number of channels of the output sound can be adjusted irrespective of the number of channels of the input sound, and high-quality sound for a more realistic video can be obtained.

In step S300, a video may be acquired. A video may mean an audiovisual expression drawn on a two-dimensional plane, and may mean a moving picture. In an embodiment, the video may be acquired through an input unit including a microphone for acquiring a sound and a camera for acquiring an image.

In step S310, a sound may be obtained from the video. For example, the sound may include a human voice, an animal sound, a sound generated from an object, noise, and the like. In one embodiment, the sound may be single-channel mono audio recorded from a single microphone or multi-channel audio recorded from a plurality of microphones.

In step S320, an image may be obtained from the video. For example, the image may be visual data recorded from a camera. In an embodiment, the image may include sound source images of various sound sources.

In step S330, at least one unit sound data corresponding to at least one sound source may be obtained from the sound. For example, it is possible to obtain at least one unit sound data determined according to whether or not it is generated in the same sound source. In an embodiment, a sound composed of mono audio of a single channel may be divided into a plurality of unit sound data generated from different sound sources. In an embodiment, when dividing a sound into a plurality of unit sound data, an image may be used.

In operation S340, a sound source image representing at least one sound source may be obtained from the image. For example, from an image composed of continuous visual data, objects, such as a person, an animal, an object, a background, etc., each of which can be a sound source for generating a sound may be separated.

In step S350, by applying a preset sound-image matching model, at least one sound source image and at least one unit sound data may be matched, respectively. In an embodiment, the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source. In an embodiment, the acoustic-image matching model may be preset through a deep neural network (DNN). The sound and image matching information may include sound information according to information on a sound source image and information on a sound source image according to a specific sound.

In an embodiment, the sound source images separated from the image and the unit sound data separated from the sound may each be matched one-to-one, many-to-one, or many-to-many based on an acoustic-image matching model. In one embodiment, when matching unit sound data and sound source image, the movement of the sound source image and the change of the sound source image may be considered.
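For the one-to-one case, the matching step can be pictured as an assignment over pairwise scores. The sketch below assumes that some sound-image matching model has already produced a similarity score for every pair of sound source image and unit sound data; the score matrix, the function name `match_greedy`, and the greedy strategy are assumptions for illustration, not the matching procedure of the present disclosure.

```python
def match_greedy(scores):
    """Greedily pair sound source images with unit sound data, one-to-one.

    scores[i][j] is a hypothetical similarity between sound source image i
    and unit sound data j (higher means a better match).
    Returns a dict mapping image index to unit sound data index.
    """
    # List every (score, image, unit) triple, best score first.
    candidates = sorted(
        ((score, i, j) for i, row in enumerate(scores) for j, score in enumerate(row)),
        reverse=True,
    )
    used_images, used_units, matches = set(), set(), {}
    for score, i, j in candidates:
        # Accept a pair only if neither side has been matched yet.
        if i not in used_images and j not in used_units:
            matches[i] = j
            used_images.add(i)
            used_units.add(j)
    return matches


# Two images, two unit sound data: image 0 takes unit 0, image 1 falls back to unit 1.
print(match_greedy([[0.9, 0.2], [0.8, 0.3]]))
```

A greedy pass is simple but not always optimal; many-to-one or many-to-many matching would need a different rule, such as thresholding each score independently.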

In step S360, the movement of at least one sound source may be tracked from the sound source image. The motion may be tracked for each sound source image. The motion profile of the sound source may be used to individually adjust the volume of each unit sound data in a subsequent step. In one embodiment, the movement of the sound source may be calculated and tracked through the change of the sound source image on the screen. In one embodiment, the movement of the sound source may also be calculated and tracked through the relative change of the sound source image on the screen, using motion information of the device obtained from a motion sensor including an accelerometer, a gyroscope, and a magnetometer.

In step S370, the volume (loudness) of the unit sound data may be individually adjusted according to the movement of the tracked sound source. In one embodiment, the volume of each unit sound data separated from the sound may be decreased or increased, or may be adjusted to have a constant volume in the entire image. By individually adjusting the volume of each unit sound data, it is possible to optimize and tune the overall sound.

In step S380 , an output sound may be obtained from unit sound data whose volume is individually adjusted. In an embodiment, final output sound data may be obtained by resynthesizing unit sound data whose volume is individually adjusted. In an embodiment, unit sound data may be classified into two or more channels and rendered, and output sound having a multi-channel such as a stereo format may be obtained. For example, the output sound may be resynthesized by adjusting the volume of unit sound data corresponding to noise to be small, or may be resynthesized to have multi-channels for a rich sound. The sound quality may be improved.

In step S390, an output video may be obtained from the output sound and the image. Compared with the input video, the output video has the same image (screen) but includes the adjusted unit sound data, so its sound quality can be improved.

In step S400, a video including a sound and an image may be acquired, and in steps S410 and S420, the sound and the image may be obtained by separating them from the video.

In operation S430, a sound source image indicating at least one sound source may be obtained from the image. For example, from an image composed of continuous visual data, objects, such as a person, an animal, an object, a background, etc., each of which can be a sound source for generating a sound may be separated.

In step S440, at least one sound source image and a part of the sound may be matched by applying a preset sound-image matching model. In an embodiment, the sound-image matching model may include matching information between an image of a specific sound source and a sound generated by the specific sound source.

In step S450, at least one unit sound data corresponding to at least one sound source may be obtained from the sound and the separated sound source image. For example, it is possible to obtain at least one unit sound data determined according to whether or not it is generated in the same sound source. In an embodiment, when a sound composed of mono audio of a single channel is divided into a plurality of unit sound data generated from different sound sources, a sound source image may be used. For example, when sound is divided into unit sound data, it is possible to determine in advance which sound source exists by referring to the previously separated sound source image, and to preferentially separate the sound by the sound source. For example, a portion of the sound matched to each sound source image may be separated into each unit sound data.

The operation of preferentially separating the sound source image from the image and separating the unit sound data from the sound using the matched sound source image may be useful when there is a sound source that does not appear in the image of the video. For example, after separating sounds corresponding to each separated sound source image into respective unit sound data, unit sound data corresponding to a sound source not appearing in the image of the video may be obtained from the remaining sound data.
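The idea of treating the remaining sound as belonging to an off-screen source can be sketched as a subtraction. This example assumes that the recorded sound is a simple sample-wise sum of its unit sound data and that all signals are time-aligned; the function name `residual_sound` is hypothetical.

```python
def residual_sound(mix, matched_units):
    """Return what is left of the mix after removing the matched unit sound data.

    mix: list of samples of the recorded sound.
    matched_units: list of sample lists, one per unit sound data that was
                   matched to a sound source image (each no longer than mix).
    Assumes the mix is an additive, time-aligned sum of its components.
    """
    residual = list(mix)
    for unit in matched_units:
        for i, value in enumerate(unit):
            residual[i] -= value
    return residual


mix = [1.0, 1.5, 0.5]
on_screen = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25]]
print(residual_sound(mix, on_screen))
```

The residual would then be treated as the unit sound data of the sound source that does not appear in the image.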

In step S460, the movement of at least one sound source may be tracked from the sound source image of the sound source. The motion of the sound source may be tracked for each sound source image. The motion profile of the sound source may be used to individually adjust the volume of each unit sound data in a subsequent step.

Thereafter, similarly to the above-described embodiment, the loudness of the unit sound data may be individually adjusted in step S470 according to the tracked movement of the sound source, and an output sound may be obtained in step S480 from the unit sound data whose volume is individually adjusted. The sound quality of the output sound may be improved compared to the initially input sound.

FIG. 5 is a diagram for explaining an operation in which the device 1000 acquires an additional sound through the auxiliary input unit 2100 according to an embodiment of the present disclosure.

The device 1000 according to an embodiment of the present disclosure may itself include an input unit including a microphone for acquiring a sound and a camera for acquiring an image.

In an embodiment, the device 1000 may acquire an additional sound through the auxiliary input unit 2100 external to the device 1000. For example, the auxiliary input unit 2100 may include an auxiliary microphone such as a lapel microphone. The sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 may be sounds generated from the same sound source SS. Alternatively, in an embodiment, the sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 may be sounds generated from different sound sources and may constitute a multi-channel sound.

When the sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 are generated from the same sound source SS, the two sounds may differ only in volume or signal-to-noise ratio. In this case, the sound input through the auxiliary input unit 2100 may be transmitted to the device 1000 and used for post-processing of the video together with the sound acquired by the device 1000. For example, when the sound input through the auxiliary input unit 2100 has a better signal-to-noise ratio, it may be used to remove acoustic noise from the video acquired by the device 1000. The sound acquired through the auxiliary input unit 2100 may support the sound acquired by the device 1000 itself, and does not completely replace it.

When the sound obtained directly by the device 1000 and the sound obtained through the auxiliary input unit 2100 constitute a multi-channel sound as sounds of different channels, the sounds may be post-processed together or independently to obtain an output sound in a new mono-channel format or multi-channel format.

FIG. 6 is a diagram for explaining an operation in which the device 1000 acquires at least one sound source image from the image 610 according to an embodiment of the present disclosure.

The device 1000 may obtain a sound source image representing at least one sound source from the image 610 .

Referring to FIG. 6, the image 610 of an acquired video may be composed of continuous visual data. From the continuous visual data, objects such as people, animals, things, and backgrounds, each of which can be a sound source generating a sound, may be separated. In an embodiment, the image may be analyzed through deep learning or artificial intelligence based on deep neural network (DNN) technology, in which case highly accurate recognition of various objects is possible.

Image recognition technology of artificial intelligence (AI) classifies images into several patterns and learns pattern-type data to determine what an image is when a new image is given. In one embodiment, the device 1000 may separate, through a deep neural network (DNN) or artificial intelligence (AI), human images (H1, H2, H3, H4, H5, H6) and dog images (D1, D2) in the image 610. The separated human images H1, H2, H3, H4, H5, and H6 and dog images D1 and D2 may each be a sound source image. In this way, the device 1000 may separate and obtain at least one sound source image from the image 610.

FIG. 7 is a diagram for describing an operation in which the device 1000 acquires at least one unit sound data set 760 from the sound 750 according to an embodiment of the present disclosure.

The unit sound data 760 may be determined according to whether it is generated from the same sound source. Sound has three components: intensity, timbre, and pitch. These three components correspond to the amplitude, waveform, and frequency of a sound wave, respectively. The larger the amplitude of the wave, the louder the sound, and the higher the frequency of the wave, the higher the pitch. The timbre of a sound is determined by its waveform. The sounds of a piano, a person, a violin, and so on differ even for the same note because their waveforms differ.

Also, an envelope may be considered when distinguishing sounds. The envelope is the change of sound with time, and it means the time for a note to reach its peak, the time for the sound to become stable, the time for the sound to last, and the time until the sound disappears. The envelope can vary depending on how the sound source generates sound. In an embodiment, whether the sound is generated from the same sound source may be determined according to three elements and an envelope of the sound.
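To make the notion of an envelope concrete, the following Python sketch computes a coarse amplitude envelope as the peak absolute value in consecutive windows, and reports how many windows it takes to reach the peak. The function names `amplitude_envelope` and `windows_to_peak` and the window-based peak detector are assumptions for this example; the present disclosure does not specify how the envelope is measured.

```python
def amplitude_envelope(samples, window):
    """Return the peak absolute amplitude in each consecutive window."""
    return [
        max(abs(value) for value in samples[start:start + window])
        for start in range(0, len(samples), window)
    ]


def windows_to_peak(envelope):
    """Return the index of the window in which the envelope first peaks.

    This serves as a crude stand-in for the time a note takes to reach its peak.
    """
    return envelope.index(max(envelope))


samples = [0.1, -0.2, 0.5, -0.9, 0.4, 0.3, -0.1, 0.0]
envelope = amplitude_envelope(samples, 2)
print(envelope)                   # [0.2, 0.9, 0.4, 0.1]
print(windows_to_peak(envelope))  # 1
```

Two sounds with similar pitch but different envelopes, such as a plucked and a bowed string, would produce different values here, which is one cue for deciding whether they come from the same sound source.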

In one embodiment, the operation of separating and obtaining the unit sound data set 760 from the sound 750 may include separating the sound 750 into at least one unit sound data 761, 762, 763, and 764 according to amplitude, frequency, phase, waveform, and spectrum. For example, from the synthesized sound 750, the sounds of four instruments may be separated and acquired as four unit sound data 761, 762, 763, and 764.

In one embodiment, when the amplitude, frequency, phase, waveform, and spectrum of two or more unit sound data are all the same, the sound source image may be used to separate each unit sound data. For example, if the image includes a split screen and the same person is speaking on each split screen at the same time, unit sound data corresponding to each split screen can be separated by referring to the shape of the person's mouth on each split screen. As another example, when two or more instruments of the same type exist, unit sound data may be separated for each instrument with reference to the hand shape of the person playing the instrument. As such, when it is difficult to separate unit sound data for each sound source using the characteristics of the sound alone, image data may be additionally used.

FIG. 8 is a diagram for explaining an operation in which the device 1000 according to an embodiment of the present disclosure separates the sound 850 into unit sound data 861 and 862 according to the sound source images 821 and 822 and matches the separated unit sound data 861 and 862 to the respective sound source images 821 and 822.

Referring to FIG. 8, a video may include an image 810, which includes sound source images 821 and 822 separated by individual sound source, and a sound 850. The sound 850 may be a synthesis of sounds A1 and A2 generated from the sound sources of the respective sound source images 821 and 822. In one embodiment, the sound 850 may be divided into unit sound data 861 and 862 by applying a sound-image matching model including voice information corresponding to each of the sound source images 821 and 822. The separated unit sound data 861 and 862 may be matched with the respective sound source images 821 and 822 according to the voice information.

In an embodiment, the operation of matching each sound source image and unit sound data may additionally use information obtained from the sound source image. For example, when the same person is speaking on the split screen at the same time, unit sound data may be matched to the split screen by referring to the shape of the person's mouth on each split screen. For example, when two or more instruments of the same type exist, unit sound data may be matched for each instrument with reference to a hand shape of a person playing the instrument.

In an embodiment, when only a specific sound is to be output, only the specific type of sound to be output may be passed through from among the unit sound data 861 and 862, and the volume of the other types of unit sound data may be reduced.

For example, referring to FIG. 8 , when it is desired to output only unit sound data 861 corresponding to the sound source image 821 , the unit sound data 862 corresponding to the sound source image 822 may be muted. In addition, when it is desired to output only the unit sound data 862 corresponding to the sound source image 822 , the unit sound data 861 corresponding to the sound source image 821 may be muted.

FIG. 9 is a view for explaining a specific embodiment of the operation (S370) in which the device 1000 according to an embodiment of the present disclosure individually adjusts the volume of unit sound data according to the tracked motion of the sound source.

In step S910, a volume curve over the entire running time may be obtained for each unit sound data. For example, the level of the perceived volume of each unit sound data may be calculated over time.

In step S920, a volume correction curve including the adjustment to be performed for each unit sound data may be obtained. In an embodiment, the volume correction curve may include information on whether to decrease or increase the volume of the unit sound data at a specific time within the entire running time of the video. For example, when it is desired to keep the volume of a sound constant over the entire running time of the video, the volume correction curve may be calculated as the difference between the volume curve and a preset output volume value.

In step S930, the volume of each unit sound data may be individually adjusted over time based on the volume correction curve.
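Steps S910 to S930 can be sketched as three small Python functions. The sketch assumes that the volume curve is the RMS level per fixed-size window expressed in decibels, that the correction is the difference between a target level and that curve, and that the boost is capped to avoid amplifying near-silent windows. The function names, the window-based RMS measure, and the `max_boost_db` cap are assumptions for illustration only.

```python
import math


def volume_curve_db(samples, window, floor_db=-120.0):
    """S910 (sketch): RMS level of each window in dB relative to full scale."""
    curve = []
    for start in range(0, len(samples), window):
        block = samples[start:start + window]
        rms = math.sqrt(sum(value * value for value in block) / len(block))
        curve.append(20.0 * math.log10(rms) if rms > 0 else floor_db)
    return curve


def correction_curve_db(curve, target_db, max_boost_db=20.0):
    """S920 (sketch): difference between the target level and the volume curve."""
    return [min(target_db - level, max_boost_db) for level in curve]


def apply_correction(samples, correction, window):
    """S930 (sketch): scale each sample by the gain of the window it falls in."""
    return [
        value * 10 ** (correction[index // window] / 20.0)
        for index, value in enumerate(samples)
    ]


# A source that is half as loud in the second window.
samples = [0.5, -0.5, 0.5, -0.5, 0.25, -0.25, 0.25, -0.25]
curve = volume_curve_db(samples, 4)
correction = correction_curve_db(curve, target_db=curve[0])
print(apply_correction(samples, correction, 4))
```

Applying a separate gain per window can cause audible steps at window boundaries; smoothing or interpolating the correction curve would be a natural refinement.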

FIGS. 10A, 10B, and 10C are diagrams illustrating an example in which the device 1000 acquires multi-channel output sound according to an embodiment of the present disclosure.

Referring to FIG. 10A , according to an embodiment, the device 1000 may capture an image including two sound sources SS101 and SS102.

Referring to FIG. 10B, the recorded input sound is mono audio, and when it is reproduced as is without post-processing, the input sounds IA101 and IA102 generated from the two sound sources may be reproduced simultaneously through both the left channel LC and the right channel RC. In this case, the two sound sources SS101 and SS102 may be perceived as being in the same place. As such, with mono audio having a single channel, the user cannot recognize the directions of the two sound sources SS101 and SS102.

Referring to FIG. 10C, when the method for improving the sound quality of a video according to an embodiment of the present disclosure is applied to the recorded input sounds IA101 and IA102, the device 1000 may separate the sound into unit sound data for each of the sound sources SS101 and SS102, and may render the separated unit sound data to the left channel (LC) or the right channel (RC) according to the on-screen position of each sound source image. For example, unit sound data corresponding to the sound source SS101 located on the left side of the screen may be rendered to the left channel (LC), and unit sound data corresponding to the sound source SS102 located on the right side of the screen may be rendered to the right channel (RC). Accordingly, the output sound may be implemented as multi-channel audio having two channels LC and RC.

FIG. 11 is a diagram illustrating an example in which the device 1000 individually adjusts the volume of unit sound data according to an embodiment of the present disclosure.

In an embodiment, a person holding the device 1000 and recording may be the sound source SS111 that directly generates a sound. The person holding the device 1000 and recording may or may not appear on the screen.

In the operation of separating the sound into unit sound data, when the sound source image of the sound source SS111 is present on the screen, the sound-image matching model may be used. When no sound source image exists because the sound source SS111 does not appear on the screen, the unit sound data A2 corresponding to the other sound source SS112 displayed on the screen may be separated, and the remaining sound data may be determined to be the unit sound data A1 corresponding to the sound source SS111.
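The residual determination described above may be sketched as follows. This is an illustration only: it assumes that the recorded sound is an additive mixture of time-aligned samples, and the function name `residual_unit_sound` is hypothetical.

```python
def residual_unit_sound(mixed, separated):
    """Return what remains of `mixed` after removing `separated`.

    Assumption: the mixture is a sample-wise sum of the unit sound data,
    so the off-screen source (A1) is the mixture minus the on-screen
    source (A2).
    """
    if len(mixed) != len(separated):
        raise ValueError("sample sequences must have the same length")
    return [m - s for m, s in zip(mixed, separated)]


# Hypothetical samples: recorded mixture and separated unit sound data A2.
a1 = residual_unit_sound([0.5, 0.0, -0.25], [0.25, 0.5, -0.25])
```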

Referring to (a) of FIG. 11, in the input sound, the unit sound data A1 generated by the sound source SS111 located close to the device 1000 may be louder than the unit sound data A2 generated by the sound source SS112 located far from the device 1000.

Referring to (b) of FIG. 11, in order to improve the sound quality of the image, the device 1000 may adjust the volumes of A1 and A2 to the same level by decreasing the volume of the unit sound data A1 and increasing the volume of the unit sound data A2. When the volumes of the unit sound data A1 and A2 are adjusted to the same level, the overall sound volume of the image may be kept constant, so that the sound quality of the image may be improved.
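One possible way to bring the unit sound data A1 and A2 to the same level is sketched below. The use of root-mean-square (RMS) level as the measure of volume, the choice of the mean level as the common target, and the names `rms` and `match_levels` are assumptions of this example and are not taken from the present disclosure.

```python
import math


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def match_levels(unit_sounds, target_rms=None):
    """Scale each unit sound so that all share the same RMS level.

    Assumption: when no target is given, the mean of the measured levels
    is used, so louder sources are attenuated and quieter ones amplified.
    """
    levels = [rms(samples) for samples in unit_sounds]
    if target_rms is None:
        target_rms = sum(levels) / len(levels)
    adjusted = []
    for samples, level in zip(unit_sounds, levels):
        gain = target_rms / level if level > 0 else 1.0
        adjusted.append([gain * v for v in samples])
    return adjusted


# Hypothetical data: A1 recorded close to the device, A2 recorded far away.
a1 = [0.8, -0.8, 0.8, -0.8]
a2 = [0.2, -0.2, 0.2, -0.2]
out_a1, out_a2 = match_levels([a1, a2])
```

With these inputs, A1 is attenuated and A2 is amplified until both have the same RMS level, corresponding to the adjustment described for (b) of FIG. 11.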

FIG. 12 is a diagram illustrating an example in which the device 1000 individually adjusts the volume of unit sound data according to the tracked motion of the sound source according to an embodiment of the present disclosure.

In an embodiment, the subject (sound source) SS120 being captured by the device 1000 may be moving while generating a sound. For example, the subject may have an initial position SS120i and a final position SS120f. In an embodiment, the subject may move in a direction away from the device 1000. In this case, the initial position SS120i of the subject may be relatively close to the device 1000, and the final position SS120f of the subject may be relatively far from the device 1000.

Referring to (a) of FIG. 12, the volume of the initial input sound Ai generated at the initial position SS120i of the subject may be high, and the volume may decrease as the sound source moves away from the device 1000. The volume of the final input sound Af generated at the final position SS120f of the subject may be relatively low.

Referring to (b) of FIG. 12, in an embodiment, in order to improve the sound quality of the image, the device 1000 may obtain a volume correction curve including information on how to adjust the volume over time, such as decreasing the volume of the initial input sound Ai and increasing the volume of the final input sound Af. The device 1000 may adjust the volume of the sound by using the obtained volume correction curve, and may thereby maintain the volume of the output sound at the same level throughout the entire running time of the image.
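A volume correction curve of this kind may be sketched as a per-frame gain, as shown below. The frame-wise RMS measurement, the fixed target level, the `floor` guard against near-silent frames, and all function names are assumptions made for illustration; the present disclosure does not specify how the curve is computed.

```python
import math


def volume_curve(samples, frame):
    """Measured volume (RMS) of each consecutive frame of `frame` samples."""
    curve = []
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        curve.append(math.sqrt(sum(v * v for v in chunk) / len(chunk)))
    return curve


def correction_curve(curve, target, floor=1e-6):
    """Gain per frame that brings the measured volume to `target`.

    `floor` is an assumed guard that avoids dividing by zero on silence.
    """
    return [target / max(level, floor) for level in curve]


def apply_correction(samples, gains, frame):
    """Apply the per-frame gains to the samples."""
    return [v * gains[i // frame] for i, v in enumerate(samples)]


# Hypothetical input whose volume falls as the source moves away.
samples = [0.8, -0.8, 0.4, -0.4, 0.2, -0.2]
gains = correction_curve(volume_curve(samples, 2), target=0.4)
output = apply_correction(samples, gains, 2)
```

In this example the gains decrease the loud initial frames and increase the quiet final frames. A practical implementation would likely interpolate the gains between frames to avoid audible steps, which this sketch omits.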

FIG. 13 is a diagram illustrating an example in which the device 1000 adjusts the volume of unit sound data according to the tracked motion of the sound source, and obtains a multi-channel output sound from the adjusted unit sound data, according to an embodiment of the present disclosure.

In an embodiment, the subject (sound source) SS130 being captured by the device 1000 may move relative to the device 1000 while generating a sound. At the initial time Ti, the subject may be located far from the device 1000 on the right side, and may move to the left and closer to the device 1000 by the final time Tf. In this case, the initial position SS130i of the subject may be relatively far from the device 1000, and the final position SS130f of the subject may be relatively close to the device 1000.

Referring to (a) of FIG. 13, at the initial time Ti, the volume of the initial input sound Ai generated from the initial position SS130i may be low. As the sound source SS130 approaches the device 1000, the volume increases. Referring to (c) of FIG. 13, at the final time Tf, the volume of the final input sound Af generated from the final position SS130f may be relatively high.

In an embodiment, in order to improve the sound quality of the image, the device 1000 may obtain a volume correction curve including information on how to adjust the volume over time, such as increasing the volume of the initial input sound Ai and decreasing the volume of the final input sound Af. The device 1000 may adjust the volume of the sound by using the obtained volume correction curve, and may thereby maintain the volume of the output sound at the same level throughout the entire running time of the image.

In order to obtain more realistic sound quality, the device 1000 may render the output sound as multi-channel audio according to the location of the sound source. For example, near the initial time Ti, at which the sound source is at the position SS130i on the right side of the screen, the output sound may be rendered by increasing the volume of the right channel RCi. Referring to (b) of FIG. 13, at the initial time Ti, the volume of the right channel RCi may be adjusted to be high and the volume of the left channel LCi may be adjusted to be low.

Referring to (d) of FIG. 13, near the final time Tf, at which the sound source is at the position SS130f on the left side of the screen, the output sound may be rendered by reducing the volume of the right channel RCf. For example, at the final time Tf, the volume of the right channel RCf may be adjusted to be low and the volume of the left channel LCf may be adjusted to be high in the output sound.
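The time-varying channel rendering described for FIG. 13 may be illustrated by the following sketch, in which the channel gains follow the tracked horizontal position of the sound source. The linear interpolation of the position between its initial and final values, the constant-power pan law, and the name `moving_pan` are assumptions of this example only.

```python
import math


def moving_pan(samples, x_start, x_end):
    """Render mono samples to (left, right) while the source moves.

    Positions are normalized: 0.0 is the left edge of the screen and 1.0
    the right edge. Assumption: the tracked position changes linearly from
    `x_start` to `x_end` over the duration of `samples`.
    """
    n = len(samples)
    left, right = [], []
    for i, value in enumerate(samples):
        t = i / (n - 1) if n > 1 else 0.0
        x = x_start + (x_end - x_start) * t
        angle = x * math.pi / 2
        left.append(math.cos(angle) * value)
        right.append(math.sin(angle) * value)
    return left, right


# Hypothetical source moving from the right edge (Ti) to the left edge (Tf).
left, right = moving_pan([1.0] * 5, x_start=1.0, x_end=0.0)
```

At the start of the output the right channel dominates, and at the end the left channel dominates, in line with (b) and (d) of FIG. 13.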

FIG. 14 is a diagram illustrating an example in which the device 1000 acquires an additional sound through the auxiliary input unit 2200 and obtains a multi-channel output sound according to an embodiment of the present disclosure.

In an embodiment, the device 1000 may acquire a sound including the sound A1 acquired directly through the input unit and the sound A2 acquired through the auxiliary input unit 2200 external to the device 1000. The auxiliary input unit 2200 may be, for example, a wearable device including a microphone.

Referring to (a) of FIG. 14, when the sound source SS140 is located far from the device 1000, the sound A1 acquired directly through the input unit of the device 1000 may have a low volume and a low signal-to-noise ratio. On the other hand, since the auxiliary input unit 2200 is always located close to the sound source SS140, the sound A2 acquired through the auxiliary input unit 2200 may be loud and clear and may have a high signal-to-noise ratio.

Signal-to-Noise Ratio (SNR) is the ratio of signal strength to noise strength. In an embodiment, in terms of a signal-to-noise ratio, a signal may mean valid acoustic data. A higher signal-to-noise ratio means less noise.
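For illustration, the signal-to-noise ratio is commonly expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below follows that convention; the function name `snr_db` and the assumption that separate signal and noise sample sequences are available are made for this example and are not part of the present disclosure.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from mean sample power."""
    p_signal = sum(v * v for v in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    if p_noise == 0:
        return math.inf  # no measurable noise
    return 10 * math.log10(p_signal / p_noise)


# Hypothetical samples: a signal with ten times the noise amplitude.
value = snr_db([1.0, -1.0], [0.1, -0.1])
```

A tenfold amplitude ratio corresponds to a hundredfold power ratio, that is, about 20 dB.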

In an embodiment, in order to improve the sound quality of the image, the device 1000 may use the sound A2 acquired through the auxiliary input unit 2200 to reduce noise in the sound and to adjust the volume of the output sound to a preset level.

In order to obtain more realistic sound quality, the device 1000 may render the output sound as multi-channel audio according to the location of the sound source SS140. For example, referring to FIG. 14, the sound source SS140 may be located on the right side of the screen. Referring to (b) of FIG. 14, in this case, the output sound may be rendered by adjusting the volume of the left channel LC to be low and the volume of the right channel RC to be high.

According to an embodiment of the present disclosure, by separating a sound source image representing at least one sound source from the images of a video, and separating the sound of the video into unit sound data according to whether the sound is generated from the same sound source, sound can be processed regardless of the number of channels of the input sound. In addition, by rendering the separated unit sound data as a single channel or as multiple channels, the number of channels of the output sound can be adjusted regardless of the number of channels of the input sound. According to an embodiment of the present disclosure, the sound quality of the output image may be improved by matching each separated sound source image with its unit sound data and adjusting the loudness of each unit sound data.

In addition, in an embodiment of the present disclosure, an image is captured through an input unit included in the mobile device, and a processor included in the mobile device automatically performs sound processing on the captured image, so that no sound equipment is required and the user does not need to perform post-processing manually.

An embodiment of the present disclosure may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and include both volatile and nonvolatile media and both removable and non-removable media. Additionally, computer-readable media may include computer storage media and communication media. Computer storage media include both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Communication media may typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal.

In addition, the computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not contain a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where data is stored temporarily. For example, the 'non-transitory storage medium' may include a buffer in which data is temporarily stored.

According to an embodiment, the method according to various embodiments disclosed in this document may be included and provided in a computer program product. Computer program products may be traded between sellers and buyers as commodities. The computer program product may be distributed in the form of a device-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) online through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in, or temporarily created on, a machine-readable storage medium such as a memory of a manufacturer's server, a server of an application store, or a relay server.

Also, in this specification, “unit” may be a hardware component such as a processor or circuit, and/or a software component executed by a hardware component such as a processor.

Functions related to artificial intelligence according to the present disclosure are operated through a processor and a memory. The processor may consist of one or a plurality of processors. In this case, the one or more processors may be a general-purpose processor such as a CPU, an application processor (AP), or a digital signal processor (DSP); a graphics-dedicated processor such as a GPU or a vision processing unit (VPU); or an artificial-intelligence-dedicated processor such as an NPU. The one or more processors control the processing of input data according to a predefined operation rule or artificial intelligence model stored in the memory. Alternatively, when the one or more processors are AI-dedicated processors, they may be designed with a hardware structure specialized for processing a specific AI model.

A predefined operation rule or artificial intelligence model is characterized in that it is created through learning. Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm using a plurality of pieces of training data, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created. Such learning may be performed in the device itself on which the artificial intelligence according to the present disclosure is performed, or may be performed through a separate server and/or system. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between the operation result of a previous layer and the plurality of weights. The plurality of weights of the plurality of neural network layers may be optimized by the learning result of the artificial intelligence model. For example, the plurality of weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the learning process is reduced or minimized. The artificial neural network may include a deep neural network (DNN), for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), or a deep Q-network, but is not limited thereto.
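The idea of updating weights so that a loss value is reduced may be illustrated with a deliberately small example. The sketch below trains a single-weight linear model by gradient descent on a mean-squared-error loss. The model, the loss, the learning rate, and the name `train_linear` are assumptions chosen for brevity and do not represent the sound-image matching model of the present disclosure.

```python
def train_linear(xs, ys, lr=0.1, steps=200):
    """Fit y ~ w * x by gradient descent on the mean squared error.

    Returns the learned weight and the loss recorded before each update.
    """
    w = 0.0
    losses = []
    n = len(xs)
    for _ in range(steps):
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        losses.append(loss)
        w -= lr * grad  # move the weight against the gradient
    return w, losses


# Hypothetical training data generated by y = 2 * x.
weight, losses = train_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Each update moves the weight in the direction that lowers the loss, so the recorded loss shrinks as the weight approaches the value that generated the data.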

The artificial intelligence model can be created through learning. Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm using a plurality of pieces of training data, so that a predefined operation rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created. The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between the operation result of a previous layer and the plurality of weights.

The above description of the present disclosure is for illustration, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can be easily modified into other specific forms without changing the technical spirit or essential features of the present disclosure. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed form, and likewise components described as distributed may be implemented in a combined form.

The scope of the present disclosure is indicated by the following claims rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalent concepts should be interpreted as being included in the scope of the present disclosure.

Claims

A method by which a device improves the sound quality of an image, the method comprising:

acquiring an image;

acquiring a sound and an image from the acquired image;

obtaining a sound source image representing at least one sound source from the obtained image;

obtaining at least one unit sound data corresponding to the at least one sound source from the obtained sound;

matching each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model;

tracking the movement of the at least one sound source from the sound source image; and

individually adjusting the loudness of the unit sound data according to the movement of the tracked sound source,

wherein the sound-image matching model includes matching information between an image of a specific sound source and a sound generated by the specific sound source.
The method of claim 1,

wherein the acquiring of the image comprises acquiring the image through an input unit included in the device, and

wherein the input unit includes a microphone for acquiring a sound and a camera for acquiring an image.
The method of claim 1,

wherein the obtaining of the at least one unit sound data corresponding to the at least one sound source from the obtained sound comprises:

separating the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum; and

when two or more unit sound data having the same amplitude, frequency, phase, waveform, and spectrum exist, separating the two or more unit sound data into respective unit sound data by using the sound source image.
The method of claim 1,

wherein the matching of each of the at least one sound source image and the at least one unit sound data by applying the preset sound-image matching model

comprises matching each of the at least one sound source image and the at least one unit sound data by using information obtained from the sound source image.
The method of claim 1,

wherein the tracking of the movement of the at least one sound source from the sound source image comprises tracking the movement of the sound source through a state change of the sound source image.
The method of claim 1,

wherein the individually adjusting of the volume of the unit sound data according to the movement of the tracked sound source comprises:

obtaining a volume curve of each unit sound data over its total execution time;

obtaining a volume correction curve including adjustment information to be applied to each unit sound data; and

individually adjusting the volume of each unit sound data based on the volume correction curve.
The method of claim 1, further comprising:

classifying and rendering the unit sound data whose volume is individually adjusted into two or more channels, to obtain an output sound having multiple channels; and

obtaining an output image from the output sound and the image.
A device for improving the sound quality of a video, the device comprising:

an input unit for acquiring an image;

an output unit for outputting an output image;

a memory storing a program including one or more instructions; and

at least one processor configured to execute the one or more instructions stored in the memory,

wherein the at least one processor is configured to:

acquire an image by controlling the input unit;

acquire a sound and an image from the acquired image;

obtain a sound source image representing at least one sound source from the obtained image;

obtain at least one unit sound data corresponding to the at least one sound source from the acquired sound;

match each of the at least one sound source image and the at least one unit sound data by applying a preset sound-image matching model;

track the movement of the at least one sound source from the sound source image; and

individually adjust the volume (loudness) of the unit sound data according to the movement of the tracked sound source,

wherein the sound-image matching model includes matching information between an image of a specific sound source and a sound generated by the specific sound source.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to acquire an additional sound through an auxiliary microphone external to the device.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to obtain the at least one unit sound data corresponding to the at least one sound source from the acquired sound by:

separating the sound into at least one unit sound data according to amplitude, frequency, phase, waveform, and spectrum; and

when two or more unit sound data having the same amplitude, frequency, phase, waveform, and spectrum exist, separating the two or more unit sound data into respective unit sound data by using the sound source image.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to match each of the at least one sound source image and the at least one unit sound data by applying the preset sound-image matching model and by additionally using information obtained from the sound source image.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to track the movement of the sound source through a state change of the sound source image.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to individually adjust the volume of the unit sound data according to the movement of the tracked sound source by:

obtaining a volume curve of each unit sound data over its total running time;

obtaining a volume correction curve including adjustment information to be applied to each unit sound data; and

individually adjusting the volume of each unit sound data based on the volume correction curve.
The device of claim 8,

wherein the at least one processor is further configured to execute the one or more instructions to:

classify and render the unit sound data whose volume is individually adjusted into two or more channels, to obtain an output sound having multiple channels; and

obtain an output image from the output sound and the image.
A computer-readable recording medium in which a program for executing the method of any one of claims 1 to 7 on a computer is recorded.