KR20240030483A

KR20240030483A - Electronic device performing noise cancellation using a plurality of artificial intelligence modules

Info

Publication number: KR20240030483A
Application number: KR1020220109589A
Authority: KR
Inventors: 홍경찬
Original assignee: 주식회사 이엠텍
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2024-03-07

Abstract

An electronic device according to an embodiment, which performs noise removal using a plurality of artificial intelligence modules, includes: an input unit that applies an acoustic input signal to a processor; a first artificial intelligence module that receives the acoustic input signal from the processor in units of frames, infers a first mask for removing noise from the acoustic input signal frame, and applies the inferred first mask to the processor; a second artificial intelligence module that receives the acoustic input signal from the processor in units of frames, infers a second mask for removing noise from the acoustic input signal frame, and applies the inferred second mask to the processor; and the processor, which applies the acoustic input signal from the input unit to each of the first and second artificial intelligence modules in units of frames, receives the first and second masks from the first and second artificial intelligence modules, respectively, multiplies the acoustic input signal frame by each of the received first and second masks to generate an acoustic input signal frame to which the first mask is applied and an acoustic input signal frame to which the second mask is applied, and, according to the magnitudes of a first signal-to-noise ratio of the acoustic input signal frame to which the first mask is applied and a second signal-to-noise ratio of the acoustic input signal frame to which the second mask is applied, applies the first mask, the second mask, or the arithmetic mean of the first and second masks as a final mask for the acoustic input signal frame.

Description

Electronic device that performs noise removal using a plurality of artificial intelligence modules {ELECTRONIC DEVICE PERFORMING NOISE CANCELLATION USING A PLURALITY OF ARTIFICIAL INTELLIGENCE MODULES}

The embodiment relates to an electronic device, and more particularly to an electronic device that performs noise removal using a plurality of artificial intelligence modules, in which a plurality of different artificial intelligence modules, each trained through machine learning to remove noise according to the characteristics of the sound, are used to remove noise more effectively from sound containing both voice and noise.

Recently, noise reduction (suppression) technology based on machine learning or deep learning has become widespread as a way to cancel noise that interferes with conversation while the user wears a wired or wireless device in the form of earphones or headphones. Such methods typically infer values such as a mask for the input signal along the time axis, the frequency axis, or the like, so that noise is removed while the voice signal is preserved.

Conventional noise removal modes, however, do not remove noise effectively in acoustic environments where voice and noise coexist.

An object of the embodiment is to provide an electronic device that performs noise removal using a plurality of artificial intelligence modules, in which a plurality of different artificial intelligence modules, each trained through machine learning to remove noise according to the characteristics of the sound, are used to remove noise more effectively from sound containing both voice and noise.

An electronic device according to an embodiment, which performs noise removal using a plurality of artificial intelligence modules, includes: an input unit that applies an acoustic input signal to a processor; a first artificial intelligence module that performs machine learning to remove noise from a composite signal of voice and noise whose signal-to-noise ratio is greater than 0, receives the acoustic input signal from the processor in units of frames, infers a first mask for removing noise from the acoustic input signal frame, and applies the inferred first mask to the processor; a second artificial intelligence module that performs machine learning to remove noise from a composite signal of voice and noise whose signal-to-noise ratio is less than 0, receives the acoustic input signal from the processor in units of frames, infers a second mask for removing noise from the acoustic input signal frame, and applies the inferred second mask to the processor; and the processor, which applies the acoustic input signal from the input unit to each of the first and second artificial intelligence modules in units of frames, receives the first and second masks from the first and second artificial intelligence modules, respectively, multiplies the acoustic input signal frame by each of the received first and second masks to generate an acoustic input signal frame to which the first mask is applied and an acoustic input signal frame to which the second mask is applied, and, according to the magnitudes of a first signal-to-noise ratio of the acoustic input signal frame to which the first mask is applied and a second signal-to-noise ratio of the acoustic input signal frame to which the second mask is applied, applies the first mask, the second mask, or the arithmetic mean of the first and second masks as a final mask for the acoustic input signal frame.

In addition, when the first signal-to-noise ratio and the second signal-to-noise ratio are both greater than 0, the processor preferably applies the first mask as the final mask for the acoustic input signal frame.

In addition, when the first signal-to-noise ratio and the second signal-to-noise ratio are both less than 0, the processor preferably applies the second mask as the final mask for the acoustic input signal frame.

In addition, when the first signal-to-noise ratio is greater than 0 and the second signal-to-noise ratio is less than 0, or when the first signal-to-noise ratio is less than 0 and the second signal-to-noise ratio is greater than 0, the processor preferably applies the arithmetic mean of the first and second masks as the final mask for the acoustic input signal frame.

In addition, the electronic device preferably includes a third artificial intelligence module that performs machine learning to determine the type of acoustic environment from an acoustic signal, receives the acoustic input signal frame from the processor, infers the type of acoustic environment, and applies the inferred type of acoustic environment to the processor.

In addition, the processor preferably applies the acoustic input signal frame to the third artificial intelligence module, receives the inferred type of acoustic environment from the third artificial intelligence module, and, when the inferred type of acoustic environment is conversation, applies the acoustic input signal frame to each of the first and second artificial intelligence modules.

The embodiment uses a plurality of different artificial intelligence modules, each trained through machine learning to remove noise according to the characteristics of the sound, and determines the mask to be applied from the SNRs of the sound to which each of the plurality of masks has been applied. As a result, noise can be removed more effectively from sound containing both voice and noise.

FIG. 1 is a control configuration diagram of an electronic device that performs noise removal using a plurality of artificial intelligence modules according to an embodiment.
FIG. 2 is a schematic diagram showing the processing performed in the electronic device.

Hereinafter, embodiments are described in detail with reference to the drawings. This is not intended to limit the disclosure to specific embodiments, and the described embodiments should be understood to include various modifications, equivalents, and/or alternatives of those embodiments. In the description of the drawings, similar reference numerals may be used for similar components.

In this document, expressions such as "have," "may have," "include," or "may include" indicate the presence of the corresponding feature (e.g., a numerical value, function, operation, or component such as a part) and do not exclude the presence of additional features.

In this document, expressions such as "A or B," "at least one of A or/and B," or "one or more of A or/and B" may include all possible combinations of the items listed together. For example, "A or B," "at least one of A and B," or "at least one of A or B" may refer to any of the cases (1) including at least one A, (2) including at least one B, or (3) including both at least one A and at least one B.

Expressions such as "first" and "second" used in this document may modify various components regardless of order and/or importance, and are used only to distinguish one component from another without limiting the components. For example, a first user device and a second user device may denote different user devices regardless of order or importance. For example, a first component may be renamed a second component, and similarly a second component may be renamed a first component, without departing from the scope of rights described in this document.

When a component (e.g., a first component) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another component (e.g., a second component), it should be understood that the component may be connected to the other component directly or through yet another component (e.g., a third component). In contrast, when a component (e.g., a first component) is referred to as being "directly coupled" or "directly connected" to another component (e.g., a second component), it may be understood that no other component (e.g., a third component) exists between them.

The expression "configured to" used in this document may, depending on the situation, be used interchangeably with, for example, "suitable for," "having the capacity to," "designed to," "adapted to," "made to," or "capable of." The term "configured to" does not necessarily mean only "specifically designed to" in hardware. Instead, in some situations, the expression "a device configured to" may mean that the device is "capable of" operating together with other devices or components. For example, the phrase "a processor configured to perform A, B, and C" may mean a dedicated processor (e.g., an embedded processor) for performing the corresponding operations, or a generic-purpose processor (e.g., a CPU or an application processor) capable of performing the corresponding operations by executing one or more software programs stored in a memory device.

The terms used in this document are used only to describe specific embodiments and may not be intended to limit the scope of other embodiments. Singular expressions may include plural expressions unless the context clearly indicates otherwise. Terms used herein, including technical or scientific terms, may have the same meaning as commonly understood by a person of ordinary skill in the technical field described in this document. Among the terms used in this document, terms defined in general dictionaries may be interpreted as having the same or a similar meaning to the meaning they have in the context of the related art, and, unless clearly defined in this document, are not interpreted in an ideal or excessively formal sense. In some cases, even terms defined in this document cannot be interpreted to exclude the embodiments of this document.

FIG. 1 is a control configuration diagram of an electronic device that performs noise removal using a plurality of artificial intelligence modules according to an embodiment, and FIG. 2 is a schematic diagram showing the processing performed in the electronic device.

The electronic device (10) includes: an input unit (1) that acquires sound, generates an acoustic input signal, and applies it to a processor (9); an output unit (2) that receives an acoustic output signal from the processor (9) and emits it as sound or transmits it; a communication unit (3) that communicates with an electronic communication device (for example, a smartphone or a tablet); a first artificial intelligence module (4) that performs machine learning based on artificial intelligence, receives a frequency-domain acoustic input signal from the processor (9), infers the type of acoustic environment of the acoustic input signal, and applies the inferred type to the processor (9); second and third artificial intelligence modules (5), (6) that perform machine learning based on artificial intelligence, receive the frequency-domain acoustic input signal from the processor (9), infer first and second masks or mask values, respectively, for removing noise contained in the acoustic input signal, and apply the inferred first and second masks or mask values to the processor (9); and the processor (9), which controls the above components, communicates with the electronic communication device through the communication unit (3) to perform a sound reproduction function, a telephone call function, and a hearing assistance function (for example, a hearing aid function), applies the frequency-domain acoustic input signal to the first artificial intelligence module (4) in units of frames, and, depending on the type of acoustic environment received from the first artificial intelligence module (4), performs a noise removal mode using the second and third artificial intelligence modules (5), (6). In the noise removal mode, the processor (9) uses either one of the first and second masks from the second and third artificial intelligence modules (5), (6), or the arithmetic mean of the first and second masks, as a final mask, multiplies the frequency-domain acoustic input signal by the final mask to generate a noise-removed frequency-domain acoustic input signal, converts the generated frequency-domain acoustic input signal into a time-domain acoustic input signal, and applies it to the output unit (2) as the acoustic output signal. The power supply unit (not shown) and the communication unit (3) are technologies well known to those familiar with the technical field to which the present invention pertains, and their detailed description is therefore omitted.

The input unit (1) may be implemented as a microphone that picks up noise and/or sound or voice, or as a communication unit that receives a time-domain audio input signal from an external device.

The output unit (2) may be implemented as a speaker that emits the audio output signal as sound, or as the communication unit (3) that transmits or applies the audio output signal to an external device.

The first to third artificial intelligence modules (4 to 6) may be implemented, for example, as algorithms that perform deep learning or machine learning using an RNN, LSTM, CNN, or DNN model, or as executors that carry out the computation of such an algorithm and output the result. The first to third artificial intelligence modules (4 to 6) receive from the processor (9) a frequency-domain acoustic input signal that includes the per-frequency magnitude and phase of each frame; that is, the input value of the first to third artificial intelligence modules (4 to 6) is the frequency-domain acoustic input signal. In the embodiment, the input value is mainly the per-frequency magnitude of the acoustic input signal.

First, the first artificial intelligence module (4) is an artificial intelligence model that takes as its input value the frequency-domain acoustic input signal (acoustic input signal frame) applied in units of frames from the processor (9), determines the type of acoustic environment of the acoustic input signal (for example, speech (SPEECH), warning (WARNING), and others), and outputs the type of acoustic environment to the processor (9) as its output value. For the first artificial intelligence module (4), the train data are sound source files corresponding to each acoustic environment, and the target data include labels indicating which acoustic environment each train data file belongs to. The first artificial intelligence module (4) has been trained in advance on these train data and target data.

Each of the second and third artificial intelligence modules (5), (6) takes as its input signal the frequency-domain acoustic input signal (acoustic input signal frame) applied in units of frames from the processor (9), is given the noise-removed voice signal as the ground truth, and passes the feature values of the frequency-domain acoustic input signal through a sigmoid function to distinguish noise from voice, thereby inferring a per-frequency (or per-frequency-band) mask for removing the noise and outputting it to the processor (9) as its output value.
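As a rough illustration of the sigmoid step described above, the following Python sketch maps one raw score per frequency bin to a mask value between 0 and 1. The function names, and the idea that the network emits one raw score (logit) per bin, are assumptions made for illustration; the source does not describe the internals of the modules.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, always in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def logits_to_mask(logits):
    """Map hypothetical per-frequency network outputs to mask values.

    Large positive scores give values near 1 (keep the bin),
    large negative scores give values near 0 (suppress the bin).
    """
    return [sigmoid(z) for z in logits]


mask = logits_to_mask([-4.0, 0.0, 4.0])
```

Because the sigmoid is bounded, every mask value lands between 0 and 1 without any extra clipping.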

In this embodiment, a noisy signal is a signal containing both clean sound (voice) and noise, a noise signal is a signal containing only noise, and a clean signal is a signal containing only clean sound (voice) without noise. A noisy signal is produced by combining a noise signal with a clean signal.
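The relationship between these three signals can be sketched as follows. This is a minimal illustration that assumes the signal-to-noise ratio is expressed in decibels as a ratio of average powers and that the noisy signal is a sample-wise sum; the source states neither the unit nor the mixing procedure, and the function names are hypothetical.

```python
import math


def power(signal):
    """Average power of a real-valued signal."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(clean, noise):
    """SNR in dB (assumed definition: 10*log10 of the power ratio)."""
    return 10.0 * math.log10(power(clean) / power(noise))


def mix_at_snr(clean, noise, target_snr_db):
    """Scale the noise so that clean + noise has the requested SNR.

    Returns the noisy signal and the scaled noise signal.
    """
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled_noise)]
    return noisy, scaled_noise


clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
noisy, scaled_noise = mix_at_snr(clean, noise, target_snr_db=5.0)
```

Under these assumptions, a positive target gives a noisy signal in which the voice dominates, and a negative target gives one in which the noise dominates.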

The second artificial intelligence module (5) learns noisy signals synthesized from clean signals and noise signals, using noisy signals whose signal-to-noise ratio (SNR) is greater than 0 as train data, and infers a mask that removes the noise and preserves the voice, which is the clean signal. The third artificial intelligence module (6), on the other hand, learns noisy signals synthesized from clean signals and noise signals, using noisy signals whose signal-to-noise ratio (SNR) is less than 0 as train data, and infers a mask that removes the noise and preserves the voice, which is the clean signal. The mask inference process itself is a technique well known to a person skilled in the art to which the embodiment pertains, and its detailed description is therefore omitted. A mask value of '0' means that the magnitude at the corresponding frequency should be removed (sound containing noise or very noisy voice), and a mask value of '1' means that the magnitude at the corresponding frequency should be kept (voice, or sound containing voice with little noise). The masks inferred by the second and third artificial intelligence modules (5), (6) lie in the range from 0 to 1 inclusive.
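The division of train data between the two mask modules can be pictured with the short Python sketch below. The tuple layout `(noisy, clean, snr)` and the function name are hypothetical, and the handling of examples whose SNR is exactly 0 is an assumption (they are left out of both sets), since the source only mentions "greater than 0" and "less than 0".

```python
def split_by_snr(examples):
    """Partition training examples by the sign of their SNR.

    examples: list of (noisy_signal, clean_signal, snr) tuples (hypothetical layout).
    Returns (positive_set, negative_set); examples with snr == 0 go to neither.
    """
    positive_set = [ex for ex in examples if ex[2] > 0]
    negative_set = [ex for ex in examples if ex[2] < 0]
    return positive_set, negative_set


examples = [
    ("noisy_a", "clean_a", 7.5),
    ("noisy_b", "clean_b", -3.0),
    ("noisy_c", "clean_c", 0.0),
]
for_module_5, for_module_6 = split_by_snr(examples)
```

Here `for_module_5` would feed the module trained on SNR greater than 0 and `for_module_6` the module trained on SNR less than 0.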

If a single artificial intelligence module is trained on both noisy signals whose SNR is greater than 0 and noisy signals whose SNR is less than 0, the train data become too diverse, and considerable confusion arises when the module tries to refine a clean signal from such sound sources. For this reason, the present embodiment uses the second and third artificial intelligence modules (5), (6) separately.

The processor (9) is an electronic and/or electrical circuit device that has a fast Fourier transform (FFT) function and an inverse fast Fourier transform (IFFT) function, as well as arithmetic functions (for example, multiplication and the Hadamard product, i.e., element-wise multiplication) and a storage function (for example, a memory).

The processor (9) performs an FFT on the acoustic input signal applied from the input unit (1) in units of frames of a preset size, generates frequency-domain acoustic input signal frames, and continuously applies them to the first artificial intelligence module (4). In this embodiment, the frequency-domain acoustic input signal frames generated by the processor (9) are referred to, in the order in which they are generated, as the first frame (Frame 1), the second frame (Frame 2), the third frame (Frame 3), ..., and the n-th frame (Frame n).
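The framing and transform step can be sketched in Python as follows. A naive discrete Fourier transform stands in for the FFT so that the example needs only the standard library (both produce the same spectrum). Non-overlapping, unwindowed frames and the dropping of trailing samples are assumptions; the source only says that frames have a preset size. The function names are hypothetical.

```python
import cmath


def dft(frame):
    """Naive DFT of one time-domain frame (same result as an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def split_frames(signal, frame_size):
    """Cut the signal into consecutive frames; leftover samples are dropped."""
    last_start = len(signal) - frame_size
    return [signal[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def to_frequency_frames(signal, frame_size):
    """Return Frame 1, Frame 2, ... as lists of complex frequency bins."""
    return [dft(frame) for frame in split_frames(signal, frame_size)]


signal = [1.0, 0.0, -1.0, 0.0] * 2   # two periods of a cosine, four samples each
frames = to_frequency_frames(signal, frame_size=4)
```

Each complex bin carries both the magnitude (`abs(bin)`) and the phase (`cmath.phase(bin)`) that the modules are described as receiving.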

First, the processor (9) applies the first frame (Frame 1) to the first artificial intelligence module (4). The first artificial intelligence module (4) infers the type of acoustic environment from the applied first frame (Frame 1) and applies the inferred type of acoustic environment to the processor (9).

The processor (9) determines, according to the inferred type of acoustic environment, whether to apply the first frame (Frame 1) to the second and third artificial intelligence modules (5), (6). If the inferred type of acoustic environment is conversation, the processor (9) applies the first frame (Frame 1) to each of the second and third artificial intelligence modules (5), (6) in order to perform the noise removal mode. If the inferred type of acoustic environment is not conversation, the processor (9) does not apply the first frame to the second and third artificial intelligence modules (5), (6), so that a mode other than the noise removal mode can be performed.
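The gating decision described above can be sketched as follows. The three callables stand in for the first, second, and third artificial intelligence modules and are placeholders, and the label string `"conversation"` is a hypothetical encoding of the conversation environment type; the source does not specify how the type is represented.

```python
CONVERSATION = "conversation"  # hypothetical label for the conversation environment


def route_frame(frame, classify_environment, infer_mask1, infer_mask2):
    """Send the frame to the two mask modules only in a conversation environment.

    Returns (environment, masks), where masks is (mask1, mask2) when the
    noise removal mode runs and None when another mode should handle the frame.
    """
    environment = classify_environment(frame)
    if environment != CONVERSATION:
        return environment, None
    return environment, (infer_mask1(frame), infer_mask2(frame))


# Usage with stub modules in place of trained models.
env, masks = route_frame(
    [0.2, 0.8],
    classify_environment=lambda f: CONVERSATION,
    infer_mask1=lambda f: [0.9, 0.1],
    infer_mask2=lambda f: [0.7, 0.3],
)
```

Skipping both mask modules for non-conversation frames avoids running two inferences whose results would not be used.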

The second and third artificial intelligence modules (5), (6) infer the first and second masks (mask1), (mask2), respectively, from the first frame (Frame 1), and apply the inferred first and second masks (mask1), (mask2) to the processor (9).

As shown in FIG. 2, the processor 9 receives the first and second masks (mask1) and (mask2) and multiplies each of them with the first frame (Frame 1) to generate a first frame (Mask1_1) to which the first mask (mask1) is applied and a first frame (Mask2_1) to which the second mask (mask2) is applied. The processor 9 then calculates the first signal-to-noise ratio (SNR1) of the first frame (Mask1_1) and the second signal-to-noise ratio (SNR2) of the first frame (Mask2_1), and compares each of them with a reference value of 0. If both the first signal-to-noise ratio (SNR1) and the second signal-to-noise ratio (SNR2) are greater than 0, the processor 9 determines that the first frame (Frame 1) is a frame whose signal-to-noise ratio (SNR) is greater than 0 and uses the first mask (mask1) to remove noise from the first frame (Frame 1); that is, it applies the first mask (mask1) from the second artificial intelligence module 5 as the final mask for the first frame (Frame 1). This is the case of the first frame (Frame 1) in FIG. 2: the processor 9 performs an IFFT on the first frame (Mask1_1) to generate an acoustic input signal frame in the time domain, which it stores or applies to the output unit 2 as an acoustic output signal.
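
The masking and inverse-transform steps described above can be illustrated with a minimal Python sketch. It assumes that a frame is a list of complex frequency-domain bins and that a mask holds one real gain per bin. The function names `apply_mask` and `inverse_dft` are hypothetical, and the naive inverse DFT is only a stand-in for whatever optimized inverse transform the device actually uses.

```python
import cmath
from typing import List, Sequence


def apply_mask(frame: Sequence[complex], mask: Sequence[float]) -> List[complex]:
    """Multiply each frequency bin of the frame by the matching mask gain."""
    if len(frame) != len(mask):
        raise ValueError("frame and mask must have the same length")
    return [x * m for x, m in zip(frame, mask)]


def inverse_dft(spectrum: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) inverse DFT (assumption: stands in for an optimized inverse FFT)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


if __name__ == "__main__":
    frame_1 = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]   # hypothetical frequency-domain frame
    mask_1 = [1.0, 1.0, 1.0, 1.0]                # hypothetical mask from one module
    mask1_1 = apply_mask(frame_1, mask_1)        # frame with the mask applied
    time_frame = inverse_dft(mask1_1)            # time-domain frame
    print([round(abs(v), 6) for v in time_frame])
```

How the signal-to-noise ratio of each masked frame is computed is not part of this sketch.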

If both the first signal-to-noise ratio (SNR1) and the second signal-to-noise ratio (SNR2) are less than 0, the processor 9 determines that the first frame (Frame 1) is a frame whose signal-to-noise ratio (SNR) is less than 0 and uses the second mask (mask2) to remove noise from the first frame (Frame 1); that is, it applies the second mask (mask2) from the third artificial intelligence module 6 as the final mask for the first frame (Frame 1).

In the remaining cases, that is, when the first signal-to-noise ratio (SNR1) is greater than 0 and the second signal-to-noise ratio (SNR2) is less than 0, or when the first signal-to-noise ratio (SNR1) is less than 0 and the second signal-to-noise ratio (SNR2) is greater than 0, the processor 9 applies the arithmetic mean of the first and second masks (mask1) and (mask2) as the final mask for the first frame (Frame 1), and multiplies the first frame (Frame 1) by the final mask to generate the first frame (Frame 1) from which noise has been removed.
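
The three selection cases can be summarized in a short Python sketch. It takes SNR1 and SNR2 as already-computed inputs and represents each mask as a list of per-bin gains. The function name `select_final_mask` is hypothetical. The description does not say what happens when an SNR is exactly 0, so the sketch assumes such frames fall into the averaging branch.

```python
from typing import List, Sequence


def select_final_mask(
    mask1: Sequence[float],
    mask2: Sequence[float],
    snr1: float,
    snr2: float,
) -> List[float]:
    """Choose the final mask for one frame from the signs of SNR1 and SNR2."""
    if len(mask1) != len(mask2):
        raise ValueError("mask1 and mask2 must have the same length")
    if snr1 > 0 and snr2 > 0:
        # Both masked frames have SNR above the reference value 0: use mask1.
        return list(mask1)
    if snr1 < 0 and snr2 < 0:
        # Both masked frames have SNR below the reference value 0: use mask2.
        return list(mask2)
    # Mixed signs: arithmetic mean of the two masks.
    # Assumption: an SNR of exactly 0 is also handled here.
    return [(a + b) / 2.0 for a, b in zip(mask1, mask2)]


if __name__ == "__main__":
    m1 = [1.0, 0.5]   # hypothetical mask1 values
    m2 = [0.0, 0.5]   # hypothetical mask2 values
    print(select_final_mask(m1, m2, snr1=3.0, snr2=1.0))
    print(select_final_mask(m1, m2, snr1=-3.0, snr2=-1.0))
    print(select_final_mask(m1, m2, snr1=2.0, snr2=-1.0))
```

The returned mask is then multiplied with the frame to obtain the noise-removed frame.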

In FIG. 2, for the second frame (Frame 2), the processor 9 applies the second mask (mask2) of the third artificial intelligence module 6 as the final mask, performs an IFFT on the second frame (Mask2_2) to generate an acoustic input signal frame in the time domain, and stores it or applies it to the output unit 2. The same applies to the third frame (Frame 3).

In FIG. 2, the n-th frame (Frame n) is handled in the same way as the first frame (Frame 1).

In FIG. 2, the frames determined as final frames are marked with red squares for ease of understanding.

The frames described above are acoustic input signals in the frequency domain, but the noise removal mode may also be performed by applying the acoustic input signal in the time domain on a frame-by-frame basis.

In this embodiment, the electronic device includes earphones, a headset, and the like.

At least a portion of a device (e.g., a processor or functions thereof) or a method (e.g., operations) according to various embodiments may be implemented as instructions stored in a computer-readable storage medium, for example in the form of a program module. When the instructions are executed by one or more processors, the one or more processors may perform the functions corresponding to the instructions. The computer-readable storage medium may be, for example, a memory.

Computer-readable recording media may include hard disks, floppy disks, magnetic media (e.g., magnetic tape), optical media (e.g., CD-ROM, DVD (Digital Versatile Disc)), magneto-optical media (e.g., floptical disks), hardware devices (e.g., ROM, RAM, or flash memory), and the like. In addition, program instructions may include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of various embodiments, and vice versa.

A processor, or the functions performed by a processor, according to various embodiments may include at least one of the components described above, may omit some of them, or may further include additional components. Operations performed by modules, program modules, or other components according to various embodiments may be executed sequentially, in parallel, iteratively, or heuristically. In addition, some operations may be executed in a different order or omitted, or other operations may be added.

As described above, the present invention is not limited to the specific preferred embodiments described above. Anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the claims, and such modifications fall within the scope of the claims.

1: Input unit 2: Output unit
3: Communication unit 4: First artificial intelligence module
5: Second artificial intelligence module 6: Third artificial intelligence module
9: Processor

Claims

An electronic device that performs noise removal using a plurality of artificial intelligence modules, the electronic device comprising:
an input unit that applies an acoustic input signal to a processor;
a first artificial intelligence module that performs machine learning to remove noise from a composite signal of voice and noise with a signal-to-noise ratio greater than 0, receives the acoustic input signal from the processor on a frame-by-frame basis, infers a first mask for removing noise from the acoustic input signal frame, and applies the inferred first mask to the processor;
a second artificial intelligence module that performs machine learning to remove noise from a composite signal of voice and noise with a signal-to-noise ratio less than 0, receives the acoustic input signal from the processor on a frame-by-frame basis, infers a second mask for removing noise from the acoustic input signal frame, and applies the inferred second mask to the processor; and
the processor, which applies the acoustic input signal from the input unit to each of the first and second artificial intelligence modules on a frame-by-frame basis, receives the first and second masks from the first and second artificial intelligence modules, respectively, multiplies the acoustic input signal frame by each of the received first and second masks to generate an acoustic input signal frame to which the first mask is applied and an acoustic input signal frame to which the second mask is applied, and, depending on the magnitudes of a first signal-to-noise ratio of the acoustic input signal frame to which the first mask is applied and a second signal-to-noise ratio of the acoustic input signal frame to which the second mask is applied, applies the first mask, the second mask, or the arithmetic mean of the first and second masks as a final mask for the acoustic input signal frame.

The electronic device according to claim 1,
wherein, when the first signal-to-noise ratio and the second signal-to-noise ratio are both greater than 0, the processor applies the first mask as the final mask for the acoustic input signal frame.

The electronic device according to claim 1,
wherein, when the first signal-to-noise ratio and the second signal-to-noise ratio are both less than 0, the processor applies the second mask as the final mask for the acoustic input signal frame.

The electronic device according to claim 1,
wherein, when the first signal-to-noise ratio is greater than 0 and the second signal-to-noise ratio is less than 0, or when the first signal-to-noise ratio is less than 0 and the second signal-to-noise ratio is greater than 0, the processor applies the arithmetic mean of the first and second masks as the final mask for the acoustic input signal frame.

The electronic device according to claim 1,
further comprising a third artificial intelligence module that performs machine learning to determine the type of acoustic environment from an acoustic signal, receives the acoustic input signal frame from the processor, infers the type of acoustic environment, and applies the inferred type of acoustic environment to the processor.

The electronic device according to claim 5,
wherein the processor applies the acoustic input signal frame to the third artificial intelligence module, receives the inferred type of acoustic environment from the third artificial intelligence module, and, if the inferred type of acoustic environment is conversation, applies the acoustic input signal frame to each of the first and second artificial intelligence modules.