KR20240059350A - Method and electronic device for processing voice signal de-identification

Info

Publication number: KR20240059350A
Application number: KR1020220140511A
Authority: KR
Inventors: 드미트로 프로고노프; 올렉산드라 소콜; 헤오르히 나우멘코; 비아체슬라프 데르카흐; 코스티안틴 볼로비에프
Original assignee: 삼성전자주식회사
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2024-05-07
Also published as: WO2024090711A1

Abstract

A voice signal processing method is provided. The method includes: obtaining an input voice signal; generating an output voice signal in which personal characteristics of the input voice signal are filtered, by inputting the input voice signal into an artificial intelligence model; generating encoded voice information by encoding the output voice signal; and transmitting the encoded voice information to an external electronic device.

Description

Voice signal de-identification processing method and electronic device thereof {METHOD AND ELECTRONIC DEVICE FOR PROCESSING VOICE SIGNAL DE-IDENTIFICATION}

The present disclosure relates to a voice signal de-identification processing method and an electronic device therefor. More specifically, the present disclosure relates to a processing method, and an electronic device therefor, for generating an output voice signal in which personal characteristics included in an input voice signal are filtered.

Techniques for encoding and decoding voice are used in voice signal communication. These techniques are developing toward sending more information per unit time through high compression rates, or toward improving sound quality.

With advances in artificial intelligence technology, deep voice technology that imitates a person's voice has emerged. Recently, it has become possible to imitate a person's voice from only a small number of voice samples. Such a small number of voice samples can be produced by a short phone call. Therefore, security technology for the user's voice may be needed to prevent voice samples from being collected by unspecified persons.

According to one aspect of the present disclosure, a voice signal processing method is provided. The voice signal processing method may include obtaining an input voice signal. The voice signal processing method may include generating an output voice signal in which personal characteristics of the input voice signal are filtered, by inputting the input voice signal into an artificial intelligence model. The voice signal processing method may include generating encoded voice information by encoding the output voice signal. The voice signal processing method may include transmitting the encoded voice information to an external electronic device (540). The artificial intelligence model may be trained so that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.

According to one aspect of the present disclosure, an electronic device for voice signal processing is provided. The electronic device may include a memory that stores at least one instruction and at least one processor that operates according to the at least one instruction. By executing the at least one instruction, the at least one processor may obtain an input voice signal. By executing the at least one instruction, the at least one processor may generate an output voice signal in which personal characteristics of the input voice signal are filtered, by inputting the input voice signal into an artificial intelligence model. By executing the at least one instruction, the at least one processor may generate encoded voice information by encoding the output voice signal. By executing the at least one instruction, the at least one processor may transmit the encoded voice information to an external electronic device. The artificial intelligence model may be trained so that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.

According to one aspect of the present disclosure, a computer-readable recording medium is provided on which instructions for causing a computer to perform the voice signal processing method are recorded.

FIG. 1 is a diagram for explaining a voice signal processing method according to an embodiment of the present disclosure.
FIG. 2 is a diagram for explaining personal characteristics included in a voice signal, according to an embodiment of the present disclosure.
FIG. 3 is a diagram for explaining a process in which a voice signal is transmitted, according to an embodiment of the present disclosure.
FIG. 4 is a diagram for explaining a process of generating a restored voice signal from an input voice signal, according to an embodiment of the present disclosure.
FIG. 5 is a flowchart of a voice signal processing method according to an embodiment of the present disclosure.
FIG. 6A is a diagram for explaining a method of training an artificial intelligence model, according to an embodiment of the present disclosure.
FIG. 6B is a diagram for explaining a method of training an artificial intelligence model, according to an embodiment of the present disclosure.
FIG. 7 is a flowchart for explaining a process of removing a user's personal information, according to an embodiment of the present disclosure.
FIG. 8 is a diagram for explaining a process of adding emotion information to a de-identified voice signal, according to an embodiment of the present disclosure.
FIG. 9 is a flowchart for explaining a process of modifying a voice signal based on the type of communication channel, according to an embodiment of the present disclosure.
FIG. 10 is a diagram for explaining a method by which a user sets voice signal de-identification during a call, according to an embodiment of the present disclosure.
FIG. 11 is a diagram for explaining a method by which a call recipient sets voice signal de-identification, according to an embodiment of the present disclosure.
FIG. 12 is a flowchart of a voice signal processing method according to an embodiment of the present disclosure.
FIG. 13 is a diagram for explaining a security process using a voice signal processing method, according to an embodiment of the present disclosure.
FIG. 14 is a diagram for explaining a security process using a voice signal processing method, according to an embodiment of the present disclosure.
FIG. 15 is a block diagram of an electronic device for processing a voice signal, according to an embodiment of the present disclosure.
FIG. 16 is a block diagram of an electronic device for processing a voice signal, according to an embodiment of the present disclosure.

Since the present disclosure may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the embodiments of the present disclosure, and the present disclosure should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the various embodiments.

In describing the embodiments, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the present disclosure unnecessarily. In addition, ordinal numbers used in the description (for example, first, second, and so on) are merely identifiers for distinguishing one component from another.

In addition, in this specification, when one component is described as being "connected" or "coupled" to another component, the one component may be directly connected or coupled to the other component, but, unless specifically stated otherwise, it should be understood that the two may also be connected or coupled through yet another component in between.

In addition, for components expressed in this specification as a "unit", a "module", or the like, two or more components may be combined into one component, or one component may be divided into two or more components by more finely divided function. Each of the components described below may, in addition to its own main function, additionally perform some or all of the functions of other components, and some of the main functions of each component may of course be performed exclusively by another component.

In a voice signal processing method of an electronic device according to an embodiment of the present disclosure, the user's voice may be recognized and the user's intention interpreted as follows: a voice signal, which is an analog signal, is received through a microphone, and the speech portion is converted into computer-readable text using an automatic speech recognition (ASR) model. The converted text may then be interpreted using a natural language understanding (NLU) model to obtain the user's utterance intention. Here, the ASR model or the NLU model may be an artificial intelligence model. An artificial intelligence model may be processed by a dedicated artificial intelligence processor designed with a hardware structure specialized for processing artificial intelligence models. An artificial intelligence model may be created through training. Here, being created through training means that a basic artificial intelligence model is trained by a learning algorithm on a large amount of training data, thereby producing a predefined operation rule or an artificial intelligence model configured to perform a desired characteristic (or purpose). An artificial intelligence model may consist of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and performs a neural network operation through an operation between the operation result of the previous layer and the plurality of weights.
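
The layer operation described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the fully connected layout, the ReLU activation, and the names `dense_layer` and `forward` are assumptions chosen only to show how each layer combines the previous layer's result with its own weights.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense_layer(prev_output: Vector, weights: Matrix) -> Vector:
    """Combine the previous layer's result with this layer's weights.

    Each row of `weights` belongs to one unit of the layer. ReLU is an
    assumed activation; the disclosure does not name one.
    """
    return [
        max(0.0, sum(w * x for w, x in zip(row, prev_output)))
        for row in weights
    ]


def forward(inputs: Vector, layers: List[Matrix]) -> Vector:
    """Feed the input through every layer in order."""
    result = inputs
    for weights in layers:
        result = dense_layer(result, weights)
    return result


# Two hypothetical layers: 2 inputs -> 2 units -> 1 unit.
layers = [
    [[1.0, 0.0], [0.0, -1.0]],
    [[1.0, 1.0]],
]
output = forward([1.0, 2.0], layers)
```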

Linguistic understanding is a technology for recognizing and applying/processing human language and characters, and includes natural language processing, machine translation, dialog systems, question answering, and speech recognition/synthesis.

In addition, in this specification, an "input voice signal" may mean a user voice signal obtained through an input device (for example, a microphone). An "input voice signal" may mean a voice signal that has not been de-identified. The input voice signal may be an analog signal or a digital signal. In this specification, "original voice signal" may have the same meaning as "input voice signal".

In addition, in this specification, an "output voice signal" may mean a voice signal generated by an artificial intelligence model that takes an input voice signal as its input. An "output voice signal" may mean a voice signal that has been de-identified. The output voice signal may be an analog signal or a digital signal.

In addition, in this specification, "voice signal de-identification" may mean a process of removing personal characteristics included in a voice signal. According to an embodiment of the present disclosure, the personal characteristics may include at least one of the user's intonation, the user's pronunciation, or the user's tone.

In addition, in this specification, a "personal characteristic" may mean a voice feature that distinguishes one user from another. For example, the personal characteristics may include at least one of the user's intonation, the user's pronunciation, or the user's tone. In this specification, "personal characteristic" may also be used with the same meaning as "personal data".

FIG. 1 is a diagram for explaining a voice signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 1, the second user 140 may hear a de-identified voice signal of the first user 110.

The first electronic device 120 may obtain an original voice signal of the first user 110. For example, the first electronic device 120 may obtain the original voice signal of the first user 110 through a microphone of the first electronic device 120. As another example, the first electronic device 120 may obtain the original voice signal by receiving information about an original voice recorded externally.

The first electronic device 120 may de-identify the original voice signal. The original voice contains the intonation, pronunciation, or tone of the first user 110. The first electronic device 120 may generate a de-identified voice signal through de-identification that removes the personal characteristics included in the original voice signal. The first electronic device 120 may transmit information about the de-identified voice signal to the second electronic device 130.

The second electronic device 130 may output the received de-identified voice signal to the second user 140 through a speaker. The second user 140 does not hear the voice of the first user 110 as it is, but instead hears a de-identified voice signal that has been processed so that the characteristics of the first user 110 are removed from the voice of the first user 110.

FIG. 2 is a diagram for explaining personal characteristics included in a voice signal, according to an embodiment of the present disclosure.

Referring to FIG. 2, the first user 210 may imitate the voice of the second user 220 by using personal characteristics included in a voice signal.

The first user 210 may obtain a voice signal sample of the second user 220 by placing a phone call to the second user 220. For example, the first user 210 may ask, "Hello, is this Steve?" and receive the answer "Yes, it is. Who is this?" from the second user 220. The voice signal sample of the second user 220 may include personal characteristics of the second user 220. For example, the personal characteristics of the second user 220 may include the intonation, pronunciation, and tone of the second user 220.

The first user 210 may imitate the voice of the second user 220 by training a deep voice model, which imitates voice signals, on the voice signal sample containing the personal characteristics of the second user 220. For example, the first user 210 may imitate the voice of the second user 220 to talk with a third user 230, who is a friend of the second user 220. For example, using the imitated voice of the second user 220, the first user 210 may say, "Hi, it's Steve. I got a new phone. Can you tell me the server password?" Because the third user 230 receives a voice signal having the personal characteristics of the second user 220, the third user 230 may conclude that the other party on the call is the second user 220.

FIG. 3 is a diagram for explaining a process in which a voice signal is transmitted, according to an embodiment of the present disclosure.

Referring to FIG. 3, the user 310 may give a voice command to the electronic device 320 (for example, a smart hub). For example, the user 310 may ask the electronic device 320 to carry out a command such as "Save an appointment with Steve for 6 o'clock tomorrow."

To carry out the command of the user 310, the electronic device 320 may transmit a voice signal containing the command of the user 310 to the server 330. An attacker 340 may obtain the voice signal containing the command of the user 310 that is delivered to the server 330. For example, the attacker 340 may hack the server 330, or may obtain the information through smishing while it is being communicated to the server 330.

The voice signal obtained by the attacker 340 may include personal characteristics of the user 310. The attacker 340 may imitate the voice of the user 310 by using the obtained voice signal.

A method by which the voice signal processing method according to an embodiment of the present disclosure removes personal characteristics from a voice signal is described in more detail with reference to FIGS. 4 to 12.

FIG. 4 is a diagram for explaining a process of generating a restored voice signal from an input voice signal, according to an embodiment of the present disclosure.

Referring to FIG. 4, a restored voice signal may be generated by encoding and decoding an input voice signal. The input voice signal may be a voice signal containing the user's voice. Specifically, the input voice signal may be a voice signal containing the user's personal characteristics.

According to an embodiment of the present disclosure, the operations of the voice signal encoder 410 and the voice signal decoder 440 may be performed by the electronic devices 1500 and 1600 described later. The voice signal encoder 410 and the voice signal decoder 440 may run on different electronic devices 1500 and 1600, or on a single electronic device 1500 or 1600.

The voice signal encoder 410 may include a voice feature de-identification unit 420 and a voice feature encoder 430. The voice signal encoder 410 may obtain an input voice signal and generate encoded information (a bitstream) corresponding to the input voice signal.

The voice feature de-identification unit 420 may perform a de-identification process that removes personal characteristics included in the input voice signal. According to an embodiment of the present disclosure, the voice feature de-identification unit 420 may be an artificial intelligence model trained to remove the personal characteristics of the input voice signal. A method of training the artificial intelligence model according to an embodiment of the present disclosure is described in more detail with reference to FIGS. 6A and 6B. The voice feature de-identification unit 420 may modify the input voice signal so that it resembles a statistics-based average voice signal. The statistics-based average voice signal may be a generic voice signal generated from voice signal samples of a plurality of users.
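
The idea of moving an input toward a statistics-based average voice can be sketched in Python as follows. This is only an illustration under strong assumptions: the disclosure uses a trained artificial intelligence model, whereas the sketch represents a voice as a short feature vector and uses plain linear interpolation. The names `average_voice`, `move_toward_average`, and `strength` are hypothetical.

```python
from typing import List

Features = List[float]


def average_voice(samples: List[Features]) -> Features:
    """Element-wise mean of feature vectors collected from several users."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


def move_toward_average(
    features: Features, average: Features, strength: float
) -> Features:
    """Linearly interpolate toward the average voice (assumed stand-in).

    strength = 0.0 leaves the input unchanged; strength = 1.0 replaces it
    with the average.
    """
    return [f + strength * (a - f) for f, a in zip(features, average)]


# Hypothetical feature vectors from two users.
user_samples = [[1.0, 3.0], [3.0, 5.0]]
average = average_voice(user_samples)
shifted = move_toward_average([0.0, 0.0], average, strength=0.5)
```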

The voice feature de-identification unit 420 may generate an output voice signal by de-identifying the input voice signal. The output voice signal may be a voice signal from which the personal characteristics of the input voice signal have been removed and which contains the same conversation content.

The voice feature encoder 430 may generate encoded voice information (a bitstream) by encoding the output voice signal. The voice feature encoder 430 according to an embodiment of the present disclosure may extract voice features and perform encoding based on the extracted voice features.

The voice feature encoder 430 may also generate encoded voice information by encoding the input voice signal itself. For example, based on a user input indicating that the voice signal de-identification process is not to be performed, the voice feature encoder 430 may generate encoded voice information by encoding the input voice signal.

The voice signal encoder 410 may deliver the bitstream to the voice signal decoder 440. For example, an electronic device including the voice signal encoder 410 may transmit bitstream information through its communication unit to an electronic device including the voice signal decoder 440.

The voice signal decoder 440 may generate a restored voice signal based on the obtained bitstream. The voice signal decoder 440 may include a generative model 450. The generative model 450 may be an artificial intelligence model trained to generate a restored voice signal based on the bitstream. The generative model 450 may be trained so that the restored voice signal contains the same content as the input voice signal. For example, the restored voice signal may be a voice that conveys the same content as the input voice signal but from which the personal characteristics included in the input voice signal have been removed.
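
The data flow of FIG. 4 can be sketched in Python as shown below. Every function is a simple stand-in, not the disclosed implementation: 8-bit quantization takes the place of the voice feature encoder 430, dequantization takes the place of the generative model 450, and `toy_deidentify` is a hypothetical placeholder transform that does not actually remove personal characteristics. The `deidentify_enabled` flag is an assumed way to model the user input that skips de-identification.

```python
from typing import Callable, List

Signal = List[float]


def encode_features(signal: Signal) -> bytes:
    """Stand-in for encoder 430: 8-bit quantization of samples in [-1, 1]."""
    clipped = (max(-1.0, min(1.0, s)) for s in signal)
    return bytes(round((s + 1.0) * 127.5) for s in clipped)


def decode_bitstream(bitstream: bytes) -> Signal:
    """Stand-in for generative model 450: map each byte back to [-1, 1]."""
    return [b / 127.5 - 1.0 for b in bitstream]


def voice_signal_encoder(
    signal: Signal,
    deidentify: Callable[[Signal], Signal],
    deidentify_enabled: bool = True,
) -> bytes:
    """Stand-in for encoder 410: optionally de-identify, then encode."""
    processed = deidentify(signal) if deidentify_enabled else signal
    return encode_features(processed)


def toy_deidentify(signal: Signal) -> Signal:
    """Hypothetical placeholder for unit 420; it only halves the amplitude."""
    return [0.5 * s for s in signal]


input_signal = [0.0, 0.5, -1.0, 1.0]
bitstream = voice_signal_encoder(input_signal, toy_deidentify)
restored = decode_bitstream(bitstream)
raw_bitstream = voice_signal_encoder(
    input_signal, toy_deidentify, deidentify_enabled=False
)
```

In this sketch the receiving side only ever sees `bitstream`, so whatever the de-identification step removed cannot be recovered by the decoder.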

FIG. 5 is a flowchart of a voice signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 5, the voice signal processing method may begin at step 510. According to an embodiment of the present disclosure, the voice signal processing method may be performed by the electronic devices 1500 and 1600.

In step 510, the voice signal processing method may include obtaining an input voice signal. The input voice signal may be obtained through a microphone of the electronic device or by receiving it from another electronic device.

In step 520, the voice signal processing method may include generating an output voice signal in which personal characteristics of the input voice signal are filtered, by inputting the input voice signal into an artificial intelligence model. According to an embodiment of the present disclosure, the artificial intelligence model may be trained so that a restored voice signal generated based on the encoded voice information does not include the personal characteristics. For example, the personal characteristics may include at least one of the user's intonation, the user's pronunciation, or the user's tone.

According to an embodiment of the present disclosure, the voice signal processing method may further include obtaining information about the similarity between the input voice signal and a target voice signal. Here, the artificial intelligence model may be the one, among a plurality of artificial intelligence models, that corresponds to the similarity. The target voice signal may mean the statistics-based average voice signal.
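
One way to picture the similarity-based model choice is the Python sketch below. The disclosure does not say how similarity is measured or how it maps to a model, so cosine similarity over feature vectors, the threshold table `MODELS`, and the model names are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_model(similarity: float, models: List[Tuple[float, str]]) -> str:
    """Return the first model whose minimum similarity is met.

    `models` must be sorted by descending threshold.
    """
    for min_similarity, name in models:
        if similarity >= min_similarity:
            return name
    raise ValueError("no model covers this similarity")


# Hypothetical table: the closer the input already is to the target voice,
# the lighter the assumed filtering model.
MODELS = [(0.9, "light_filter"), (0.5, "medium_filter"), (-1.0, "strong_filter")]

target = [1.0, 0.0]  # stand-in for the statistics-based average voice
chosen_close = select_model(cosine_similarity([1.0, 0.0], target), MODELS)
chosen_far = select_model(cosine_similarity([0.0, 1.0], target), MODELS)
```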

In step 530, the voice signal processing method may include generating encoded voice information by encoding the output voice signal.

In step 540, the voice signal processing method may include transmitting the encoded voice information to an external electronic device. According to an embodiment of the present disclosure, the voice signal processing method may further include transmitting, to the external electronic device, information indicating that the personal characteristics have been filtered from the encoded voice information.
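
Steps 510 to 540 can be summarized as a short Python sketch. The callables passed in are hypothetical stand-ins for the microphone, the artificial intelligence model, the encoder, and the communication unit, and the `EncodedVoice` container with its `personal_characteristics_filtered` field is an assumed way to carry the indication that personal characteristics were filtered.

```python
from dataclasses import dataclass
from typing import Callable, List

Signal = List[float]


@dataclass
class EncodedVoice:
    payload: bytes
    # Assumed flag telling the receiver that de-identification was applied.
    personal_characteristics_filtered: bool


def process_voice_signal(
    acquire: Callable[[], Signal],
    ai_model: Callable[[Signal], Signal],
    encode: Callable[[Signal], bytes],
    transmit: Callable[[EncodedVoice], None],
) -> EncodedVoice:
    input_signal = acquire()                # step 510: obtain the input
    output_signal = ai_model(input_signal)  # step 520: filter characteristics
    payload = encode(output_signal)         # step 530: encode the output
    message = EncodedVoice(payload, True)
    transmit(message)                       # step 540: send to the other device
    return message


# Stub components, for illustration only.
sent: List[EncodedVoice] = []
result = process_voice_signal(
    acquire=lambda: [0.1, 0.2],
    ai_model=lambda signal: [0.0 for _ in signal],
    encode=lambda signal: bytes(len(signal)),
    transmit=sent.append,
)
```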

FIG. 6A is a diagram for explaining a method of training an artificial intelligence model, according to an embodiment of the present disclosure.

Referring to FIG. 6A, the voice feature de-identification unit 622a and the generative model 635a are artificial intelligence models and may be trained based on the input voice signal and the restored voice signal.

According to an embodiment of the present disclosure, the voice signal encoder 620a, the voice feature de-identification unit 622a, the voice feature encoder 624a, the voice signal decoder 630a, and the generation model 635a may correspond to the voice signal encoder 410, the voice feature de-identification unit 420, the voice feature encoder 430, the voice signal decoder 440, and the generation model 450 of FIG. 4, respectively.

According to an embodiment of the present disclosure, the input voice signal 610a may be obtained from a predetermined data set. The voice feature de-identification unit 622a may generate an output voice signal by de-identifying the input voice signal 610a. The voice feature encoder 624a may encode the output voice signal. The generation model 635a may generate a restored voice signal 640a from the encoded output voice signal.

The loss estimator 650a may estimate a loss based on the input voice signal 610a and the restored voice signal 640a. For example, the loss estimator 650a may calculate the loss using the loss function L = L1 + R. Here, L1 is a loss estimated based on the restoration quality of the restored voice signal 640a relative to the input voice signal 610a, and R may be a regularization term. For example, R may be either L1 regularization or L2 regularization.

The loss estimator 650a may update the parameters of the voice feature de-identification unit 622a and the generation model 635a so that L is minimized. By repeatedly updating the parameters, the voice feature de-identification unit 622a and the generation model 635a may be trained so that the restored voice signal 640a becomes similar in quality to the input voice signal 610a.
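The loss L = L1 + R of FIG. 6A can be illustrated numerically. The sketch below is only an example: the disclosure does not state how restoration quality is measured, so mean squared error is used here as an assumed stand-in for L1, and the `weight` parameter of the regularizer is likewise an assumption.

```python
def reconstruction_loss(input_signal, restored_signal):
    """Assumed stand-in for L1: mean squared error between the two signals."""
    n = len(input_signal)
    return sum((x - y) ** 2 for x, y in zip(input_signal, restored_signal)) / n


def regularizer(params, kind="l2", weight=0.01):
    """R: weighted L1 or L2 penalty on the trainable parameters."""
    if kind == "l1":
        return weight * sum(abs(p) for p in params)
    if kind == "l2":
        return weight * sum(p * p for p in params)
    raise ValueError("kind must be 'l1' or 'l2'")


def total_loss(input_signal, restored_signal, params, kind="l2", weight=0.01):
    """L = L1 + R."""
    return reconstruction_loss(input_signal, restored_signal) + regularizer(params, kind, weight)
```

In a real training loop, an optimizer would repeatedly adjust `params` to reduce this value; that step is omitted here.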

The training method of FIG. 6A may be performed using a single electronic device that includes the voice signal encoder 620a, the voice signal decoder 630a, and the loss estimator 650a, or using different electronic devices that each include at least some of the voice signal encoder 620a, the voice signal decoder 630a, and the loss estimator 650a.

FIG. 6B is a diagram for explaining a method of training an artificial intelligence model according to an embodiment of the present disclosure.

Referring to FIG. 6B, the voice feature de-identification unit 622b is an artificial intelligence model and may be trained based on the input voice signal and the restored voice signal.

According to an embodiment of the present disclosure, the voice signal encoder 620b, the voice feature de-identification unit 622b, the voice feature encoder 624b, the voice signal decoder 630b, and the generation model 635b may correspond to the voice signal encoder 410, the voice feature de-identification unit 420, the voice feature encoder 430, the voice signal decoder 440, and the generation model 450 of FIG. 4, respectively.

According to an embodiment of the present disclosure, the voice feature de-identification unit 622b and the generation model 635b may be trained using the voice feature de-identification unit 622a and the generation model 635a that were trained by the training method of FIG. 6A.

The process of generating the restored voice signal 640b from the input voice signal 610b can be understood with reference to FIG. 6A. For convenience of explanation, overlapping content is omitted.

The de-identification quality estimator 660b may compare the input voice signal 610b with the restored voice signal 640b to estimate a loss related to de-identification quality. The more personal characteristics are removed, the smaller the loss; the fewer personal characteristics are removed, the larger the loss.

The loss estimator 650b may estimate a loss based on the input voice signal 610b and the restored voice signal 640b. For example, the loss estimator 650b may calculate the loss using the loss function L = L1 + L2 + R. Here, L1 is a loss estimated based on the restoration quality of the restored voice signal 640b relative to the input voice signal 610b, L2 is the loss estimated by the de-identification quality estimator 660b, and R may be a regularization term. For example, R may be either L1 regularization or L2 regularization.

The loss estimator 650b may update the parameters of the voice feature de-identification unit 622b so that L is minimized. By repeatedly updating the parameters, the voice feature de-identification unit 622b may be trained so that the restored voice signal 640b is similar in quality to the input voice signal 610b and is de-identified.
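One way to picture the additional term L2 of FIG. 6B is as a speaker-similarity penalty. The sketch below is an assumption, not the disclosed estimator: it supposes that speaker embeddings are available for the input and restored signals (how they are computed is not specified here) and maps their cosine similarity to a loss in the range 0 to 1, so that a restored voice that still sounds like the speaker is penalized.

```python
import math


def deidentification_loss(input_speaker_emb, restored_speaker_emb):
    """Assumed form of L2: 1.0 when the embeddings point the same way
    (no personal characteristics removed), 0.0 when they are opposite."""
    dot = sum(x * y for x, y in zip(input_speaker_emb, restored_speaker_emb))
    norm = math.sqrt(sum(x * x for x in input_speaker_emb)) * math.sqrt(
        sum(y * y for y in restored_speaker_emb)
    )
    if norm == 0.0:
        return 0.0
    cosine = dot / norm
    return (cosine + 1.0) / 2.0


def total_loss_fig_6b(restoration_loss, deid_loss, regularization):
    """L = L1 + L2 + R."""
    return restoration_loss + deid_loss + regularization
```

Minimizing the sum trades restoration quality (L1) against removal of personal characteristics (L2), which matches the behavior described for the loss estimator.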

The training method of FIG. 6B may be performed using a single electronic device that includes the voice signal encoder 620b, the voice signal decoder 630b, the loss estimator 650b, and the de-identification quality estimator 660b, or using different electronic devices that each include at least some of the voice signal encoder 620b, the voice signal decoder 630b, the loss estimator 650b, and the de-identification quality estimator 660b.

FIG. 7 is a flowchart illustrating a voice signal processing method for removing a user's personal information according to an embodiment of the present disclosure.

Referring to FIG. 7, the voice signal processing method may begin at step 710. According to an embodiment of the present disclosure, the voice signal processing method may be performed by the electronic devices 1500 and 1600.

In step 710, the voice signal processing method may include identifying the content of the user's conversation included in the input voice signal. According to an embodiment of the present disclosure, the conversation content may be identified using an Automatic Speech Recognition (ASR) model.

In step 720, the voice signal processing method may include identifying whether the user's conversation content includes personal information. If the conversation content includes personal information, the method proceeds to step 730. If it does not, the method ends.

In step 730, the voice signal processing method may include modifying the input voice signal so that the personal information included in the input voice signal is removed. For example, a conversation segment containing personal information may be silenced or replaced with other speech.
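Steps 720 and 730 can be sketched on top of an ASR result. The example assumes the ASR model returns words with start and end times in seconds; the word-tuple format and the `PHONE_LIKE` rule (any run of three or more digits) are hypothetical simplifications chosen for illustration.

```python
import re

# Hypothetical rule: a token made only of three or more digits is treated as personal information.
PHONE_LIKE = re.compile(r"^\d{3,}$")


def find_private_segments(words):
    """Step 720: return (start_s, end_s) for each word flagged as personal information.

    `words` is a list of (text, start_s, end_s) tuples, assumed to come from an ASR model.
    """
    return [(start, end) for text, start, end in words if PHONE_LIKE.match(text)]


def silence_segments(samples, sample_rate, segments):
    """Step 730: zero out the samples that fall inside the flagged segments."""
    out = list(samples)
    for start_s, end_s in segments:
        start = max(0, int(start_s * sample_rate))
        end = min(len(out), int(end_s * sample_rate))
        for i in range(start, end):
            out[i] = 0
    return out
```

Replacing a segment with other speech, the alternative mentioned above, would substitute synthesized samples instead of zeros.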

By modifying conversation content that includes personal information, the voice signal processing method according to an embodiment of the present disclosure can prevent the user's personal information from being leaked.

FIG. 8 is a diagram illustrating a process of adding emotional information to a de-identified voice signal according to an embodiment of the present disclosure.

Referring to FIG. 8, the first user 810 may generate a de-identified voice signal using the first electronic device 820. The first user 810 may deliver the de-identified voice signal to the second user 850 through communication between the first electronic device 820 and the second electronic device 840.

According to an embodiment of the present disclosure, the de-identified voice signal may be generated so that it carries no emotion. For example, the artificial intelligence model for de-identification may have been trained using, as training data, voice signal samples in which no emotion can be identified. The voice signal processing method may then modify the de-identified voice signal so that it includes an emotional state. The first electronic device 820 may receive a user input selecting one of a plurality of emotional states 830. The first electronic device 820 may modify the de-identified voice signal to correspond to the selected emotional state.

The voice signal processing method according to an embodiment of the present disclosure may perform de-identification using a plurality of artificial intelligence models corresponding to a plurality of emotional states. For example, each of the plurality of artificial intelligence models for de-identification may have been trained using, as training data, voice signal samples that include the corresponding emotional state. The first electronic device 820 may receive a user input selecting one of the plurality of emotional states 830. The first electronic device 820 may generate the de-identified voice signal using the artificial intelligence model corresponding to the selected emotional state.

For example, when the first user 810 selects 'happiness' from among the plurality of emotional states 830, a restored voice signal identified as being in a happy emotional state may be delivered to the second user 850.
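The per-emotion model selection of FIG. 8 amounts to a lookup keyed by the user's choice. The sketch below uses placeholder models that only tag their output; the `Emotion` values and the `make_model` helper are hypothetical and stand in for separately trained de-identification models.

```python
from enum import Enum


class Emotion(Enum):
    NEUTRAL = "neutral"
    HAPPY = "happy"
    SAD = "sad"


def make_model(emotion):
    """Placeholder for a de-identification model trained on samples of one emotional state."""
    def model(samples):
        return {"emotion": emotion.value, "samples": list(samples)}
    return model


# One model per selectable emotional state.
MODELS = {emotion: make_model(emotion) for emotion in Emotion}


def deidentify_with_emotion(samples, selected):
    """Use the model that corresponds to the emotional state the user selected."""
    return MODELS[selected](samples)
```

For instance, `deidentify_with_emotion(samples, Emotion.HAPPY)` routes the signal through the model associated with the 'happiness' selection.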

FIG. 9 is a flowchart illustrating a process of modifying a voice signal based on the type of communication channel according to an embodiment of the present disclosure.

Referring to FIG. 9, the voice signal processing method may begin at step 910. According to an embodiment of the present disclosure, the voice signal processing method may be performed by the electronic devices 1500 and 1600.

In step 910, the voice signal processing method may include identifying the channel through which the encoded voice information is transmitted. For example, communication over the channel may use at least one of wireless communication (e.g., cellular communication such as 3G, 4G, or 5G, long-range communication such as WiFi, or short-range communication such as BLE) or wired communication (e.g., USB or Ethernet).

In step 920, the voice signal processing method may include modifying the encoded voice signal based on the type of the channel. For example, the electronic device may modify the voice signal according to the type of each channel in order to prevent loss of the voice signal.
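The disclosure does not say how the signal is modified per channel. As one possible reading, the sketch below adds packet redundancy on channels assumed to be more lossy; the channel names and the repetition counts in `CHANNEL_REDUNDANCY` are purely illustrative assumptions.

```python
# Hypothetical table: how many copies of each packet to send on each channel type.
CHANNEL_REDUNDANCY = {
    "cellular": 2,
    "wifi": 1,
    "ble": 3,
    "ethernet": 1,
    "usb": 1,
}


def adapt_to_channel(packets, channel_type):
    """Steps 910 and 920: repeat each encoded packet according to the channel type.

    Unknown channel types fall back to sending each packet once.
    """
    copies = CHANNEL_REDUNDANCY.get(channel_type, 1)
    return [packet for packet in packets for _ in range(copies)]
```

Other modifications, such as changing the bit rate or adding error-correcting codes, would fit the same dispatch-by-channel structure.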

FIG. 10 is a diagram illustrating a method by which a user sets voice signal de-identification during a call, according to an embodiment of the present disclosure.

Referring to FIG. 10, the users 1010 and 1020 may decide during a call whether to enable de-identification. For example, each of the users 1010 and 1020 may select whether de-identification is performed on the voice signal delivered to the other party.

The electronic device may provide a call-related user interface 1030 during a call. For example, the user interface 1030 may include at least some of a call recording icon, a video call switching icon, a Bluetooth mode icon, a speakerphone icon, a mute icon, a keypad icon, an end call icon, or a de-identification mode icon 1040. The arrangement of the icons in the user interface 1030 is not limited and may be changed in various ways. The user interface 1030 may be displayed on all or part of the screen of the electronic device.

When a user 1010 or 1020 clicks the de-identification mode icon 1040, the currently set de-identification mode may be toggled. For example, if the de-identification mode of the electronic device is active, the user's click may deactivate it. Conversely, if the de-identification mode is inactive, the user's click may activate it. According to an embodiment of the present disclosure, the de-identification mode icon 1040 may be displayed in green when the de-identification mode is active and in white when it is inactive.

Whether the de-identification mode of the electronic device is active may be determined according to preset settings. For example, the users 1010 and 1020 may set in advance whether the de-identification mode is active when a call starts. The users 1010 and 1020 may also set this differently depending on whether the other party is registered in the contact list.
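The preset and toggle behavior of FIG. 10 can be sketched as a small settings object. The class name, field names, and default values below are assumptions for illustration; only the green/white icon colors and the contact-based distinction come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class DeidentificationSettings:
    # Hypothetical defaults: de-identify for unknown callers, not for saved contacts.
    default_for_contacts: bool = False
    default_for_unknown: bool = True
    contacts: set = field(default_factory=set)

    def initial_mode(self, other_party_number):
        """Mode applied when a call starts, depending on contact registration."""
        if other_party_number in self.contacts:
            return self.default_for_contacts
        return self.default_for_unknown


def toggle(mode):
    """A click on the de-identification mode icon flips the current mode."""
    return not mode


def icon_color(mode):
    """Green when the mode is active, white when it is inactive."""
    return "green" if mode else "white"
```

A call screen would call `initial_mode` once at call start and `toggle` on each click of the icon.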

FIG. 11 is a diagram illustrating a method by which a call recipient sets voice signal de-identification, according to an embodiment of the present disclosure.

Referring to FIG. 11, the call recipient 1120 may decide, while an incoming call is waiting, whether to activate the de-identification mode. When the call recipient 1120 activates the de-identification mode, the caller 1110 may receive the call recipient's de-identified voice signal.

The electronic device may provide a user interface 1130 related to starting the call while the call is waiting. For example, the user interface 1130 may include at least some of a start call icon, an end call icon, or a de-identification mode icon 1140. The arrangement of the icons in the user interface 1130 is not limited and may be changed in various ways. The user interface 1130 may be displayed on all or part of the screen of the electronic device.

When the user 1120 clicks the de-identification mode icon 1140, the currently set de-identification mode may be toggled. For example, if the de-identification mode of the electronic device is active, the user's click may deactivate it. Conversely, if the de-identification mode is inactive, the user's click may activate it. According to an embodiment of the present disclosure, the de-identification mode icon 1140 may be displayed in green when the de-identification mode is active and in white when it is inactive.

Whether the de-identification mode of the electronic device is active may be determined according to preset settings. For example, the user 1120 may set in advance whether the de-identification mode is active. The user 1120 may also set this differently depending on whether the other party is registered in the contact list.

FIG. 12 is a flowchart of a voice signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 12, the voice signal processing method may begin at step 1210. According to an embodiment of the present disclosure, the voice signal processing method may be performed by the electronic devices 1500 and 1600.

In step 1210, the voice signal processing method may include identifying whether to remove the personal characteristics of the voice signal. If the personal characteristics are to be removed, the method proceeds to step 1220. Otherwise, the method proceeds to step 1240.

According to an embodiment of the present disclosure, the voice signal processing method may identify whether to remove the personal characteristics of the voice signal by obtaining a user input on that question. According to an embodiment of the present disclosure, the user input may be obtained through a user interface such as those of FIGS. 10 and 11.

In step 1220, the voice signal processing method may include generating an output voice signal, in which personal characteristics of the input voice signal are filtered out, by inputting the input voice signal into an artificial intelligence model. Step 1220 may correspond to step 520 of FIG. 5, but is not limited thereto.

In step 1230, the voice signal processing method may include generating encoded voice information by encoding the output voice signal. Step 1230 may correspond to step 530 of FIG. 5, but is not limited thereto.

In step 1240, the voice signal processing method may include generating encoded voice information by encoding the output voice signal.
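The branch in FIG. 12 can be sketched as a single function. This is an illustration under one stated assumption: because step 1220 is skipped on the "no" branch, the sketch treats step 1240 as encoding the unfiltered signal. The model and encoder are passed in as callables and are not part of the disclosure.

```python
def process_by_user_choice(samples, remove_personal_characteristics, deidentify, encode):
    """Follow the flow of FIG. 12 for one voice signal."""
    if remove_personal_characteristics:      # step 1210: user chose de-identification
        output_signal = deidentify(samples)  # step 1220: filter personal characteristics
        return encode(output_signal)         # step 1230: encode the filtered signal
    # Step 1240 (assumption): no filtering was applied, so the signal is encoded as is.
    return encode(samples)
```

The boolean argument would typically come from the de-identification mode icon of FIG. 10 or FIG. 11.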

FIG. 13 is a diagram for explaining a security process using a voice signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 13, the user 1310 may deliver a de-identified voice signal to the external electronic device 1330.

The user 1310 may issue a command to the electronic device 1320, which controls other electronic devices so that commands directed at them are carried out. For example, the user 1310 may say, "Hi Bixby, play music with Samsung Music."

To carry out the command of the user 1310, the electronic device 1320 may transmit a voice signal to the external electronic device 1330. According to an embodiment of the present disclosure, the electronic device 1320 may deliver a de-identified voice signal to the external electronic device 1330 in place of the original voice signal obtained from the user's voice.

According to an embodiment of the present disclosure, when the electronic device delivers a de-identified voice signal, information about the personal characteristics of the user's voice may not be leaked even if the voice signal is obtained from outside by smishing or the external electronic device 1330 is hacked.

FIG. 14 is a diagram for explaining a security process using a voice signal processing method according to an embodiment of the present disclosure.

Referring to FIG. 14, the voice signal processing method can be used to communicate more efficiently in a variety of situations.

In various situations, a user may want to talk to the other party without conveying his or her emotions. The user may deliver to the other party a de-identified voice signal from which the emotional state is excluded. Alternatively, the user may deliver a voice signal that has been de-identified into a specific emotional state.

For example, the user may be on a video call with colleagues (situation 1410). Using the de-identification mode, the user may talk with colleagues in a voice that excludes his or her emotional state, or in a voice with a different emotional state in place of his or her own.

For example, the user may be working from home (situation 1420). The user's emotions may change in response to various surrounding conditions. In such an environment, the user may use the de-identification mode to exclude his or her emotional state or to use a voice with a different emotional state in place of his or her own.

For example, the user may be with a patient whose life is in danger (situation 1430). Alternatively, the user may be concentrating on a specific task such as driving (situation 1440). The user may feel a sense of urgency and speak in an urgent voice, but by using the de-identification mode the user may speak in a voice that excludes the emotional state or that has a different emotional state in place of his or her own.

For example, the user may be consulting with many different people, as a call center employee does (situation 1450). Using the de-identification mode, the user may carry out the work without conveying his or her emotions.

FIG. 15 is a block diagram of an electronic device for processing a voice signal according to an embodiment of the present disclosure.

The electronic device 1500 according to an embodiment of the present disclosure is an electronic device capable of performing voice signal processing and, specifically, may be an electronic device capable of performing voice signal de-identification processing.

The electronic device 1500 according to an embodiment of the present disclosure may include a memory 1510 and a processor 1520. These components are described in turn below.

The memory 1510 may store programs for processing and control by the processor 1520. The memory 1510 according to an embodiment of the present disclosure may store one or more instructions.

The processor 1520 may control the overall operation of the electronic device 1500, and may do so by executing one or more instructions stored in the memory 1510.

The processor 1520 according to an embodiment of the present disclosure may obtain an input voice signal by executing at least one instruction. By executing at least one instruction, the processor 1520 may input the input voice signal into an artificial intelligence model and thereby generate an output voice signal in which personal characteristics of the input voice signal are filtered out. By executing at least one instruction, the processor 1520 may generate encoded voice information by encoding the output voice signal. By executing at least one instruction, the processor 1520 may transmit the encoded voice information to an external electronic device.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, identify the content of the user's conversation included in the input voice signal. By executing at least one instruction, the processor 1520 may identify whether the user's conversation content includes personal information. By executing at least one instruction, the processor 1520 may modify the input voice signal so that the personal information included in the input voice signal is removed.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, obtain a user input selecting one of a plurality of emotional states.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, identify the channel through which the encoded voice information is transmitted. By executing at least one instruction, the processor 1520 may modify the encoded voice signal based on the type of the channel.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, obtain a user input on whether to remove the personal characteristics of the voice signal. By executing at least one instruction, the processor 1520 may, based on a user input to remove the personal characteristics, generate the output voice signal by inputting the input voice signal into the artificial intelligence model.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, extract voice features of the output voice signal. By executing at least one instruction, the processor 1520 may generate the encoded voice information by encoding the voice features.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, transmit to the external electronic device information indicating that the personal characteristics have been filtered out of the encoded voice information.

According to an embodiment of the present disclosure, the processor 1520 may, by executing at least one instruction, obtain information about the similarity between the input voice signal and a target voice signal. The target voice signal may be an average voice signal generated based on voice signals obtained from a plurality of users.

However, not all of the illustrated components are essential. The electronic device 1500 may be implemented with more components than those illustrated, or with fewer. For example, as shown in FIG. 16, an electronic device 1600 according to an embodiment of the present disclosure may include a memory 1610, a processor 1620, a receiving unit 1630, an output unit 1640, a communication unit 1650, a user interface unit 1660, and an external device interface unit 1670, but is not limited thereto.

FIG. 16 is a block diagram of an electronic device for processing a voice signal, according to an embodiment of the present disclosure.

The electronic device 1600 according to an embodiment of the present disclosure may be an electronic device capable of performing voice signal processing. The electronic device may be any of various types of devices usable by a user, such as a mobile phone, a tablet PC, a PDA, an MP3 player, a kiosk, an electronic picture frame, a navigation device, a digital TV, an Internet of Things (IoT) device, a cleaning robot, a social robot, a voice assistant, or a wearable device such as a wrist watch or a head-mounted display (HMD). The electronic device 1600 may correspond to the electronic devices described with reference to FIGS. 1 to 15, but is not limited thereto.

In addition to the memory 1610 and the processor 1620, the electronic device 1600 may further include a receiving unit 1630, an output unit 1640, a communication unit 1650, a user interface unit 1660, an external device interface unit 1670, and a power supply unit (not shown). These components are described in turn below.

The memory 1610 may store programs for processing and control by the processor 1620. The memory 1610 according to an embodiment of the present disclosure may store one or more instructions, and may include at least one of an internal memory (not shown) and an external memory (not shown). The memory 1610 may store various programs and data used in the operation of the electronic device 1600. For example, the memory 1610 may store an artificial intelligence model for performing de-identification. As another example, the memory 1610 may include a generative model for decoding a voice signal.

The internal memory may include, for example, at least one of a volatile memory (e.g., dynamic RAM (DRAM), static RAM (SRAM), or synchronous dynamic RAM (SDRAM)), a non-volatile memory (e.g., one-time programmable ROM (OTPROM), programmable ROM (PROM), erasable and programmable ROM (EPROM), electrically erasable and programmable ROM (EEPROM), mask ROM, or flash ROM), a hard disk drive (HDD), or a solid state drive (SSD). According to an embodiment, the processor 1620 may load a command or data received from at least one of the non-volatile memory or another component into the volatile memory and process it. The processor 1620 may also store, in the non-volatile memory, data received from or generated by other components.

The external memory may include, for example, at least one of CompactFlash (CF), Secure Digital (SD), Micro Secure Digital (Micro-SD), Mini Secure Digital (Mini-SD), extreme Digital (xD), and Memory Stick.

The processor 1620 may control the overall operation of the electronic device 1600, and may do so by executing one or more instructions stored in the memory 1610. For example, by executing programs stored in the memory 1610, the processor 1620 may control the memory 1610, the receiving unit 1630, the output unit 1640, the communication unit 1650, the user interface unit 1660, the external device interface unit 1670, and the power supply unit (not shown).

The processor 1620 may include at least one of a RAM, a ROM, a CPU, a GPU, and a bus. The RAM, ROM, CPU, GPU, and the like may be connected to one another through the bus. According to an embodiment of the present disclosure, the processor 1620 may include an AI processor for generating the artificial intelligence model that performs de-identification, but is not limited thereto. According to an embodiment of the present disclosure, the AI processor may be implemented as a chip separate from the processor 1620. According to an embodiment of the present disclosure, the AI processor may be a general-purpose chip.

The processor 1620 according to an embodiment of the present disclosure may, by executing at least one instruction, obtain an input voice signal. The processor 1620 may, by executing at least one instruction, input the input voice signal into an artificial intelligence model and thereby generate an output voice signal from which personal characteristics of the input voice signal have been filtered. The processor 1620 may, by executing at least one instruction, generate encoded voice information by encoding the output voice signal. The processor 1620 may, by executing at least one instruction, transmit the encoded voice information to an external electronic device. However, each operation performed by the processor 1620 may instead be performed by a separate server (not shown). The server may be a cloud-based server, but is not limited thereto.

The receiving unit 1630 may include a microphone unit that is built into the electronic device 1600 or disposed externally, and the microphone unit may include one or more microphones. Specifically, the processor 1620 may control the receiving unit 1630 to receive an analog voice signal of the user. The processor 1620 may also obtain the user's voice signal input through the receiving unit 1630. The voice signal received by the electronic device 1600 through the receiving unit 1630 may be digitized and transmitted to the processor 1620 of the electronic device 1600.

However, the voice signal may instead be received through a separate external electronic device that includes a microphone, or through a portable terminal that includes a microphone. In this case, the electronic device 1600 may not include the receiving unit 1630. Specifically, an analog voice signal received through the external electronic device or the portable terminal may be digitized and then received by the electronic device 1600 through communication (e.g., Bluetooth), but the disclosure is not limited thereto.

The output unit 1640 may include at least one of a display unit 1641 and an audio output unit 1642.

The display unit 1641 may include a display panel and a controller (not shown) that controls the display panel, and may be a display built into the electronic device 1600. The display panel may be implemented as any of various types of display, such as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, an active-matrix organic light-emitting diode (AM-OLED) display, or a plasma display panel (PDP). The display panel may be implemented to be flexible, transparent, or wearable. The display unit 1641 may be combined with a touch panel of the user interface unit 1660 and provided as a touch screen. For example, the touch screen may include an integrated module in which the display panel and the touch panel are combined in a stacked structure.

The display unit 1641 according to an embodiment of the present disclosure may, under the control of the processor 1620, output a user interface for obtaining an input from the user.

The audio output unit 1642 may be an output unit including at least one speaker. The processor 1620 according to some embodiments may control the audio output unit 1642 to output a decoded voice signal.

The communication unit 1650 may include one or more components that enable communication between the electronic device 1600 and a plurality of devices located around the electronic device 1600. The communication unit 1650 may also include one or more components that enable communication between the electronic device 1600 and a server. Specifically, the communication unit 1650 may communicate with various types of external devices or servers according to various communication schemes. The communication unit 1650 may include a short-range communication unit.

The short-range communication unit (short-range wireless communication unit) may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, an ANT+ communication unit, an Ethernet communication unit, and the like, but is not limited thereto.

Specifically, when the operations of the processor 1620 are performed by a server (not shown), the electronic device 1600 may be connected to the server through a Wi-Fi module or an Ethernet module of the communication unit 1650, but is not limited thereto. Here, the server may be a cloud-based server. In addition, the electronic device 1600 may be connected, through the Bluetooth communication unit of the communication unit 1650, to an external electronic device that receives the voice signal, but is not limited thereto. For example, the electronic device 1600 may be connected to the external electronic device that receives the voice signal through at least one of the Wi-Fi module and the Ethernet module of the communication unit 1650.

The user interface unit 1660 may receive various instructions from the user. The user interface unit 1660 may include at least one of a key, a touch panel, and a pen recognition panel. The electronic device 1600 may display various content or user interfaces according to a user input received from at least one of the key, the touch panel, and the pen recognition panel. The key may include various types of keys, such as mechanical buttons and wheels, formed in various areas of the exterior of the main body of the electronic device 1600, such as the front, side, or rear. The touch panel may detect a touch input of the user and output a touch event value corresponding to the detected touch signal. When the touch panel is combined with the display panel to form a touch screen (not shown), the touch screen may be implemented with various types of touch sensors, such as capacitive, resistive, or piezoelectric sensors. A threshold related to the similarity between a voice signal and at least one preset trigger word according to an embodiment of the present disclosure may be adaptively adjusted through the user interface unit 1660, but is not limited thereto.

The external device interface unit 1670 provides an interface environment between the electronic device 1600 and various external devices. The external device interface unit 1670 may include an A/V input/output unit. The external device interface unit 1670 may be connected, by wire or wirelessly, to external devices such as a Digital Versatile Disc (DVD) or Blu-ray player, a game device, a camera, a computer, an air conditioner, a laptop, a desktop, a television, or a digital display device. The external device interface unit 1670 may transfer image, video, and audio signals input through a connected external device to the processor 1620 of the electronic device 1600. The processor 1620 may control processed data signals, such as 2D images, 3D images, video, and audio, to be output to the connected external device. So that video and audio signals of an external device can be input to the electronic device 1600, the A/V input/output unit may include a USB terminal, a Composite Video, Blanking, and Sync (CVBS) terminal, a component terminal, an S-video terminal (analog), a Digital Visual Interface (DVI) terminal, a High-Definition Multimedia Interface (HDMI) terminal, a DisplayPort (DP) terminal, a Thunderbolt terminal, an RGB terminal, a D-SUB terminal, and the like. The processor 1620 according to an embodiment of the present disclosure may be connected to an external electronic device that receives a voice signal through an interface of the external device interface unit 1670, such as the HDMI terminal.

The electronic device 1600 may further include a power supply unit (not shown). Under the control of the processor 1620, the power supply unit may supply power to the components of the electronic device 1600. Under the control of the processor 1620, the power supply unit may supply power input from an external power source through a power cord to each component of the electronic device 1600.

According to an aspect of the present disclosure, a voice signal processing method is provided. According to an embodiment of the present disclosure, the voice signal processing method may include obtaining an input voice signal. The method may include generating, by inputting the input voice signal into an artificial intelligence model, an output voice signal from which personal characteristics of the input voice signal have been filtered. The method may include generating encoded voice information by encoding the output voice signal. The method may include transmitting the encoded voice information to an external electronic device. The artificial intelligence model may be trained such that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.
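As a non-limiting illustration, the four operations above (obtain, filter, encode, transmit) may be sketched in Python as follows. The class `VoicePipeline` and the helpers `toy_deidentify` and `toy_encode` are hypothetical names introduced only for this sketch; in particular, `toy_deidentify` is a placeholder that merely peak-normalizes the samples and does not represent the trained artificial intelligence model of the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VoicePipeline:
    """Chains the three stages; each stage is supplied by the caller."""
    deidentify: Callable[[Sequence[float]], List[float]]  # stands in for the AI model
    encode: Callable[[Sequence[float]], bytes]            # stands in for the speech encoder
    transmit: Callable[[bytes], None]                     # stands in for the communication unit

    def process(self, input_signal: Sequence[float]) -> bytes:
        output_signal = self.deidentify(input_signal)  # filter personal characteristics
        encoded = self.encode(output_signal)           # generate encoded voice information
        self.transmit(encoded)                         # send to the external device
        return encoded


def toy_deidentify(signal: Sequence[float]) -> List[float]:
    # Placeholder only: peak-normalize the samples. A real model would
    # filter characteristics such as accent, pronunciation, or tone.
    peak = max((abs(s) for s in signal), default=0.0)
    return [s / peak if peak else 0.0 for s in signal]


def toy_encode(signal: Sequence[float]) -> bytes:
    # Assumed encoding: clamp to [-1, 1] and quantize each sample to 8 bits.
    return bytes(
        int(round((max(-1.0, min(1.0, s)) + 1.0) * 127.5)) for s in signal
    )


if __name__ == "__main__":
    sent: List[bytes] = []
    pipeline = VoicePipeline(toy_deidentify, toy_encode, sent.append)
    pipeline.process([0.0, 0.5, -0.25])
    print(len(sent), "frame(s) transmitted")
```

Passing the stages in as callables keeps the sketch agnostic as to whether a given stage runs on the device or on a separate server, which the disclosure leaves open.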

According to an embodiment of the present disclosure, the artificial intelligence model may be trained to minimize a third loss function that is the sum of a first loss function, relating to the difference between the restored voice signal and the input voice signal, and a second loss function, relating to the difference between the restored voice signal and a target voice signal. According to an embodiment of the present disclosure, the target voice signal may be an average voice signal generated based on voice signals obtained from a plurality of users.
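As a non-limiting illustration, the training objective above may be sketched as follows. The disclosure does not specify the form of the first and second loss functions; this sketch assumes mean squared error for both, and also assumes that the average voice signal is a sample-wise mean of equal-length signals. The function names are hypothetical.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals (assumed loss form)."""
    if len(a) != len(b) or not a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def average_signal(signals: Sequence[Sequence[float]]) -> List[float]:
    """Sample-wise mean of several users' signals (assumed target construction)."""
    if not signals:
        raise ValueError("at least one signal is required")
    return [sum(column) / len(signals) for column in zip(*signals)]


def third_loss(restored: Sequence[float],
               input_signal: Sequence[float],
               target_signal: Sequence[float]) -> float:
    first = mse(restored, input_signal)    # keeps the restored speech close to the input
    second = mse(restored, target_signal)  # pulls the restored speech toward the average voice
    return first + second                  # third loss = first loss + second loss


if __name__ == "__main__":
    target = average_signal([[0.0, 2.0], [2.0, 0.0]])
    print(third_loss([1.0, 1.0], [1.0, 1.0], target))
```

The two terms pull in opposite directions: the first preserves the content of the input, while the second moves the restored signal toward the average voice, so minimizing their sum trades intelligibility against de-identification.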

According to an embodiment of the present disclosure, the voice signal processing method may include obtaining information about the similarity between the input voice signal and the target voice signal. According to an embodiment of the present disclosure, the artificial intelligence model may be the one, among a plurality of artificial intelligence models, that corresponds to the similarity.
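As a non-limiting illustration, selecting a model according to the similarity may be sketched as follows. The disclosure does not state how the similarity is measured or how it maps to a model; this sketch assumes cosine similarity between feature vectors and a threshold table, and further assumes that an input closer to the target voice needs weaker filtering. The model names and thresholds are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# (minimum similarity, model name), in descending order of threshold.
# Names and thresholds are made up for illustration.
MODEL_TABLE: List[Tuple[float, str]] = [
    (0.9, "light_filter"),
    (0.5, "medium_filter"),
    (-1.0, "strong_filter"),
]


def select_model(similarity: float) -> str:
    """Return the first model whose threshold the similarity meets."""
    for threshold, name in MODEL_TABLE:
        if similarity >= threshold:
            return name
    return MODEL_TABLE[-1][1]


if __name__ == "__main__":
    sim = cosine_similarity([1.0, 1.0], [1.0, 0.0])
    print(select_model(sim))
```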

According to an embodiment of the present disclosure, the voice signal processing method may include identifying the content of the user's conversation included in the input voice signal. The method may include identifying whether the content of the user's conversation includes personal information. The method may include modifying the input voice signal such that the personal information included in the input voice signal is removed.
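As a non-limiting illustration, the three operations above may be sketched as follows. The sketch assumes that a speech recognizer (not shown) has already produced a transcript with per-word sample ranges, that personal information is detected by a simple rule (a run of three or more digits), and that removal means silencing the corresponding samples. All of these choices, and the names `Word`, `contains_personal_info`, and `redact`, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    text: str   # recognized word (output of an assumed speech recognizer)
    start: int  # index of the first sample of the word
    end: int    # index one past the last sample of the word


# Hypothetical detection rule: any run of three or more digits.
PERSONAL_INFO_PATTERN = re.compile(r"\d{3,}")


def contains_personal_info(word: Word) -> bool:
    return bool(PERSONAL_INFO_PATTERN.search(word.text))


def redact(signal: Sequence[float], words: Sequence[Word]) -> List[float]:
    """Return a copy of the signal with flagged words replaced by silence."""
    out = list(signal)
    for word in words:
        if contains_personal_info(word):
            for i in range(max(0, word.start), min(len(out), word.end)):
                out[i] = 0.0
    return out


if __name__ == "__main__":
    transcript = [Word("call", 0, 2), Word("5551234", 2, 5)]
    print(redact([0.1] * 6, transcript))
```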

According to an embodiment of the present disclosure, the voice signal processing method may include obtaining a user input selecting one of a plurality of emotional states. According to an embodiment of the present disclosure, the artificial intelligence model may be the one, among artificial intelligence models for the plurality of emotional states, that corresponds to the selected emotional state.

According to an embodiment of the present disclosure, the voice signal processing method may include identifying a channel through which the encoded voice information is transmitted. The method may include modifying the encoded voice signal based on the type of the channel.
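As a non-limiting illustration, one possible channel-dependent modification is sketched below. The disclosure does not specify the channel types or the nature of the modification; this sketch assumes three channel types and assumes that the modification consists of splitting the encoded data into packets no larger than a per-channel payload limit. The enum members and the limits are made-up values.

```python
from enum import Enum
from typing import Dict, List


class ChannelType(Enum):
    CELLULAR = "cellular"
    VOIP = "voip"
    BLUETOOTH = "bluetooth"


# Hypothetical per-channel payload limits, in bytes.
MAX_PAYLOAD: Dict[ChannelType, int] = {
    ChannelType.CELLULAR: 32,
    ChannelType.VOIP: 160,
    ChannelType.BLUETOOTH: 20,
}


def adapt_to_channel(encoded: bytes, channel: ChannelType) -> List[bytes]:
    """Split the encoded data into packets that fit the identified channel."""
    size = MAX_PAYLOAD[channel]
    return [encoded[i:i + size] for i in range(0, len(encoded), size)]


if __name__ == "__main__":
    packets = adapt_to_channel(bytes(50), ChannelType.BLUETOOTH)
    print([len(p) for p in packets])
```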

According to an embodiment of the present disclosure, the voice signal processing method may include obtaining a user input indicating whether to remove personal characteristics of the voice signal. The method may include generating, based on a user input requesting removal of the personal characteristics, the output voice signal by inputting the input voice signal into the artificial intelligence model.

According to an embodiment of the present disclosure, the voice signal processing method may include extracting voice features of the output voice signal. The method may include generating the encoded voice information by encoding the voice features.
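As a non-limiting illustration, extracting features and then encoding them may be sketched as follows. The disclosure does not name the voice features or the encoding; this sketch assumes per-frame root-mean-square (RMS) energy as the feature and 8-bit quantization as the encoding, with samples assumed to lie in [-1, 1]. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def frame_rms(signal: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of each full, non-overlapping frame (assumed feature)."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    features = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[i:i + frame_len]
        features.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return features


def encode_features(features: Sequence[float]) -> bytes:
    """Quantize each feature to one byte, saturating at 255 (assumed encoding)."""
    return bytes(min(255, int(round(f * 255))) for f in features)


if __name__ == "__main__":
    feats = frame_rms([1.0, -1.0, 0.0, 0.0], frame_len=2)
    print(list(encode_features(feats)))
```

Encoding features rather than raw samples reduces the amount of data to transmit; the receiving side would then need a decoder, such as the generative model mentioned elsewhere in the description, to restore a waveform.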

According to an embodiment of the present disclosure, the voice signal processing method may include transmitting, to the external electronic device, information indicating that personal characteristics have been filtered from the encoded voice information.
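As a non-limiting illustration, the indication may be carried as a flag in a small header placed in front of the encoded voice information. The frame layout below (one flag byte followed by a two-byte payload length, in network byte order) is an assumption made for this sketch and is not taken from the disclosure or from any standard.

```python
import struct
from typing import Tuple

# Assumed header layout: flags (1 byte), payload length (2 bytes), big-endian.
HEADER = struct.Struct("!BH")
FLAG_DEIDENTIFIED = 0x01  # hypothetical flag bit: personal characteristics filtered


def pack_frame(payload: bytes, deidentified: bool) -> bytes:
    """Prefix the encoded voice information with the header."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 16-bit length field")
    flags = FLAG_DEIDENTIFIED if deidentified else 0
    return HEADER.pack(flags, len(payload)) + payload


def unpack_frame(frame: bytes) -> Tuple[bool, bytes]:
    """Return (deidentified flag, payload) as seen by the receiving device."""
    flags, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    return bool(flags & FLAG_DEIDENTIFIED), payload


if __name__ == "__main__":
    frame = pack_frame(b"abc", deidentified=True)
    print(unpack_frame(frame))
```

With such a flag, the receiving device can tell whether the restored voice is expected to lack the speaker's personal characteristics, for example to inform its user.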

According to an embodiment of the present disclosure, the personal characteristics may include at least one of the user's accent, the user's pronunciation, or the user's tone.

According to an aspect of the present disclosure, an electronic device for voice signal processing is provided. According to an embodiment of the present disclosure, the electronic device may include a memory storing at least one instruction and at least one processor operating according to the at least one instruction. The at least one processor may, by executing the at least one instruction, obtain an input voice signal. The at least one processor may, by executing the at least one instruction, input the input voice signal into an artificial intelligence model and thereby generate an output voice signal from which personal characteristics of the input voice signal have been filtered. The at least one processor may, by executing the at least one instruction, generate encoded voice information by encoding the output voice signal. The at least one processor may, by executing the at least one instruction, transmit the encoded voice information to an external electronic device. According to an embodiment of the present disclosure, the artificial intelligence model may be trained such that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, obtain information about the similarity between the input voice signal and a target voice signal. According to an embodiment of the present disclosure, the artificial intelligence model may be the one, among a plurality of artificial intelligence models, that corresponds to the similarity.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, identify the content of the user's conversation included in the input voice signal. The at least one processor may, by executing the at least one instruction, identify whether the content of the user's conversation includes personal information. The at least one processor may, by executing the at least one instruction, modify the input voice signal such that the personal information included in the input voice signal is removed.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, obtain a user input selecting one of a plurality of emotional states. According to an embodiment of the present disclosure, the artificial intelligence model may be the one, among artificial intelligence models for the plurality of emotional states, that corresponds to the selected emotional state.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, identify a channel through which the encoded voice information is transmitted. The at least one processor may, by executing the at least one instruction, modify the encoded voice signal based on the type of the channel.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, obtain a user input indicating whether to remove personal characteristics of the voice signal. Based on a user input requesting removal of the personal characteristics, the at least one processor may, by executing the at least one instruction, generate the output voice signal by inputting the input voice signal into the artificial intelligence model.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, extract voice features of the output voice signal. The at least one processor may, by executing the at least one instruction, generate the encoded voice information by encoding the voice features.

According to an embodiment of the present disclosure, the at least one processor may, by executing the at least one instruction, transmit to the external electronic device information indicating that personal characteristics have been filtered from the encoded voice information.

A machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, 'non-transitory storage medium' only means that the medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between the case where data is stored semi-permanently in the storage medium and the case where it is stored temporarily. For example, a 'non-transitory storage medium' may include a buffer in which data is temporarily stored.

According to an embodiment, the methods according to the various embodiments disclosed in this document may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store or directly between two user devices (e.g., smartphones). In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored, or temporarily generated, in a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

Functions related to artificial intelligence according to the present disclosure are operated through a processor and a memory. The processor may consist of one or more processors. The one or more processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-dedicated processor such as a GPU or a vision processing unit (VPU), or an artificial-intelligence-dedicated processor such as an NPU. The one or more processors control input data to be processed according to a predefined operation rule or an artificial intelligence model stored in the memory. Alternatively, when the one or more processors are artificial-intelligence-dedicated processors, they may be designed with a hardware structure specialized for processing a specific artificial intelligence model.

The predefined operation rule or artificial intelligence model is created through learning. Here, being created through learning means that a basic artificial intelligence model is trained by a learning algorithm using a large amount of training data, so that a predefined operation rule or artificial intelligence model configured to perform a desired characteristic (or purpose) is created. Such learning may be performed on the device itself on which the artificial intelligence according to the present disclosure is executed, or may be performed through a separate server and/or system. Examples of the learning algorithm include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to these examples.

The artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values and performs a neural network operation through an operation between the operation result of the previous layer and the plurality of weight values. The weight values of the plurality of neural network layers may be optimized based on the training result of the artificial intelligence model. For example, the weight values may be updated so that a loss value or a cost value obtained from the artificial intelligence model during the training process is reduced or minimized. The artificial neural network may include a deep neural network (DNN), for example, a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), or a Deep Q-Network, but is not limited to these examples.
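The weight update described above can be illustrated with a toy Python sketch. It trains a single scalar weight by gradient descent on a loss that is the sum of two mean-squared-error terms, mirroring the structure recited in the claims below (one term toward the input signal, one toward a target signal). The one-weight "model", mean squared error as the measure of difference, the learning rate, and all function names are assumptions for illustration; the model output here merely stands in for the restored voice signal.

```python
from typing import List, Sequence


def forward(w: float, x: Sequence[float]) -> List[float]:
    """Toy one-weight 'layer': scales every input sample by w."""
    return [w * xi for xi in x]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length signals."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def total_loss(w: float, x: Sequence[float], target: Sequence[float]) -> float:
    """Sum of a loss toward the input and a loss toward the target."""
    y = forward(w, x)
    return mse(y, x) + mse(y, target)


def grad(w: float, x: Sequence[float], target: Sequence[float]) -> float:
    """Analytic derivative of total_loss with respect to w."""
    return sum(
        2 * (w * xi - xi) * xi + 2 * (w * xi - ti) * xi
        for xi, ti in zip(x, target)
    ) / len(x)


def train(w: float, x: Sequence[float], target: Sequence[float],
          lr: float = 0.05, steps: int = 200) -> float:
    """Repeatedly move w against the gradient so the loss decreases."""
    for _ in range(steps):
        w -= lr * grad(w, x, target)
    return w


x = [1.0, -1.0, 0.5]
target = [0.5 * xi for xi in x]  # hypothetical "average" target signal
w_trained = train(0.0, x, target)
```

Because the two terms pull the output toward the input and toward the target respectively, the trained weight in this toy case settles between the value that reproduces the input (1.0) and the value that reproduces the target (0.5).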

Claims

A voice signal processing method comprising:
obtaining an input voice signal (510);
generating, by inputting the input voice signal into an artificial intelligence model, an output voice signal in which personal characteristics of the input voice signal are filtered (520);
generating encoded voice information by encoding the output voice signal (530); and
transmitting the encoded voice information to an external electronic device (540),
wherein the artificial intelligence model is trained such that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.

The method of claim 1,
wherein the artificial intelligence model is trained to minimize a third loss function corresponding to the sum of a first loss function related to a difference between the restored voice signal and the input voice signal and a second loss function related to a difference between the restored voice signal and a target voice signal, and
wherein the target voice signal is an average voice signal generated based on voice signals obtained from a plurality of users.

The method of claim 2,
further comprising obtaining information about a similarity between the input voice signal and the target voice signal,
wherein the artificial intelligence model is, among a plurality of artificial intelligence models, the artificial intelligence model corresponding to the similarity.

The method of any one of claims 1 to 3, further comprising:
identifying user conversation content included in the input voice signal;
identifying whether the user conversation content includes personal information; and
modifying the input voice signal so that the personal information included in the input voice signal is removed.

The method of any one of claims 1 to 4,
further comprising obtaining a user input for selecting one of a plurality of emotional states,
wherein the artificial intelligence model is, among artificial intelligence models for the plurality of emotional states, the artificial intelligence model corresponding to the selected emotional state.

The method of any one of claims 1 to 5, further comprising:
identifying a channel through which the encoded voice information is transmitted; and
modifying the encoded voice signal based on the type of the channel.

The method of any one of claims 1 to 6,
further comprising obtaining a user input regarding whether to remove the personal characteristics of the voice signal,
wherein the generating of the output voice signal comprises
generating the output voice signal by inputting the input voice signal into the artificial intelligence model, based on the user input for removing the personal characteristics.

The method of any one of claims 1 to 7,
wherein the generating of the encoded voice information comprises:
extracting voice features from the output voice signal; and
generating the encoded voice information by encoding the voice features.

The method of any one of claims 1 to 8,
further comprising transmitting, to the external electronic device, information indicating that the encoded voice information has had the personal characteristics filtered.

The method of any one of claims 1 to 9,
wherein the personal characteristics include at least one of a user's intonation, a user's pronunciation, or a user's tone.

An electronic device for processing a voice signal, the electronic device comprising:
a memory (1510, 1610) storing at least one instruction; and
at least one processor (1520, 1620) operating according to the at least one instruction,
wherein the at least one processor (1520, 1620) is configured to, by executing the at least one instruction:
obtain an input voice signal;
generate, by inputting the input voice signal into an artificial intelligence model, an output voice signal in which personal characteristics of the input voice signal are filtered;
generate encoded voice information by encoding the output voice signal; and
transmit the encoded voice information to an external electronic device,
wherein the artificial intelligence model is trained such that a restored voice signal generated based on the encoded voice information does not include the personal characteristics.

The electronic device of claim 11,
wherein the artificial intelligence model is trained to minimize a third loss function corresponding to the sum of a first loss function related to a difference between the restored voice signal and the input voice signal and a second loss function related to a difference between the restored voice signal and a target voice signal, and
wherein the target voice signal is an average voice signal generated based on voice signals obtained from a plurality of users.

The electronic device of claim 12,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction,
obtain information about a similarity between the input voice signal and the target voice signal, and
wherein the artificial intelligence model is, among a plurality of artificial intelligence models, the artificial intelligence model corresponding to the similarity.

The electronic device of any one of claims 11 to 13,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction:
identify user conversation content included in the input voice signal;
identify whether the user conversation content includes personal information; and
modify the input voice signal so that the personal information included in the input voice signal is removed.

The electronic device of any one of claims 11 to 14,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction,
obtain a user input for selecting one of a plurality of emotional states, and
wherein the artificial intelligence model is, among artificial intelligence models for the plurality of emotional states, the artificial intelligence model corresponding to the selected emotional state.

The electronic device of any one of claims 11 to 15,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction:
identify a channel through which the encoded voice information is transmitted; and
filter the encoded voice signal based on the type of the channel.

The electronic device of any one of claims 11 to 16,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction:
obtain a user input regarding whether to remove the personal characteristics of the voice signal; and
generate the output voice signal by inputting the input voice signal into the artificial intelligence model, based on the user input for removing the personal characteristics.

The electronic device of any one of claims 11 to 17,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction:
extract voice features from the output voice signal; and
generate the encoded voice information by encoding the voice features.

The electronic device of any one of claims 11 to 18,
wherein the at least one processor (1520, 1620) is further configured to, by executing the at least one instruction,
transmit, to the external electronic device, information indicating that the encoded voice information has had the personal characteristics filtered.

A computer-readable recording medium having recorded thereon instructions for causing a computer to perform the method of any one of claims 1 to 10.