KR20220095988A

KR20220095988A - Apparatus and method for detecting a voice attack against the voice assistant service

Info

Publication number: KR20220095988A
Application number: KR1020200188055A
Authority: KR
Inventors: 이하윤; 서재우; 심우철
Original assignee: 삼성전자주식회사
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-07
Also published as: WO2022145835A1

Abstract

Disclosed are a server and a method for detecting the generation of a voice attack against a voice assistant service and a device providing the service. The server, which provides the voice assistant service and detects the voice attack on the device, comprises a communication interface, a storage unit, and a processor. The disclosed server can thereby prevent a completed voice attack from being generated.

Description

APPARATUS AND METHOD FOR DETECTING A VOICE ATTACK AGAINST THE VOICE ASSISTANT SERVICE

Various embodiments relate to a voice assistant service and to a method for detecting and defending against a voice attack on a device that provides the service.

As multimedia and network technologies have advanced, users have become able to receive a variety of services through their devices. In particular, with advances in speech recognition technology, a user can input speech (for example, an utterance) to a device and receive a response message to the speech input through a service-providing agent.

However, when a user plays audio in an environment such as a home network that includes a plurality of devices, a voice attack may be hidden in the played audio. The voice assistant service may then be activated without the user noticing, and the plurality of devices may be controlled contrary to the user's intention.

An artificial intelligence model may be used to detect a hidden voice attack. However, to evade detection, a voice attack may be generated in an environment that itself includes the artificial intelligence model used for detection and then hidden in audio or the like, in which case the hidden voice attack is difficult to detect. Accordingly, a technique is needed for detecting and defending against the voice attacks that are attempted while a completed voice attack is still being generated, before the voice attack to be hidden in audio is completed.

Various embodiments may provide an apparatus for detecting a voice attack on a voice assistant service, and an operating method thereof. The apparatus compares an input audio signal with pre-stored audio signals to identify similar audio signals, and compares text converted from the input audio signal with texts converted from the identified similar audio signals. It thereby detects events determined to be voice attacks on the voice assistant service and can perform an operation to prevent a completed voice attack from being generated.

As a technical means for achieving the above technical object, a method according to an embodiment, by which a server providing a voice assistant service detects a voice attack on a device, may include: receiving an input audio signal from the device; comparing the received input audio signal with a plurality of audio signals that were received from the device and pre-stored in the server; identifying, based on a result of the comparison between the received input audio signal and the plurality of pre-stored audio signals, similar audio signals of the input audio signal from among the plurality of pre-stored audio signals; converting the received input audio signal into a first text by performing Automatic Speech Recognition (ASR); obtaining second texts converted from the identified similar audio signals; comparing the first text converted from the input audio signal with the second texts converted from the similar audio signals; and determining, based on a result of the comparison between the first text and the second texts, whether the input audio signal is an audio signal for attacking the voice assistant service, wherein the plurality of pre-stored audio signals were provided from the device to the server before the input audio signal was received.

In addition, a server according to an embodiment, which provides a voice assistant service and detects a voice attack on a device, may include: a communication interface that performs data communication with the device; a storage unit that stores a program including one or more instructions; and a processor that executes the one or more instructions of the program stored in the storage unit. The processor controls the communication interface to receive an input audio signal from the device; compares the received input audio signal with a plurality of audio signals that were received from the device and pre-stored in the storage unit; identifies, based on a result of the comparison between the received input audio signal and the plurality of pre-stored audio signals, similar audio signals of the input audio signal from among the plurality of pre-stored audio signals; converts the received input audio signal into a first text by performing Automatic Speech Recognition (ASR); obtains second texts converted from the identified similar audio signals; compares the first text converted from the input audio signal with the second texts converted from the similar audio signals; and determines, based on a result of the comparison between the first text and the second texts, whether the input audio signal is an audio signal for attacking the voice assistant service, wherein the plurality of pre-stored audio signals were provided from the device to the server before the input audio signal was received.
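The sequence of steps recited above can be sketched in a few lines of Python. This is only an illustrative sketch: the specification does not state which audio or text similarity measures are used, so cosine similarity over raw samples and `difflib.SequenceMatcher` are stand-ins, and the function name `detect_attack`, the `asr` callable, and all three thresholds are hypothetical.

```python
from difflib import SequenceMatcher
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sample sequences (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def detect_attack(input_audio, stored, asr,
                  audio_threshold=0.9, text_threshold=0.8, min_similar=2):
    """Apply the recited steps to one input audio signal.

    `stored` is a list of (audio, text) pairs received earlier from the same
    device; `asr` is any callable mapping audio to text. The thresholds are
    illustrative assumptions, not values from the specification.
    """
    # Compare the input with the pre-stored signals and keep the similar ones.
    similar = [(audio, text) for audio, text in stored
               if cosine_similarity(input_audio, audio) >= audio_threshold]
    # Convert the input audio signal into the first text.
    first_text = asr(input_audio)
    # Obtain the second texts converted from the similar audio signals.
    second_texts = [text for _, text in similar]
    # Compare the first text with each second text.
    differing = [t for t in second_texts
                 if text_similarity(first_text, t) < text_threshold]
    # Several similar-sounding inputs with differing transcripts are suspect.
    return len(differing) >= min_similar
```

In this sketch the decision rule is deliberately simple (a count of similar signals whose transcripts differ); a real system could weight the audio and text comparison results differently.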

The server for detecting a voice attack according to an embodiment detects audio signals used by an adversarial user to generate a voice attack, based on whether a plurality of similar audio signals are input during a predetermined period and whether the degree of variation among the texts converted from those similar audio signals is greater than or equal to a predetermined criterion. The server can thereby discover, at an early stage, the voice attacks that are input in order to generate a completed voice attack. In addition, by performing operations for defending against the voice attack based on the detection result, the server can prevent a completed voice attack from being generated.
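The "predetermined period" and "predetermined criterion" mentioned above could be tracked with a simple sliding window. The sketch below is an assumption-laden illustration: the class name `SimilarSignalWindow`, the use of a mean variation score, and the default values are all hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SimilarSignalWindow:
    """Tracks similar audio signals seen within a fixed period (assumed design)."""

    def __init__(self, period_s=600.0, min_count=3, min_variation=0.3):
        self.period_s = period_s            # length of the observation period
        self.min_count = min_count          # how many similar signals are required
        self.min_variation = min_variation  # required mean transcript variation
        self._events = deque()              # (timestamp, text_variation) pairs

    def add(self, timestamp, text_variation):
        """Record one similar signal and its transcript variation in [0, 1]."""
        self._events.append((timestamp, text_variation))
        # Drop events that fall outside the period ending at `timestamp`.
        while self._events and timestamp - self._events[0][0] > self.period_s:
            self._events.popleft()

    def is_suspicious(self):
        """True when enough similar signals with enough variation were seen."""
        if len(self._events) < self.min_count:
            return False
        mean = sum(v for _, v in self._events) / len(self._events)
        return mean >= self.min_variation
```

A variation score could be, for example, one minus a text similarity ratio between the new transcript and the transcripts of the similar signals; that choice is left open here.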

FIG. 1A is a diagram for describing a voice attack generation phase and an operation in which a server detects the voice attack generation phase, according to an embodiment.
FIG. 1B is a flowchart describing in detail the voice attack generation phase of FIG. 1A that is detected by a server according to an embodiment.
FIG. 2 is a block diagram illustrating a configuration of a server according to an embodiment.
FIG. 3 is a flowchart illustrating a method by which a server operates to detect the generation of a voice attack on a device, according to an embodiment.
FIGS. 4A and 4B are diagrams for describing a method by which a server compares an input audio signal with a plurality of pre-stored audio signals using an audio comparison module, according to an embodiment.
FIG. 5 is a diagram for describing a method by which a server compares a first text converted from an input audio signal with second texts converted from a plurality of pre-stored audio signals using a text comparison module, according to an embodiment.
FIG. 6 is a diagram for describing a method by which a server determines, using an attack determination module, whether an input audio signal is an audio signal for attacking a voice assistant service, according to an embodiment.
FIG. 7 is a diagram for describing a method by which a server determines, using an attack response module, an operation for defending against the generation of a voice attack, according to an embodiment.
FIG. 8 is a diagram for describing another embodiment in which a server operates to detect the generation of a voice attack on a device.
FIG. 9 is a diagram for describing a method by which a server receives input audio signals into a database and builds up the plurality of pre-stored audio signals, according to an embodiment.
FIG. 10 is a flowchart for describing a method by which a server detects the generation of a voice attack and, based on the detection result, performs a device control operation or an operation for defending against the generation of the voice attack, according to an embodiment.
FIG. 11 is a block diagram of a client device according to an embodiment.

Terms used in this specification are briefly described first, and the present invention is then described in detail.

The terms used in the present invention have been selected, as far as possible, from general terms that are currently in wide use, taking into account their functions in the present invention. However, these terms may vary depending on the intention of those skilled in the art, precedent, the emergence of new technology, and the like. In certain cases a term has been arbitrarily selected by the applicant, and in such cases its meaning is described in detail in the corresponding part of the description. Therefore, a term used in the present invention should be defined based on the meaning of the term and the overall content of the present invention, not simply on the name of the term.

A singular expression may include the plural unless the context clearly indicates otherwise. Terms used herein, including technical and scientific terms, may have the same meanings as commonly understood by a person of ordinary skill in the technical field described in this specification. In addition, terms including ordinal numbers, such as "first" and "second," may be used herein to describe various components, but the components should not be limited by these terms. These terms are used only to distinguish one component from another.

Throughout the specification, when a part is said to "include" a component, this means that the part may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are denoted by similar reference numerals throughout the specification.

In the disclosed embodiments, a voice command refers to an audio signal that contains a command for controlling at least one of a client device and a peripheral device. A voice command may include a call command that calls the client device and a command for controlling at least one of the client device and a peripheral device.

In the disclosed embodiments, a voice attack refers to an adversarial audio signal that a client device recognizes as a voice command. That is, a voice attack is not an audio signal produced by a general user who uses the voice assistant service normally, but it may nevertheless be recognized by the client device as a voice command. A voice attack may also be hidden in an audio file, a video file, or the like. When a voice attack hidden in such a file is played in the environment of a general user who intends to use the voice assistant service, that user's client device may recognize the hidden voice attack as the user's voice command and operate incorrectly.

In the disclosed embodiments, an adversarial user refers to a user who inputs and modifies voice attacks using the adversarial user's own client device. In his or her own voice assistant service environment, the adversarial user may repeatedly input and modify voice attacks that are recognized as voice commands until a voice attack is recognized as the voice command the adversarial user wants. Once the client device recognizes a voice attack as the desired voice command, the adversarial user may take that completed attack audio as the voice attack and hide it in an audio file, a video file, or the like, as described above. When the hidden voice attack is played by, for example, the client device of a general user, who is a different user, the general user's client device may mistakenly recognize the hidden voice attack as a voice command.

The voice attack generation phase refers to the operations, performed by an adversarial user, in which the input and modification of a voice attack are repeated. Because the adversarial user repeatedly modifies the voice attack until it is recognized as the desired voice command, the server according to the disclosed embodiments can detect the voice attack generation phase, that is, these repeated input and modification operations.

In the disclosed embodiments, the client device and the server may recognize an adversarial user's voice attack as a voice command, and therefore may be unable to tell whether an incoming voice command is a voice attack by an adversarial user or a voice command from a general user of the voice assistant service.

Consequently, until the server performs the operations according to the disclosed embodiments and detects the voice attack generation phase, it cannot be known whether a signal input to the client device is an adversarial user's voice attack recognized as a voice command or a voice command from a general user who is using the voice assistant service normally. For this reason, in the embodiments described below, a signal input to the client device is referred to as an input audio signal.

By performing the operations for detecting the voice attack generation phase according to the disclosed embodiments, the server may determine whether the audio signal input to the client device is a signal for attacking the voice assistant service.

FIG. 1A is a diagram for describing a voice attack generation phase and an operation in which a server detects the voice attack generation phase, according to an embodiment.

Referring to FIG. 1A, a server 2000 according to an embodiment may be a server that provides a voice assistant service. The server 2000 may detect a voice attack generation phase 100, which consists of a series of steps that an adversarial user 150 attempts in order to generate a completed attack audio.

To generate a completed attack audio, the adversarial user 150 outputs a number of attack audios while modifying them little by little, and checks the texts obtained when the server 2000 performs Automatic Speech Recognition (ASR) on them. Specifically, so that the completed attack audio is transcribed into the text the adversarial user 150 wants when the server 2000 converts it to text by ASR, the adversarial user 150 may repeat the steps of outputting an attack audio, checking the text converted from it, and modifying the attack audio and outputting it again.

For example, the adversarial user 150 may generate an initial attack audio in order to produce a completed attack audio that the server 2000 transcribes into the text "Hi Bixby, open the front door."

The adversarial user 150 may transmit the initial attack audio to the server 2000 via a client device, and may check the text converted from the initial attack audio that the server 2000 sends back to the client device. If the text converted from the initial attack audio is not "Hi Bixby, open the front door," the adversarial user 150 may modify the initial attack audio to generate a first attack audio 101. In this case, the waveform of the first attack audio 101 is similar to that of the initial attack audio, but the text converted from the first attack audio 101 may differ from the text converted from the initial attack audio. In the same way, if the text converted from the first attack audio 101 is not the text the adversarial user 150 wants, the adversarial user 150 may modify the first attack audio 101 to generate a second attack audio 102. By repeating this series of steps, the adversarial user 150 may generate a completed attack audio that is transcribed into the text "Hi Bixby, open the front door."

The server 2000 according to an embodiment may detect the voice attack generation phase 100 before the adversarial user 150 has generated a completed attack audio. Specifically, when a plurality of similar audio signals are input, the server 2000 may determine, based on the degree to which the texts converted from those similar audio signals differ from one another, that the inputs belong to a voice attack generation phase 100 attempted by the adversarial user 150.

FIG. 1B is a flowchart describing in detail the voice attack generation phase of FIG. 1A that is detected by a server according to an embodiment.

In step S110, the adversarial user 150 may output an initial attack audio through a speaker or the like. The initial attack audio may include a call command (wake-up command) for calling the client device 1000 and a control command for controlling at least one of the client device 1000 and peripheral devices. When the initial attack audio is input to the client device 1000, the client device 1000 may recognize it as a voice command and be activated.

In step S120, the client device 1000 may recognize the attack audio as a voice command, and a function for providing the voice assistant service for the recognized voice command may be activated. The client device 1000 cannot tell whether the user inputting a voice command is the adversarial user 150 or a general user who wants to use the voice assistant service. Because the client device 1000 recognizes the attack audio of the adversarial user 150 as a voice command, it may transmit an audio signal representing the recognized voice command to the server 2000 and request a response to the voice command from the server 2000. In addition, the audio signal representing the voice command that is transmitted from the client device 1000 to the server 2000 may be stored in an audio signal database of the server 2000, forming the database of pre-stored audio signals.
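One way to picture the audio signal database described in step S120 is a per-device history in which each incoming signal is compared only against signals received earlier from the same device. The following is a minimal in-memory sketch under that assumption; `AudioSignalStore`, `StoredSignal`, and the method names are hypothetical and do not appear in the specification.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSignal:
    received_at: float  # arrival time in seconds (assumed representation)
    samples: tuple      # audio samples of the received signal


class AudioSignalStore:
    """In-memory stand-in for the server's audio signal database."""

    def __init__(self):
        self._by_device = defaultdict(list)

    def history(self, device_id):
        """Signals previously received from this device."""
        return list(self._by_device[device_id])

    def receive(self, device_id, received_at, samples):
        """Return the earlier signals for this device, then store the new one."""
        earlier = self.history(device_id)
        self._by_device[device_id].append(
            StoredSignal(received_at, tuple(samples)))
        return earlier
```

Returning the earlier signals before storing the new one keeps the input audio signal out of its own comparison set, which matches the requirement that the pre-stored signals were provided before the input was received.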

In step S130, the server 2000 according to an embodiment may perform ASR on the received audio signal to convert it into text, and may analyze the converted text using a Natural Language Understanding (NLU) model. Based on the analysis result, the server 2000 may generate control information for controlling the client device 1000 or peripheral devices, and may generate a response to the voice command that includes the generated control information and the text converted from the received audio signal. The server 2000 may provide the response to the voice command to the client device 1000.
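The flow of step S130 (ASR, then NLU, then a response carrying both the control information and the transcript) can be sketched as follows. Everything here is a hypothetical stand-in: `toy_nlu` is a trivial keyword rule rather than a real NLU model, the `Response` fields are illustrative, and the `asr` argument is any callable that maps audio to text.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Response:
    text: str                # transcript returned to the client device
    control: Optional[dict]  # None when no control information was generated


def toy_nlu(text):
    """Hypothetical rule-based stand-in for an NLU model."""
    lowered = text.lower()
    if "open" in lowered and "front door" in lowered:
        return {"target": "front_door", "action": "open"}
    return None


def build_response(audio, asr: Callable, nlu: Callable = toy_nlu) -> Response:
    text = asr(audio)    # ASR: audio signal -> text
    control = nlu(text)  # NLU: text -> control information, or None
    return Response(text=text, control=control)
```

Because the response includes the transcript, whoever submitted the audio can read back how it was recognized, which is the feedback the voice attack generation phase relies on.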

In step S140, the client device 1000 may receive the response to the voice command from the server 2000 and perform an operation corresponding to the received response. However, although the client device 1000 recognized the initial attack audio as a voice command, the text that the server 2000 obtained by performing ASR on the corresponding audio signal may contain no clear semantic elements, so that no control information was generated, or control information not intended by the adversarial user 150 may have been generated. Accordingly, the client device 1000 may not operate, or may perform an operation that the adversarial user 150 did not intend.

In step S150, the adversarial user 150 may obtain, from the client device 1000, the text converted from the attack audio that is included in the received response. The adversarial user 150 may then check the converted text to confirm whether the desired text (for example, "Hi Bixby, open the front door") has been obtained.

If the text desired by the adversarial user 150 has not been obtained, the adversarial user 150 may modify the initial attack audio to generate a modified attack audio.

In step S160, the adversarial user 150 may output the modified attack audio (for example, the first attack audio 101 of FIG. 1A).

As described above, the adversarial user 150 may carry out the voice attack generation phase 100, repeating steps S120 to S160 until a completed attack audio is obtained.

If the server 2000 fails to detect the voice attack generation phase 100, the voice attack may be completed, and the completed voice attack may be hidden in an audio file, a video file, or the like. When the hidden voice attack is input to a general user's client device, it is recognized as the general user's voice command, so the general user's client device, peripheral devices, and the like may be controlled.

The server 2000 according to an embodiment may detect the voice attack generation phase 100. Specifically, the server 2000 may determine whether an input audio signal that is input during the voice attack generation phase 100 is an audio signal for attacking the voice assistant service. When the server 2000 determines that the input audio signal was input by the adversarial user 150 as part of generating a voice attack, the server 2000 may perform an operation for defending against the generation of the voice attack. The specific method by which the server 2000 detects the voice attack generation phase is described in detail in the description of FIGS. 3 to 8.

도 2는 일 실시예에 따른 서버의 구성을 도시한 블록도이다.2 is a block diagram illustrating a configuration of a server according to an embodiment.

Referring to FIG. 2, the server 2000 may include a communication interface 2100, a processor 2200, and a storage unit 2300.

The communication interface 2100 may perform data communication with the client device 1000 under the control of the processor 2200. The communication interface 2100 may also perform data communication with other peripheral devices 3000 in addition to the client device 1000.

The communication interface 2100 may perform data communication with the client device 1000 or the other peripheral devices 3000 using at least one data communication method including, for example, wired LAN, wireless LAN, Wi-Fi, Bluetooth, Zigbee, Wi-Fi Direct (WFD), infrared communication (IrDA, Infrared Data Association), Bluetooth Low Energy (BLE), Near Field Communication (NFC), Wireless Broadband Internet (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), Shared Wireless Access Protocol (SWAP), Wireless Gigabit Alliance (WiGig), and RF communication.

The communication interface 2100 according to an embodiment may receive, from the client device 1000, a voice command and a request for a response to the voice command in order to control the client device 1000 or the peripheral devices 3000, and may transmit the response to the voice command to the client device 1000.

The processor 2200 may execute one or more instructions of a program stored in the storage unit 2300. The processor 2200 may be composed of hardware components that perform arithmetic, logic, and input/output operations and signal processing.

The processor 2200 may be composed of, for example, at least one of a central processing unit (CPU), a microprocessor, a graphics processing unit (GPU), application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), an application processor (AP), a neural processing unit (NPU), or a dedicated artificial intelligence processor designed with a hardware structure specialized for processing artificial intelligence models, but is not limited thereto.

The processor 2200 according to an embodiment may perform automatic speech recognition (ASR) on an audio signal representing a received voice command to convert the audio signal into text. The processor 2200 may also analyze the texts converted from the received audio signal to determine whether the received audio signal is an audio signal input by a user. In addition, the processor 2200 may analyze the texts converted from the received audio signal using a natural language understanding (NLU) model, generate control information for controlling the client device 1000 or the peripheral devices 3000 based on the analysis result, and transmit the control information to the client device 1000.

The storage unit 2300 may include a non-volatile memory including at least one of, for example, a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk, and a volatile memory such as random access memory (RAM) or static random access memory (SRAM).

The storage unit 2300 may store instructions, data structures, and program code readable by the processor 2200. In the disclosed embodiments, operations performed by the processor 2200 may be implemented by executing instructions or code of a program stored in the storage unit 2300. The storage unit 2300 may also store data and program instruction code corresponding to the attack management module 2310, the voice assistant module 2320, and the database 2330.

In an embodiment, the processor 2200 may use the data and instruction code of the attack management module 2310 to determine whether an input audio signal is an audio signal for an attack on the voice assistant service and, when the input audio signal corresponds to such an audio signal, perform an operation for defending against the generation of a voice attack.

The attack management module 2310 according to an embodiment may include an audio comparison module 2311, a text comparison module 2312, an attack determination module 2313, and an attack response module 2314.

Using the audio comparison module 2311, the processor 2200 may compare the input audio signal with the user's audio signals pre-stored in the database 2330 and identify, among the pre-stored audio signals, the user's similar audio signals, that is, those that are similar to the input audio signal.

Using the text comparison module 2312, the processor 2200 may compare the text converted from the input audio signal with the texts converted from the user's plurality of audio signals pre-stored in the database 2330 and calculate difference values between the texts.

Using the attack determination module 2313, the processor 2200 may determine whether the input audio signal is an audio signal for an attack on the voice assistant service based on the difference values between the texts calculated by the text comparison module 2312. Specifically, when similar audio signals are input a predetermined number of times or more during a predetermined period, a predetermined number or more of similar audio signals identified by the audio comparison module 2311 may exist. In addition, the texts converted from the identified similar audio signals may differ from one another, so that the degree of change in the text converted from the similar audio signals is greater than or equal to a predetermined value. When similar audio signals have been input a predetermined number of times or more and the variation among the texts converted from the similar audio signals is identified as being greater than or equal to a predetermined value, the processor 2200 may use the attack determination module 2313 to determine that the voice attack generation step 100 by the hostile user 150 is in progress. In this case, the processor 2200 may determine that the current input audio signal is an audio signal belonging to the voice attack generation step 100 performed by the hostile user 150.

Using the attack response module 2314, the processor 2200 may perform an operation for defending against the generation of a voice attack when the input audio signal is determined to be an audio signal for an attack on the voice assistant service.

Specific operations in which the processor 2200 uses each of the modules included in the attack management module 2310 to detect and respond to the voice attack generation step will be described later.

In an embodiment, the processor 2200 may use the data and instruction code of the voice assistant module 2320 to convert an input audio signal into text and analyze the converted text to generate data and instruction code for device control. In this case, the controlled devices may be the client device 1000 and the peripheral devices 3000.

The voice assistant module 2320 may include an ASR model 2321, an NLU model 2322, an NLG model 2323, and a device control information generation module 2324.

The processor 2200 may perform ASR using the automatic speech recognition (ASR) model 2321 to convert the audio signal received from the client device 1000 into input text. The ASR model 2321 may include an acoustic model (AM), a language model (LM), or the like, but is not limited thereto.

The processor 2200 may use the NLU model 2322 to interpret the text obtained through the ASR model 2321 and thereby obtain the semantic elements in the text.

The processor 2200 may use the NLG model 2323 to generate and edit utterance data related to functions of the client device 1000 and the peripheral devices 3000, thereby generating natural language for conversing with the user.

The processor 2200 may use the device control information generation module 2324 to generate control information for controlling the client device 1000 and the peripheral devices 3000 based on the semantic elements in the text obtained by the NLU model 2322.

The database 2330 included in the storage unit 2300 according to an embodiment may include a user audio signal database 2331, in which the user's audio signals are stored either as they are or as data converted using various pre-processing methods, and a user text database 2332, in which texts converted from the user's audio are stored. When a user's audio signal is input, the data stored in each database may be stored in the form of at least one of the audio signal, data converted from the audio signal, and text converted from the audio signal.

FIG. 3 is a flowchart illustrating a method by which a server operates to detect a voice attack on a device according to an embodiment.

In describing FIG. 3, components that are the same as those in FIGS. 1A, 1B, and 2 are described using the same reference numerals.

In an embodiment, the input audio signal may be a voice command signal uttered by a general user who wants to use the voice assistant service and received by the client device 1000. Alternatively, the input audio signal may be adversarial audio that was generated by the hostile user 150 attempting to create a voice attack, output through a speaker or the like, and recognized as a voice command by the client device 1000. When there are a plurality of users, the server 2000 according to the disclosed embodiment can distinguish the users from one another. However, until it has determined whether the input audio signal is an audio signal for a voice attack, the server 2000 cannot know whether a given user is a general user or the hostile user 150. For this reason, the input audio signal is hereinafter referred to as the user's input audio signal.

In step S310, the server 2000 may receive the user's input audio signal from the client device 1000. The input audio signal may include a wake-up command for calling the client device 1000 and a control command for controlling at least one of the client device 1000 and the peripheral devices 3000, but is not limited thereto.

The server 2000 according to an embodiment may receive the input audio signal and perform ASR to convert the received audio signal into computer-readable text.

In step S320, the server 2000 according to an embodiment may compare the received input audio signal with each of a plurality of audio signals pre-stored in the database 2330 in the server.

The plurality of audio signals pre-stored in the database 2330 may have been generated from audio inputs by a user of the client device 1000. When there are a plurality of users, the pre-stored audio signals may be stored separately for each user. In that case, the input audio signal is compared only with the pre-stored audio signals of the user to whom the input audio signal corresponds. For example, when the input audio signal is an audio signal of the hostile user 150, the pre-stored audio signals compared with it may be stored audio signals of the hostile user 150.

In addition, the amount of audio signals that are stored may be controlled using an adjustment parameter when the audio signals are stored. The adjustment parameter is a parameter for efficiently managing the data of the database 2330 and may include the period for which an audio signal is stored, the data format in which the audio signal is stored, and the like, but is not limited thereto.

For example, the plurality of audio signals pre-stored in the database 2330 may be the audio signals of commands that the user issued to the client device 1000 over a period of one week. When the client device 1000 has a plurality of users, the pre-stored audio signals may be stored separately for each user, with a different adjustment parameter applied to each user. The pre-stored audio signals may also be stored in a converted data form rather than in their original form, by applying a predetermined algorithm (e.g., normalization).
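The retention behavior described above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical `StoredSignal` record and a hypothetical per-user mapping `retention_by_user`; the disclosure does not specify a data layout, and the one-week default simply mirrors the example in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredSignal:
    """Hypothetical record for one pre-stored audio signal."""
    user_id: str
    received_at: datetime
    features: list  # converted (e.g., normalized) data rather than the raw waveform


def prune_expired(signals, now, retention_by_user, default_retention=timedelta(days=7)):
    """Keep only the signals that are still inside their user's storage period."""
    kept = []
    for signal in signals:
        # Each user may have a different adjustment parameter (storage period).
        retention = retention_by_user.get(signal.user_id, default_retention)
        if now - signal.received_at <= retention:
            kept.append(signal)
    return kept
```

A shorter storage period reduces the number of stored signals that each new input must be compared against, at the cost of a shorter detection window.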

To determine whether audio signals similar to the received input audio signal have been input a predetermined number of times or more during a predetermined period, the server 2000 according to an embodiment may compare the received input audio signal with each of the plurality of audio signals pre-stored in the database 2330 and calculate the similarity between the input audio signal and each pre-stored audio signal. A specific method by which the server 2000 compares the received input audio signal with the plurality of audio signals pre-stored in the database 2330 will be described in detail with reference to FIGS. 4A and 4B.

In step S330, the server 2000 according to an embodiment may identify, among the plurality of pre-stored audio signals, audio signals similar to the input audio signal based on the result of comparing the input audio signal with the pre-stored audio signals. Specifically, the server 2000 may identify, as similar audio signals, those pre-stored audio signals whose similarity to the input audio signal is greater than or equal to a predetermined threshold.

The existence of identified similar audio signals means that audio signals similar to the input audio signal were received by the server 2000 through the client device 1000 and stored before the current input audio signal was input. Accordingly, the server 2000 can identify how many audio signals similar to the input audio signal were input and stored in the server 2000 before the input audio signal was received.

In addition, the threshold that the server 2000 according to an embodiment uses as the criterion for identifying similar audio signals among the plurality of pre-stored audio signals may be changed based on at least one of the number of similar audio signals, the period over which the pre-stored audio signals were stored, and the data format in which the pre-stored audio signals are stored.

In addition, when the number of pre-stored audio signals whose similarity to the input audio signal is greater than or equal to the predetermined threshold exceeds a reference value N, the server 2000 may identify, as the similar audio signals, only the N signals with the highest similarity to the input audio signal.
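A minimal sketch of the selection in step S330 is shown below. It assumes that each audio signal has already been reduced to a numeric feature vector and uses cosine similarity as the similarity measure; both are assumptions made for illustration, since the disclosure defers the actual comparison method to FIGS. 4A and 4B. The names `select_similar`, `threshold`, and `max_count` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(input_features, stored_features, threshold, max_count):
    """Return indices of stored signals at or above the threshold, keeping at most
    the max_count (N) most similar ones, ordered from most to least similar."""
    scored = [
        (cosine_similarity(input_features, features), index)
        for index, features in enumerate(stored_features)
    ]
    candidates = [(score, index) for score, index in scored if score >= threshold]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in candidates[:max_count]]
```

Capping the result at N bounds the number of text comparisons performed in the later steps, regardless of how many stored signals pass the threshold.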

In step S340, the server 2000 according to an embodiment may perform automatic speech recognition (ASR) to convert the received input audio signal into a first text. Specifically, the server 2000 may perform ASR, which converts an audio signal into computer-readable text, using a predefined model such as an acoustic model (AM) or a language model (LM). When the server 2000 receives from the client device an acoustic signal from which noise has not been removed, it may remove the noise from the received acoustic signal to obtain an audio signal and then perform ASR on that audio signal.

The server 2000 according to an embodiment may also obtain second texts, which are the texts converted from the similar audio signals.

In an embodiment, the second texts may be obtained by the server 2000 performing ASR on the similar audio signals identified in step S330, thereby converting the identified similar audio signals into texts.

In another embodiment, when the plurality of audio signals are stored in the database 2330, the server 2000 may perform ASR on the audio signals being stored, and third texts converted from the plurality of audio signals may be stored together with them. In this case, the second texts may be obtained by the server 2000 selecting, from among the third texts, the texts corresponding to the identified similar audio signals.

Once the first text converted from the input audio signal and the second texts converted from the similar audio signals have been obtained, the server 2000 according to an embodiment may perform step S350.

In step S350, the server 2000 according to an embodiment may compare the text converted from the input audio signal with the texts converted from the similar audio signals in order to determine whether the texts converted from these audio signals vary.

In an embodiment, the server 2000 may compare the first text converted from the received input audio signal with the second texts converted from the similar audio signals. Specifically, the server 2000 may compare the first text with each of the second texts to calculate the similarity between the first text and each second text. For example, the server 2000 may calculate the similarity by computing difference values that indicate the difference between the first text and each second text, based on at least one of character-level differences, word-level differences, and differences in phonetic representation.
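The character-level and word-level differences mentioned above could be computed, for example, with a normalized edit distance. The sketch below assumes Levenshtein distance as the difference measure, which the disclosure does not mandate; the function names are hypothetical, and the phonetic-representation difference is omitted because it would require a pronunciation dictionary.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def text_difference(first, second, unit="char"):
    """Difference value in [0, 1]; 0 means identical at the chosen unit."""
    a = first.split() if unit == "word" else list(first)
    b = second.split() if unit == "word" else list(second)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


def difference_values(first_text, second_texts, unit="char"):
    """One difference value per second text, compared against the first text."""
    return [text_difference(first_text, text, unit) for text in second_texts]
```

For instance, comparing "turn on the light" with "turn off the light" at the word level yields one substitution out of four words, a difference value of 0.25.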

A specific method by which the server 2000 compares the first text with the second texts will be described in detail with reference to FIG. 5.

In step S360, the server 2000 according to an embodiment may determine whether the input audio signal is an audio signal for an attack on the voice assistant service based on the result of comparing the first text with the second texts. For example, a predetermined number or more of pre-stored audio signals may have been identified as similar to the input audio signal, and yet the text converted from the input audio signal may have changed, so that the first text differs from the second texts. This may mean that the input audio signal belongs to the voice attack generation step described above with reference to FIGS. 1A and 1B, in which the hostile user 150 repeatedly modulates and inputs similar audio signals while checking the texts converted from them.

The server 2000 according to an embodiment may determine whether the user's input audio signal is an audio signal for an attack on the voice assistant service based on the degree to which the text converted from the input audio signal differs from the texts converted from the pre-stored similar audio signals. In other words, the server 2000 may determine whether the user who input the audio signal is the hostile user 150.

Specifically, among the difference values indicating the difference between the first text and each of the second texts, the server 2000 may identify the values that are greater than or equal to a first threshold. When the number of difference values that are greater than or equal to the first threshold is itself greater than or equal to a second threshold, the server 2000 may determine that the input audio signal is an audio signal for an attack on the voice assistant service.
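The two-threshold rule above can be written as a short sketch. The function and parameter names are hypothetical; the difference values are assumed to have been computed already, one per second text.

```python
def is_attack_signal(difference_values, first_threshold, second_threshold):
    """Apply the two-threshold rule to the per-text difference values.

    first_threshold: minimum difference for a text pair to count as "changed".
    second_threshold: minimum number of "changed" pairs needed to flag an attack.
    """
    large_differences = [d for d in difference_values if d >= first_threshold]
    return len(large_differences) >= second_threshold
```

With difference values `[0.0, 0.5, 0.6, 0.1]`, a first threshold of 0.4, and a second threshold of 2, two values qualify, so the input would be flagged.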

In addition, the first threshold and the second threshold may be set according to the number of identified similar audio signals. For example, when the number of identified similar audio signals is greater than a predetermined reference value A, many difference values exceeding the first threshold may be identified, which can reduce computational efficiency. In this case, the first threshold may be set to a value higher than its current value. As another example, when the number of identified similar audio signals is less than a predetermined reference value B, the number of difference values that are greater than or equal to the first threshold may remain below the second threshold even though such values exist. In this case, the second threshold may be set to a value lower than its current value so that a determination can be made quickly.
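One possible reading of this adjustment is sketched below. The step sizes, the cap of 1.0 on the first threshold (which assumes difference values normalized to [0, 1]), the floor of 1 on the second threshold, and the assumption that reference value B does not exceed reference value A are all illustrative choices that the disclosure does not specify.

```python
def adjust_thresholds(num_similar, first_threshold, second_threshold,
                      reference_a, reference_b,
                      first_step=0.1, second_step=1):
    """Return (first_threshold, second_threshold) adjusted to the number of
    identified similar audio signals. Assumes reference_b <= reference_a."""
    if num_similar > reference_a:
        # Many similar signals: raise the first threshold so fewer pairs qualify.
        first_threshold = min(1.0, first_threshold + first_step)
    elif num_similar < reference_b:
        # Few similar signals: lower the second threshold to decide sooner.
        second_threshold = max(1, second_threshold - second_step)
    return first_threshold, second_threshold
```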

In addition, when determining whether the input audio signal is an audio signal for an attack on the voice assistant service based on the differences between the texts, the server 2000 may further use the result of comparing the input audio signal with the plurality of pre-stored audio signals.

In addition, when determining whether the input audio signal is an audio signal for an attack on the voice assistant service based on the differences between the texts, the server 2000 may further use difference values among the second texts themselves. For example, the server 2000 may compare the second texts converted from the identified similar audio signals with one another and calculate the difference values between them, so that the degree to which the texts converted from the similar audio signals vary is also taken into account in the determination.
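The variation among the second texts could be summarized, for instance, as the mean pairwise difference. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in difference measure and averages over all pairs; both the measure and the averaging are assumptions for illustration, not the method of the disclosure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_variation(texts):
    """Mean pairwise difference (1 - similarity ratio) among the given texts.

    Returns 0.0 when there are fewer than two texts to compare.
    """
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    differences = [1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(differences) / len(differences)
```

A benign user repeating the same command would be expected to yield a value near zero, whereas transcripts that drift between attempts would yield a larger value.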

When the input audio signal is determined to be an audio signal for attacking the voice assistant service, the server 2000 according to an embodiment may determine a response operation for defending against the generation of a voice attack. This is described in detail with reference to FIG. 7.

FIGS. 4A and 4B are diagrams for describing a method by which a server according to an embodiment compares an input audio signal with a plurality of pre-stored audio signals by using an audio comparison module.

In describing FIGS. 4A and 4B, components identical to those in FIGS. 1A, 1B, and 2 are described using the same reference numerals.

The server 2000 according to an embodiment may include an audio comparison module 2311 within the attack management module 2310. The audio comparison module 2311 may include an audio feature extractor 420 and an audio similarity calculator 430.

In an embodiment, the audio comparison module 2311 may compare an input audio signal 410 with a plurality of audio signals pre-stored in a user audio signal database 2331.

Referring to FIG. 4A, the input audio signal 410 obtained by the server 2000 may be input to the audio comparison module 2311.

In an embodiment, the input audio signal 410 may be adversarial audio that was generated by the hostile user 150 and output from a speaker, and that was transmitted to the server after the client device 1000 recognized it as an audio signal representing a voice command.

Specifically, when the adversarial audio of the hostile user 150 includes a voice command that invokes the client device 1000 (for example, "Hi Bixby" or "Hi Bixby, open the front door"), the client device 1000 may recognize the adversarial audio containing the invocation command as the input audio signal 410 and transmit it to the server 2000. The server 2000 may input the received input audio signal 410 to the audio comparison module 2311.

When the input audio signal 410 is input, the audio comparison module 2311 according to an embodiment may load the plurality of audio signals pre-stored in the user audio signal database 2331. The pre-stored audio signals may be audio signals that were input by a user over a predetermined period and then stored. When there are a plurality of users, the pre-stored audio signals may be stored separately for each user. When the pre-stored audio signals are stored in the server 2000 separately for each user, the input audio signal is compared with the pre-stored audio signals of the user who input it. For example, when the input audio signal is an audio signal of the hostile user 150, the input audio signal may be compared with the plurality of pre-stored audio signals of the hostile user 150.

The audio comparison module 2311 may compare the input audio signal with each of the plurality of pre-stored audio signals. In this case, the input audio signal and the pre-stored audio signals may be analog audio signals, and the comparison may be performed in the time domain or in the frequency domain.

The audio comparison module 2311 may compare the waveform of the input audio signal with the waveform of each of the pre-stored audio signals by using the audio similarity calculator 430. For example, the audio comparison module 2311 may convert the input audio signal and the pre-stored audio signals into digital signals, perform sampling and quantization, and then calculate the similarity between the waveform of the input audio signal and the waveforms of the pre-stored audio signals.

In addition, the audio comparison module 2311 may preprocess the input audio signal by using the audio feature extractor 420 and extract feature vectors from the input audio signal and the pre-stored audio signals. The audio comparison module 2311 may then compare the extracted feature vectors by using the audio similarity calculator 430 to calculate the similarity between the input audio signal and the pre-stored audio signals.

In an embodiment, the audio feature extractor 420 and the audio similarity calculator 430 may use various methods to extract feature vectors from audio signals and compare them. For example, correlation coefficient analysis, cepstrum analysis, or Mel-frequency cepstral coefficient (MFCC) analysis may be used, but the methods are not limited thereto.

The audio comparison module 2311 according to an embodiment may compare the input audio signal with each of the plurality of audio signals pre-stored in the user audio signal database 2331 to calculate similarity values, each indicating how the input audio signal compares with one pre-stored audio signal, and may identify as similar audio signals those pre-stored audio signals whose similarity value is equal to or greater than a predetermined threshold.

In addition, the audio comparison module 2311 may group the identified similar audio signals to generate a similar audio signal set 440. In this case, when more than a predetermined number N of similar audio signals are identified, the audio comparison module 2311 may select the N similar audio signals with the highest similarity to generate the similar audio signal set 440. The audio comparison module 2311 may output the generated similar audio signal set 440 and the input audio signal 410.
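As a minimal sketch of how the similar audio signal set 440 might be assembled, the following assumes that a similarity score has already been calculated for each pre-stored signal and that a higher score means a closer match. The function name `build_similar_set` and the dictionary layout are hypothetical.

```python
import heapq


def build_similar_set(similarities, threshold, max_size):
    """Pick the signals that pass the threshold, capped at the top `max_size`.

    similarities: mapping from a stored-signal identifier to its similarity
    with the input audio signal (hypothetical layout).
    """
    candidates = [(sid, score) for sid, score in similarities.items()
                  if score >= threshold]
    if len(candidates) > max_size:
        # More than N candidates: keep only the N most similar ones.
        candidates = heapq.nlargest(max_size, candidates, key=lambda p: p[1])
    # Return identifiers ordered from most to least similar.
    ordered = sorted(candidates, key=lambda p: p[1], reverse=True)
    return [sid for sid, _ in ordered]
```

Capping the set at N bounds the number of speech-to-text conversions and text comparisons performed in the later steps.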

In another embodiment, for computational efficiency, the audio comparison module 2311 may convert the audio signals into another data form and compare the converted data to calculate the similarity between the input audio signal and the pre-stored audio signals.

Referring to FIG. 4B, a plurality of audio data may be stored in the user audio signal database 2331. In this case, the pre-stored audio data may be data converted from a plurality of audio signals representing user audio signals.

For example, when the plurality of audio signals are stored in the user audio signal database 2331, a predetermined data processing method may be applied to the audio signals, and the resulting converted user audio data may be stored together with the user audio signals.

In another example, when the plurality of audio signals are input, the audio signals themselves may not be stored; instead, the user audio data converted by applying a predetermined data processing method to the audio signals may be stored in place of the user audio signals.

The pre-stored audio data according to an embodiment may be data converted by using at least one of normalization, locality-preserving hashing (LPH), and clustering. However, the data forms of the pre-stored audio data are not limited to these examples, and the pre-stored audio data may be stored in various data forms to reduce computational complexity.

Input audio data 450 according to an embodiment may be input to the audio comparison module 2311. In this case, the predetermined data processing method applied to the input audio signal 410 to obtain the input audio data 450 may correspond to the data processing method applied to the plurality of audio data pre-stored in the user audio signal database 2331.

The audio comparison module 2311 may use the audio similarity calculator 430 to compare the input audio data with each of the plurality of pre-stored audio data and thereby identify similar audio data. When there are a plurality of users, the pre-stored audio data may be stored separately for each user. When the pre-stored audio data are stored in the server 2000 separately for each user, the input audio data is compared with the pre-stored audio data of the user who input it. For example, when the input audio data originates from the hostile user 150, it may be compared with the plurality of pre-stored audio data of the hostile user 150.

In an embodiment, the input audio data 450 and the pre-stored audio data may be normalized waveform data. The audio comparison module 2311 may apply a peak alignment algorithm to align the peaks of the normalized waveform data, and may measure the cross-correlation of the waveforms adjusted so that their peaks coincide, thereby calculating the similarity between the input audio data 450 and the pre-stored audio data. Based on the similarity calculation result, the audio comparison module 2311 may identify similar audio data that are similar to the input audio data.
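The peak alignment and cross-correlation steps can be sketched as follows, under simplifying assumptions: each waveform is a non-empty list of samples, normalization means scaling by the largest absolute sample, alignment means shifting so that the largest-magnitude samples coincide, and similarity is the normalized correlation over the overlapping region. The description does not fix any of these details, and all function names are hypothetical.

```python
import math


def normalize(samples):
    """Scale a waveform so that its largest absolute sample is 1."""
    peak = max(abs(v) for v in samples)
    return [v / peak for v in samples] if peak else list(samples)


def peak_index(samples):
    """Index of the sample with the largest magnitude."""
    return max(range(len(samples)), key=lambda i: abs(samples[i]))


def aligned_correlation(a, b):
    """Normalized cross-correlation of a and b after aligning their peaks."""
    a, b = normalize(a), normalize(b)
    shift = peak_index(a) - peak_index(b)  # a[i] is paired with b[i - shift]
    start = max(0, shift)
    end = min(len(a), len(b) + shift)
    xs = a[start:end]
    ys = b[start - shift:end - shift]
    num = sum(x * y for x, y in zip(xs, ys))
    den = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
    return num / den if den else 0.0
```

Under this sketch a waveform and a time-shifted copy of it score close to 1, and a waveform and its inverted copy score close to -1.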

In another embodiment, the input audio data 450 and the pre-stored audio data may be hash values obtained by applying a locality-preserving hashing (LPH) algorithm. The audio comparison module 2311 may compare the hash value representing the input audio data 450 with the hash values representing the pre-stored audio data to calculate the similarity between the input audio data 450 and the pre-stored audio data. Based on the similarity calculation result, the audio comparison module 2311 may identify similar audio data that are similar to the input audio data.
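The description does not specify a particular hashing scheme. As an illustration only, the sketch below substitutes a random-hyperplane sign hash (a common locality-sensitive hash, not necessarily the LPH algorithm intended here) and scores similarity as the fraction of matching hash bits. The feature vectors, the bit count, and all function names are assumptions.

```python
import random


def make_planes(dim, bits, seed=0):
    """Create `bits` random hyperplanes in `dim` dimensions (fixed seed)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def hash_features(features, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    value = 0
    for plane in planes:
        dot = sum(f * p for f, p in zip(features, plane))
        value = (value << 1) | (1 if dot >= 0 else 0)
    return value


def hash_similarity(h1, h2, bits):
    """Fraction of hash bits on which two hash values agree."""
    differing = bin(h1 ^ h2).count("1")
    return 1.0 - differing / bits
```

Because nearby vectors tend to fall on the same side of most hyperplanes, comparing two short integers stands in for comparing two full waveforms, which is the efficiency gain this embodiment is aiming at.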

In another embodiment, the input audio data 450 and the pre-stored audio data may be data to which a clustering algorithm has been applied so that similar items are grouped together. In this case, because items in the same cluster were assigned to that cluster after their similarity was calculated, the audio comparison module 2311 may apply the clustering algorithm to the input audio data 450 and identify the items in the cluster to which the input audio data 450 is assigned, thereby identifying the similar audio data.
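A minimal sketch of the cluster lookup follows, assuming the pre-stored audio data were already clustered into centroid-based clusters (for example by k-means) over some feature representation. The centroid list, the member mapping, and the function names are hypothetical.

```python
import math


def nearest_cluster(point, centroids):
    """Index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


def similar_by_cluster(input_point, centroids, members):
    """Return the stored items that share a cluster with the input.

    members: mapping from cluster index to the identifiers of the stored
    audio data assigned to that cluster (hypothetical layout).
    """
    cluster = nearest_cluster(input_point, centroids)
    return members.get(cluster, [])
```

The input is compared only against the cluster centroids rather than against every stored item, so the lookup cost depends on the number of clusters instead of the size of the database.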

The audio comparison module 2311 according to an embodiment may group the identified similar audio data to generate a similar audio data set 460. In this case, when more than a predetermined number N of similar audio data are identified, the audio comparison module 2311 may select the N similar audio data with the highest similarity to generate the similar audio data set 460. The audio comparison module 2311 may output the generated similar audio data set 460 and the input audio data 450. However, the output is not limited thereto, and the audio comparison module 2311 may also output the similar audio signal set 440 corresponding to the generated similar audio data set 460 together with the input audio signal 410 corresponding to the input audio data.

FIG. 5 is a diagram for describing a method by which a server according to an embodiment uses a text comparison module to compare a first text converted from an input audio signal with second texts converted from a plurality of pre-stored audio signals.

In describing FIG. 5, components identical to those in FIGS. 1A, 1B, 4A, and 4B are described using the same reference numerals.

The server 2000 according to an embodiment may include a text comparison module 2312 within the attack management module 2310. The text comparison module 2312 may include a text acquisition unit 510 and a text comparison unit 520.

Referring to FIG. 5, the input audio signal 410 and the similar audio signal set 440 may be input to the text comparison module 2312. However, the input is not limited thereto, and the input audio data 450 and the similar audio data set 460 shown in FIG. 4B may instead be input to the text comparison module 2312.

The text comparison module 2312 according to an embodiment may use the text acquisition unit 510 to obtain a first text 501 converted from the input audio signal and second texts 502 converted from the similar audio signals included in the similar audio signal set.

Specifically, the first text 501 may be obtained by inputting the input audio signal to an ASR model 2321 stored in the server and performing ASR, which converts the input audio signal into text. In this case, the text acquisition unit 510 may request that ASR be performed on the input audio signal and obtain the first text 501 converted from the input audio signal.

In addition, the second texts 502 converted from the similar audio signals included in the similar audio signal set may be obtained by inputting the similar audio signals to the ASR model 2321 stored in the server and performing ASR, which converts the similar audio signals into texts. In this case, the text acquisition unit 510 may request that ASR be performed on the similar audio signals and obtain the second texts 502 converted from the similar audio signals.

In addition, although not shown in FIG. 5, the second texts 502 obtained by the text acquisition unit 510 may be obtained from third texts pre-stored in a user text database 2332. Specifically, when the plurality of audio signals are stored in the user audio signal database 2331, ASR may be performed on the audio signals to convert them into text, and the third texts converted from the audio signals may be stored in the user text database 2332. In this case, the text acquisition unit 510 may obtain, from among the third texts pre-stored in the user text database 2332, the second texts 502, which are the texts corresponding to the similar audio signals.

The text comparison module 2312 according to an embodiment may use the text comparison unit 520 to compare the first text 501 converted from the input audio signal with the second texts 502 converted from the similar audio signals included in the similar audio signal set.

For example, the text comparison module 2312 may compare the first text 501 with each of the second texts 502 to calculate difference values 530 indicating the differences between the first text and the second texts. For instance, a first difference value 531 may be obtained by comparing the first text with a first similar-audio text included in the second texts, a second difference value 532 may be obtained by comparing the first text with a second similar-audio text included in the second texts, and a third difference value 533 may be obtained by comparing the first text with a third similar-audio text included in the second texts.

The text comparison module 2312 according to an embodiment may apply various algorithms to calculate the difference values 530 indicating the differences between the first text and the second texts.

For example, the difference values 530 between the first text and the second texts may be calculated by a character-wise edit distance algorithm, which calculates the difference values based on the differences between the characters included in the first text 501 and the characters included in each of the second texts 502.

In another example, the difference values 530 between the first text and the second texts may be calculated by a word-wise edit distance algorithm, which calculates the difference values 530 based on the differences between the words included in the first text 501 and the words included in each of the second texts 502.

In another example, the difference values 530 between the first text and the second texts may be calculated by a phonetic representation edit distance algorithm, which calculates the difference values based on the differences between the phonetic representation of the first text 501 and the phonetic representations of each of the second texts 502. However, the method is not limited to the above examples, and the text comparison module 2312 may use various methods to calculate the difference values 530 indicating the differences between the first text and the second texts. In addition, the text comparison module 2312 may use the above methods to calculate difference values indicating the differences among the second texts.
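The edit-distance variants above differ only in the unit being compared. The sketch below assumes the standard Levenshtein distance (unit cost for insertion, deletion, and substitution) and whitespace tokenization for words; the description does not specify the exact edit-distance definition. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def char_difference(first_text, second_text):
    """Character-wise edit distance between two texts."""
    return edit_distance(first_text, second_text)


def word_difference(first_text, second_text):
    """Word-wise edit distance, splitting each text on whitespace."""
    return edit_distance(first_text.split(), second_text.split())
```

A phonetic variant could reuse `edit_distance` on phoneme sequences, given a grapheme-to-phoneme converter, which is outside this sketch.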

The text comparison module 2312 according to an embodiment may output the calculated difference values 530 indicating the differences between the first text and the second texts.

FIG. 6 is a diagram for describing a method by which a server according to an embodiment uses an attack determination module to determine whether an input audio signal is an audio signal for attacking the voice assistant service.

In describing FIG. 6, components identical to those in FIGS. 1A and 1B through 5 are described using the same reference numerals.

Referring to FIG. 6, the text difference values 530 output from the text comparison module may be input to an attack determination module 2313.

In an embodiment, the attack determination module 2313 may determine, based on the text difference values 530, whether the input audio signal is an audio signal for attacking the voice assistant service. For example, when comparing the input audio signal with the pre-stored audio signals shows that similar audio signals have been input multiple times, and the text converted from the input audio signal varies from the texts converted from the previously input similar audio signals by at least a certain amount, the input audio signal may be determined to be an audio signal for attacking the voice assistant service.

Specifically, the attack determination module 2313 may identify, among the text difference values 530 obtained by comparing the first text with the second texts, the text difference values that are equal to or greater than a first threshold.

In an embodiment, text difference values equal to or greater than the first threshold may indicate that, although the waveform of the input audio signal is similar to those of the similar audio signals, the texts obtained by performing ASR on the respective audio signals differ. Accordingly, when such values exist, the input audio signal may be an audio signal from a user attempting to generate a voice attack (the hostile user 150).

Specifically, the input audio signal may be an audio signal that the hostile user 150, in order to find the text corresponding to the desired voice attack signal, generated by modifying an audio signal so that the text converted from it differs from the texts converted from the previously input similar audio signals.

The attack determination module 2313 may determine the input audio signal to be an audio signal for generating a voice attack when the number of text difference values equal to or greater than the first threshold is itself equal to or greater than a second threshold. That is, the first text varies from certain second texts by at least the first threshold, and the number of such second texts is at least the second threshold. In this case, the attack determination module may determine that the input audio signal is an audio signal for attacking the voice assistant service, and may output data indicating that the input audio signal is an audio signal from the hostile user 150 to an attack response module 2314.
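The two-threshold rule can be written as a short sketch. It assumes the text difference values 530 are available as a list of numbers; the function and parameter names are hypothetical.

```python
def is_attack(diff_values, first_threshold, second_threshold):
    """Flag the input when enough text differences are large.

    A difference counts as large when it is at least `first_threshold`;
    the input is flagged when at least `second_threshold` of them are large.
    """
    large = sum(1 for d in diff_values if d >= first_threshold)
    return large >= second_threshold
```

For example, with a first threshold of 5 and a second threshold of 3, the difference values `[0, 1, 7, 9, 8]` would be flagged, while `[0, 1, 7, 2, 1]` would not.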

When the attack response module 2314 receives data indicating that the input audio signal is an audio signal from the hostile user 150, it may perform an operation for defending against the voice attack.

In addition, when the attack determination module 2313 determines that the input audio signal is not an audio signal for generating a voice attack, it may output data indicating this to a voice assistant module 2320.

When the voice assistant module 2320 receives data indicating that the input audio signal is not an audio signal for generating a voice attack, it may transmit a response to the input audio signal to the client device.

Specifically, the voice assistant module 2320 may input the text converted from the input audio signal to an NLU model 2322 to parse the text and analyze the intents included in the parsed text. The voice assistant module 2320 may also input the intent analysis result to a device control information generation module 2324 to generate control information for controlling the client device and peripheral devices, and may transmit the control information to the client device 1000.

In this case, the client device 1000 may receive the response to the input audio signal and control the operation of the client device 1000 or the peripheral devices 3000.

FIG. 7 is a diagram for describing a method by which a server according to an embodiment uses an attack response module to determine an operation for defending against the generation of a voice attack.

In an embodiment, when the input audio signal is determined to be an audio signal for attacking the voice assistant service, the attack response module 2314 may perform an operation for defending against the attack.

예를 들어, 공격 대응 모듈(2314)은 공격 판단 모듈(2313)로부터 입력 오디오 신호가 적대적 사용자(150)에 의한 오디오 신호임을 나타내는 데이터를 입력 받은 경우, 음성 명령에 대한 응답을 요청한 적대적 사용자(150)의 계정을 차단할 수 있다. 적대적 사용자의 계정은 서버(2000)의 보이스 어시스턴트 서비스의 사용자 계정일 수 있으며, 적대적 사용자의 계정이 차단됨으로써 적대적 사용자에 대한 보이스 어시스턴트 서비스의 제공이 중단될 수 있다. 이 경우, 서버(2000)는 이후 입력되는 적대적 사용자(150)의 입력 오디오 신호에 대하여, 음성 명령에 대한 응답을 요청 받더라도 응답하지 않을 수 있다. 따라서, 음성 공격을 생성하려는 적대적 사용자(150)는 현재 입력된 입력 오디오 신호 다음으로 입력되는 입력 오디오 신호들에 대하여, 다음 입력 오디오 신호들이 적대적 사용자(150)가 원하는 음성 공격에 대응되는 텍스트로 전사되었는지 확인할 수 없게 된다.For example, when the attack response module 2314 receives data indicating that the input audio signal is an audio signal by the hostile user 150 from the attack determination module 2313, the hostile user 150 who requests a response to the voice command ) account can be blocked. The hostile user's account may be a user account of the voice assistant service of the server 2000 , and as the hostile user's account is blocked, provision of the voice assistant service to the hostile user may be stopped. In this case, the server 2000 may not respond to a subsequent input audio signal of the hostile user 150 even when a response to the voice command is requested. Therefore, the hostile user 150 who wants to generate a voice attack transcribes the input audio signals input next to the currently input audio signal into text corresponding to the voice attack desired by the hostile user 150. It is not possible to check whether

다른 예에서, 공격 대응 모듈(2314)은, 공격 판단 모듈(2313)로부터 입력 오디오 신호가 음성 공격을 생성하기 위한 오디오 신호임을 나타내는 데이터를 입력 받은 경우 입력 오디오 신호로부터 변환된 제1 텍스트를 반환하지 않고, 임의의 다른 텍스트를 반환할 수 있다.In another example, when the attack response module 2314 receives data indicating that the input audio signal is an audio signal for generating a voice attack from the attack determination module 2313, the first text converted from the input audio signal is not returned. , and can return any other text.

예를 들어, 공격 대응 모듈(2314)은 입력 오디오 신호로부터 변환된 제1 텍스트 대신에, 유사 오디오 신호들로부터 변환된 제2 텍스트들 중 하나의 텍스트를 반환할 수 있다. 다만, 이에 한정되는 것은 아니며, 공격 대응 모듈(2314)은 제1 텍스트 대신에 임의의 텍스트를 반환하거나, 어느 텍스트도 반환하지 않을 수 있다. 따라서, 음성 공격을 생성하려는 적대적 사용자(150)는 현재 입력하는 입력 오디오 신호에 대응되는 제1 텍스트를 확인할 수 없게 된다.For example, the attack response module 2314 may return one text among the second texts converted from the similar audio signals instead of the first text converted from the input audio signal. However, the present invention is not limited thereto, and the attack response module 2314 may return an arbitrary text instead of the first text or may not return any text. Accordingly, the hostile user 150 attempting to generate a voice attack cannot confirm the first text corresponding to the currently inputted audio signal.
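The two defensive behaviors described for FIG. 7 (blocking the account, or withholding the true transcript) can be illustrated with a minimal sketch. The class name `AttackResponder`, its methods, and the choice to pick a decoy at random are assumptions made for illustration only; the disclosure does not specify an implementation.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AttackResponder:
    """Hypothetical stand-in for the attack response module (2314)."""

    blocked_accounts: set = field(default_factory=set)

    def block(self, account_id: str) -> None:
        # Stop serving the voice assistant service to this account.
        self.blocked_accounts.add(account_id)

    def respond(self, account_id, first_text, second_texts, is_attack, rng=random):
        # A blocked account receives no response, even if it requests one.
        if account_id in self.blocked_accounts:
            return None
        # Benign input: return the real transcript (the first text).
        if not is_attack:
            return first_text
        # Suspected attack: never reveal the first text. Return one of the
        # second texts (transcripts of similar signals) as a decoy, or nothing.
        decoys = [t for t in second_texts if t != first_text]
        return rng.choice(decoys) if decoys else None
```

Either branch denies the hostile user the feedback needed to refine the next attempt: a blocked account learns nothing, and a decoy transcript does not reflect how the submitted audio was actually transcribed.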

FIG. 8 is a diagram for describing another embodiment in which a server operates to detect the generation of a voice attack on a device.

According to an embodiment, the user may be a hostile user 150 attempting to generate a voice attack. In this case, the attack audio of the hostile user 150 may be output by a speaker or the like and delivered to the client device 1000 as an input in the form of an audio signal.

According to an embodiment, when an audio signal input to the client device 1000 includes a call command, the client device 1000 may activate a function for providing the voice assistant service in response to the call command and request a response from the server 2000.

While the hostile user 150 is still in the process of generating a voice attack and the attack is incomplete, the attack audio input to the client device 1000 may not include the exact call command, so the client device 1000 may not be activated. However, to prevent the generation of a voice attack, the server 2000 must detect events in which similar audio signals are repeatedly input and must check and compare the texts converted from those similar audio signals. The server 2000 therefore needs to store similar audio signals, and to check the texts converted from them, even when they do not include the exact call command. In this case, the client device 1000 may also be activated by audio signals that include a similar call command rather than the exact call command, and may request a response to the voice command from the server 2000.

For example, the exact call command for calling the client device 1000 may be "Hi Bixby", but the attack audio of the hostile user 150 may not include the exact call command "Hi Bixby".

Specifically, the call command included in the attack audio of the hostile user 150 may be a similar call command such as "Hi Bix", "Hi Bisby", or "Hi Bibi". In this case, the client device 1000 according to an embodiment may be activated by the similar call command and transmit the audio signal including the similar call command to the server 2000. The server 2000 according to an embodiment may store the audio signal including the similar call command in an audio signal database to build a database of pre-stored similar audio signals, or may compare the audio signal including the similar call command with the pre-stored similar audio signals to determine whether it is an audio signal for generating a voice attack. The method by which the server determines whether an input audio signal is an audio signal for generating a voice attack has been described above with reference to FIG. 3, and a description thereof is omitted here.
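As an illustration of activating on similar call commands, the sketch below forwards an utterance to the server when its recognized text is close enough to the exact call command. The disclosure does not say how similarity between call commands is measured; the use of Python's `difflib.SequenceMatcher` on text, the function names, and the 0.75 threshold are all assumptions.

```python
from difflib import SequenceMatcher

WAKE_PHRASE = "hi bixby"       # exact call command from the example above
SIMILARITY_THRESHOLD = 0.75    # assumed value, not taken from the disclosure


def wake_similarity(heard: str, wake: str = WAKE_PHRASE) -> float:
    """Return a similarity ratio in [0, 1] between the heard phrase and the call command."""
    return SequenceMatcher(None, heard.lower().strip(), wake).ratio()


def should_forward_to_server(heard: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Activate on the exact call command and on near misses alike."""
    return wake_similarity(heard) >= threshold


for phrase in ["Hi Bixby", "Hi Bix", "Hi Bisby", "Hi Bibi", "play music"]:
    print(phrase, should_forward_to_server(phrase))
```

A lenient threshold lets the server observe and store the incomplete attack attempts it needs for comparison, at the cost of more accidental activations.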

FIG. 9 is a diagram for describing a method in which a server, according to an embodiment, receives input audio signals and stores them in a database to build up the plurality of pre-stored audio signals.

In step S910, the server 2000 may receive and obtain an input audio signal from the client device 1000. This is the same as step S310 of FIG. 3, and a description thereof is omitted here.

In step S920, the server 2000 may store the input audio signal in the database 2330. The input audio signal may be stored in its original form, or, as described with reference to FIG. 4B, in a converted data form obtained by applying a predetermined data processing method to the input audio signal. In addition, when there are multiple users, the input audio data may be stored separately for each user.
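Step S920 can be pictured as a per-user store that keeps each signal either in its original form or in a converted form. In the sketch below, peak normalization stands in for the "predetermined data processing method"; that choice, the in-memory dictionary, and the names `AudioSignalStore` and `peak_normalize` are assumptions for illustration.

```python
from collections import defaultdict


def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0 (assumed conversion)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


class AudioSignalStore:
    """Hypothetical in-memory stand-in for the database (2330)."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def save(self, user_id, samples, normalize=True):
        # Store either the converted form or the original samples, per user.
        data = peak_normalize(samples) if normalize else list(samples)
        self._by_user[user_id].append(data)

    def signals_for(self, user_id):
        # Only this user's signals are returned, so comparisons stay per user.
        return list(self._by_user.get(user_id, []))
```

Keeping signals separated by user means a later similarity search compares a new input only against that user's own history.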

In step S930, the server 2000 may check whether the input audio signal is recognized as a valid voice command. For example, the server 2000 may perform ASR on the input audio signal to convert it into text, and analyze the text using the NLU model. When the text analysis indicates a valid voice command for controlling the client device 1000 (for example, "Hi Bixby, tell me today's schedule") or a valid voice command for controlling a peripheral device (for example, "Hi Bixby, open the front door"), the server 2000 according to an embodiment may transmit a response to the input audio signal to the client device (step S935).

In step S940, the client device 1000 according to an embodiment may receive the response to the voice command from the server 2000 and control the client device 1000 and the peripheral device. For example, when the voice command of an ordinary user who wants to use the voice assistant is "Hi Bixby, tell me today's schedule", the client device 1000 may output audio briefing the user on today's schedule. In another example, when the ordinary user's voice command is "Hi Bixby, open the front door", the client device 1000 may control the peripheral device 3000 attached to the front door in order to open it.

Returning to step S930, the server 2000 may check whether the input audio signal is recognized as a valid voice command. For example, the server 2000 may perform ASR on the input audio signal to convert it into text, and analyze the text using the NLU model. When the text analysis shows that the meaning of the text is unclear and the text cannot be recognized as a voice command, the input audio signal may be an inaccurately uttered audio signal or an audio signal for generating a voice attack. When the input audio signal is not recognized as a valid voice command, the server 2000 may perform step S950.

In step S950, the server 2000 may check whether a plurality of pre-stored audio signals exist in the database 2330.

When a plurality of pre-stored audio signals exist in the database 2330, the server 2000 may perform operations for determining whether the input audio signal is an audio signal for generating a voice attack.

When no pre-stored audio signals exist in the database 2330, the server 2000 may end the operation.
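The branching of steps S920 through S950 can be summarized in a short sketch. The function `process_input`, the dictionary used as the database, and the injected `transcribe` and `is_valid_command` callables (standing in for the ASR and NLU models) are hypothetical. The sketch also assumes that "pre-stored" refers to signals stored before the current input.

```python
def process_input(user_id, audio, database, transcribe, is_valid_command):
    """Return (action, text) following the flow of FIG. 9 under the stated assumptions."""
    # Signals stored before this input arrived (assumed meaning of "pre-stored").
    previously_stored = list(database.get(user_id, []))

    # S920: store the current input, per user.
    database.setdefault(user_id, []).append(audio)

    # S930: ASR followed by NLU analysis (both injected as callables here).
    text = transcribe(audio)
    if is_valid_command(text):
        return ("respond", text)        # S935: send the response to the client

    # S950: an unrecognized command is checked against earlier signals, if any.
    if previously_stored:
        return ("check_attack", text)
    return ("end", text)


# Usage with stub models: the "audio" is a dict carrying its own transcript.
db = {}
asr = lambda a: a["text"]
nlu = lambda t: t.startswith("hi bixby,")
print(process_input("u", {"text": "hi bixby, open the front door"}, db, asr, nlu))
print(process_input("u", {"text": "hi bix oh pen door"}, db, asr, nlu))
```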

FIG. 10 is a flowchart for describing a method in which a server, according to an embodiment, detects the generation of a voice attack and, based on the detection result, either performs a device control operation or performs an operation to defend against the generation of the voice attack.

In steps S1010 to S1060, the server 2000 may receive an input audio signal and detect whether the received input audio signal is an audio signal for generating a voice attack. These steps correspond to steps S310 to S360 of FIG. 3, and a description thereof is omitted here.
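The detection logic can be sketched from the steps recited in the claims: keep pre-stored signals whose audio similarity meets a threshold, take the N most similar, compute edit distances between the first text and each second text, and flag an attack when enough of those distances are large. Character-level Levenshtein distance is one of the options the claims mention; the function names and all default threshold values below are assumptions, and the audio similarity scores are taken as given rather than computed.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def select_similar(scored, sim_threshold, top_n):
    """scored: list of (audio_similarity, transcript) for pre-stored signals."""
    hits = [p for p in scored if p[0] >= sim_threshold]
    hits.sort(key=lambda p: p[0], reverse=True)
    return hits[:top_n]


def is_attack_attempt(first_text, scored, sim_threshold=0.9, top_n=5,
                      first_threshold=2, second_threshold=3):
    """Flag an attack when many near-identical signals yield differing transcripts."""
    similar = select_similar(scored, sim_threshold, top_n)
    diffs = [edit_distance(first_text, text) for _, text in similar]
    # Count difference values at or above the first threshold, then compare
    # that count against the second threshold.
    return sum(d >= first_threshold for d in diffs) >= second_threshold
```

The intuition is that a user repeating a genuine command produces similar audio and similar transcripts, whereas a user iterating on a hidden attack submits audio that sounds nearly identical each time while the transcripts keep changing.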

When the server 2000 determines that the received input audio signal is not an audio signal from the voice attack generation stage, it may transmit a response to the input audio signal to the client device 1000 (step S1065).

In step S1070, the client device 1000 may receive the response to the input audio signal and control the operation of the client device 1000 or the peripheral devices 3000. This corresponds to step S940 of FIG. 9, and a description thereof is omitted here.

When the server 2000 determines that the received input audio signal is an audio signal from the voice attack generation stage, it may perform step S1080.

In step S1080, the server 2000 may determine an operation for defending against the voice attack generation stage. For example, the server 2000 may return a text other than the text converted from the input audio signal, or may block the user determined to be generating a voice attack. This has been described above with reference to FIG. 7, and a description thereof is omitted here.

FIG. 11 is a block diagram of a client device according to an embodiment.

Referring to FIG. 11, the client device 1000 according to an embodiment may include an input unit 1100, an output unit 1200, a processor 1300, a memory 1400, and a communication interface 1500.

According to an embodiment, there may be one or more client devices 1000. The client device 1000 may be, for example, a device on which a voice assistant can be installed, such as a smartphone, a tablet PC, a PC, a laptop, a smart TV, a smart refrigerator, or an artificial intelligence speaker, but is not limited thereto.

The input unit 1100 refers to a means by which a user inputs data for controlling the client device 1000. For example, the input unit 1100 may include a microphone, a key pad, a dome switch, a touch pad (of a contact capacitive type, a pressure resistive film type, an infrared sensing type, a surface ultrasonic conduction type, an integral tension measurement type, a piezoelectric effect type, or the like), a jog wheel, a jog switch, and the like, but is not limited thereto.

The input unit 1100 may receive a user's voice command for calling the client device 1000.

The output unit 1200 may output an audio signal, a video signal, or a vibration signal, and may include at least one of a display unit, a sound output unit, or a vibration motor.

The processor 1300 controls the overall operation of the client device 1000. For example, by executing programs stored in the memory 1400, the processor 1300 may control the input unit 1100, the output unit 1200, the memory 1400, the communication interface 1500, and the like.

The processor 1300 may receive an input audio signal, recognize a call command included in the input audio signal, and transmit a request for a response to the voice command to the server 2000. The processor 1300 may also receive, from the server 2000, and output a query message for generating and editing utterance data related to the functions of the client device 1000 and the peripheral devices.

The communication interface 1500 may include one or more components that enable communication with the server 2000, an IoT cloud server (not shown), and a peripheral device (not shown). For example, the communication interface 1500 may include a short-range communication unit and a mobile communication unit.

The short-range wireless communication unit may include a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, an Ant+ communication unit, and the like, but is not limited thereto.

The mobile communication unit transmits and receives wireless signals to and from at least one of a base station, an external terminal, or a server on a mobile communication network. Here, the wireless signals may include a voice signal, a video call signal, or various types of data associated with the transmission and reception of text or multimedia messages.

The memory 1400 may store programs for processing and control by the processor 1300, and may store data input to or output from the client device 1000.

The memory 1400 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk.

An SDK module (not shown) may be executed by the processor 1300 to perform operations necessary for controlling the client device 1000 and the peripheral device (not shown). The SDK module may be downloaded from the server 2000 and installed on the client device 1000. The SDK module may output, on the screen of the client device 1000, a GUI for controlling the client device 1000 and the peripheral device. If the client device 1000 does not include a display device, the SDK module may cause the client device 1000 to output a voice message for controlling the client device 1000 and the peripheral device. The SDK module may also cause the client device 1000 to receive a response from the user and provide it to the server 2000.

Meanwhile, the block diagrams of the client device 1000 and the server 2000 shown in FIGS. 2 and 11 are block diagrams for one embodiment. The components of each block diagram may be integrated, added, or omitted according to the specifications of the device as actually implemented. That is, two or more components may be combined into one component, or one component may be subdivided into two or more components, as needed. In addition, the functions performed in each block are intended to describe the embodiments, and the specific operations or devices do not limit the scope of the present invention.

The method of operating a server according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

In addition, the method of operating a server according to the disclosed embodiments may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity.

The computer program product may include a software program and a computer-readable storage medium in which the software program is stored. For example, the computer program product may include a product in the form of a software program (for example, a downloadable app) distributed electronically through the manufacturer of an electronic device or through an electronic market (for example, Google Play Store or App Store). For electronic distribution, at least part of the software program may be stored in a storage medium or generated temporarily. In this case, the storage medium may be a storage medium of the manufacturer's server, a server of the electronic market, or a relay server that temporarily stores the software program.

In a system composed of a server and a client device, the computer program product may include a storage medium of the server or a storage medium of the client device. Alternatively, when there is a third device (for example, a smartphone) communicatively connected to the server or the client device, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the software program itself, transmitted from the server to the client device or the third device, or transmitted from the third device to the client device.

In this case, one of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, the server may execute the computer program product stored in the server to control the client device communicatively connected to the server to perform the method according to the disclosed embodiments.

Although the embodiments have been described in detail above, the scope of the present invention is not limited thereto, and various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

1. A method for a server providing a voice assistant service to detect a voice attack on a device, the method comprising:
receiving an input audio signal from the device;
comparing the received input audio signal with a plurality of audio signals received from the device and pre-stored in the server;
identifying similar audio signals of the input audio signal from among the plurality of pre-stored audio signals based on a comparison result of the received input audio signal and the plurality of pre-stored audio signals;
converting the received input audio signal into a first text by performing Automatic Speech Recognition (ASR);
obtaining second texts converted from the identified similar audio signals;
comparing the first text converted from the input audio signal with the second texts converted from the similar audio signals; and
determining whether the input audio signal is an audio signal for attacking the voice assistant service based on a comparison result of the first text and the second texts,
wherein the plurality of pre-stored audio signals are provided from the device to the server before the input audio signal is received.

2. The method of claim 1,
wherein comparing the received input audio signal with the plurality of pre-stored audio signals comprises:
converting the received input audio signal and the plurality of pre-stored audio signals into an audio data format; and
comparing the audio data converted from the received input audio signal with a plurality of audio data converted from the plurality of pre-stored audio signals,
wherein the audio data converted from the received input audio signal and the plurality of audio data converted from the plurality of pre-stored audio signals are converted using at least one of normalization, locality-preserving hashing (LPH), or clustering.

3. The method of claim 1,
wherein identifying the similar audio signals comprises:
identifying, among the pre-stored audio signals, audio signals having a similarity with the input audio signal equal to or greater than a predetermined threshold as the similar audio signals; and
generating a similar audio signal set by grouping N similar audio signals having the highest similarity to the input audio signal from among the identified similar audio signals.

4. The method of claim 3,
wherein obtaining the second texts converted from the identified similar audio signals comprises:
obtaining the second texts by performing the ASR to convert the similar audio signals included in the similar audio signal set into texts.

5. The method of claim 3,
wherein obtaining the second texts converted from the identified similar audio signals comprises:
obtaining, as the second texts, texts corresponding to the similar audio signals included in the similar audio signal set from among third texts pre-stored in the server.

6. The method of claim 5,
wherein comparing the texts comprises:
calculating difference values representing differences between the first text and the second texts, and
wherein determining whether the input audio signal is an audio signal for an attack on the voice assistant service comprises:
determining whether the input audio signal is an audio signal for an attack on the voice assistant service based on the calculated difference values between the texts.

7. The method of claim 6,
wherein comparing the texts comprises:
calculating the difference values representing the differences between the first text and the second texts by applying an edit distance algorithm based on at least one of a difference between characters, a difference between words, or a difference between pronunciation expressions.

8. The method of claim 6,
wherein determining whether the input audio signal is an audio signal for an attack on the voice assistant service comprises:
determining that the input audio signal is an audio signal for an attack on the voice assistant service when, among the difference values representing the differences between the first text and the second texts, the number of difference values equal to or greater than a first threshold value is equal to or greater than a second threshold value.

9. The method of claim 8,
wherein the first threshold value and the second threshold value are set according to the number of the identified similar audio signals.

10. The method of claim 6,
wherein determining whether the input audio signal is an audio signal for an attack on the voice assistant service comprises:
determining whether the input audio signal is an audio signal for an attack on the voice assistant service by further using the comparison result of the received input audio signal and the plurality of pre-stored audio signals.

A server providing a voice assistant service for detecting a voice attack on a device, the server comprising:
a communication interface for performing data communication with the device;
a storage unit for storing a program including one or more instructions; and
a processor for executing the one or more instructions of the program stored in the storage unit,
wherein the processor is configured to:
control the communication interface to receive an input audio signal from the device;
compare the received input audio signal with a plurality of audio signals received from the device and pre-stored in the storage unit;
identify, based on a comparison result of the received input audio signal and the plurality of pre-stored audio signals, similar audio signals of the input audio signal from among the plurality of pre-stored audio signals;
perform automatic speech recognition (ASR) to convert the received input audio signal into a first text;
obtain second texts converted from the identified similar audio signals;
compare the first text converted from the input audio signal with the second texts converted from the similar audio signals; and
determine, based on a comparison result of the first text and the second texts, whether the input audio signal is an audio signal for an attack on the voice assistant service,
wherein the plurality of pre-stored audio signals are provided from the device to the server before the input audio signal is received.

12. The server of claim 11,
wherein the processor is configured to:
convert the received input audio signal and the plurality of pre-stored audio signals into audio data; and
compare the audio data converted from the received input audio signal with the plurality of audio data converted from the plurality of pre-stored audio signals,
wherein the audio data converted from the received input audio signal and the plurality of audio data converted from the plurality of pre-stored audio signals are transformed using at least one of normalization, locality-preserving hashing (LPH), and clustering.
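The claim names normalization and locality-preserving hashing without specifying a construction. The sketch below is only an illustration under stated assumptions: peak normalization is one common choice of normalization, and the quantized projection in `projection_bucket` is a simple stand-in chosen for this example because nearby inputs fall into the same or neighboring buckets; it is not asserted to be the hashing scheme the patent uses. All function names and the parameters `direction` and `width` are hypothetical.

```python
import math
from typing import List, Sequence


def peak_normalize(samples: Sequence[float]) -> List[float]:
    """Scale samples so the largest absolute amplitude becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return [0.0 for _ in samples]
    return [s / peak for s in samples]


def projection_bucket(features: Sequence[float],
                      direction: Sequence[float],
                      width: float) -> int:
    """Project features onto a direction and quantize into a bucket index.

    Stand-in for a locality-preserving hash: inputs whose projections
    differ by less than `width` land in the same or adjacent buckets.
    """
    projection = sum(f * d for f, d in zip(features, direction))
    return math.floor(projection / width)


if __name__ == "__main__":
    normalized = peak_normalize([2.0, -4.0, 1.0])
    print(normalized, projection_bucket(normalized, [1.0, 0.0, 0.0], 0.25))
```

Comparing bucket indices instead of raw waveforms lets a server narrow the set of stored signals before running a finer similarity comparison, at the cost of occasionally separating near neighbors across a bucket boundary.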

13. The server of claim 11,
wherein the processor is configured to:
identify, as the similar audio signals, audio signals having a similarity to the input audio signal equal to or greater than a predetermined threshold from among the pre-stored audio signals; and
generate a similar audio signal set by grouping the N similar audio signals having the highest similarity to the input audio signal from among the identified similar audio signals.
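The threshold-then-top-N selection recited above can be sketched as follows. The claim does not specify the similarity measure, so cosine similarity over fixed-length feature vectors is an assumption of this example, as are the names `cosine_similarity` and `similar_signal_set`.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity (assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_signal_set(query: Sequence[float],
                       stored: Sequence[Sequence[float]],
                       threshold: float,
                       n: int) -> List[int]:
    """Indices of the n stored signals most similar to the query.

    Only signals whose similarity is >= threshold are eligible.
    """
    candidates = []
    for index, features in enumerate(stored):
        similarity = cosine_similarity(query, features)
        if similarity >= threshold:
            candidates.append((similarity, index))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in candidates[:n]]


if __name__ == "__main__":
    stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.1]]
    print(similar_signal_set([1.0, 0.0], stored, threshold=0.7, n=2))
```

Capping the set at N bounds the number of transcriptions that must be compared for each incoming signal, regardless of how many stored signals pass the similarity threshold.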

14. The server of claim 13,
wherein the processor is configured to:
obtain the second texts by performing the ASR to convert the similar audio signals included in the similar audio signal set into texts.

15. The server of claim 13,
wherein the processor is configured to:
obtain, as the second texts, texts corresponding to the similar audio signals included in the identified similar audio signal set from among third texts pre-stored in the storage unit.

16. The server of claim 15,
wherein the processor is configured to:
calculate difference values, each representing a difference between the first text and one of the second texts; and
determine whether the input audio signal is an audio signal for an attack on the voice assistant service based on the calculated difference values between the texts.

17. The server of claim 16,
wherein the processor is configured to:
calculate the difference values indicating the difference between the first text and the second texts by applying an edit distance algorithm based on at least one of a difference between characters, a difference between words, and a difference between pronunciation expressions.

18. The server of claim 16,
wherein the processor is configured to:
determine that the input audio signal is an audio signal for an attack on the voice assistant service when, among the difference values indicating the difference between the first text and the second texts, the number of difference values equal to or greater than a first threshold value is equal to or greater than a second threshold value.

19. The server of claim 16,
wherein the processor is configured to:
determine whether the input audio signal is an audio signal for an attack on the voice assistant service by further using a comparison result between the received input audio signal and the plurality of pre-stored audio signals.

A computer-readable recording medium having recorded thereon a program for executing the method of claim 1 on a computer.