KR20220040997A

KR20220040997A - Electronic apparatus and control method thereof

Info

Publication number: KR20220040997A
Application number: KR1020210122742A
Authority: KR
Inventors: 쿠마르 란잔 사말; 시바무르티 쿠마르 프라빈 구바칼루; 푸루쇼타마 차우다리 고누군틀라; 벨고드 맨주나트 로카나트; 락스미나라얀 리투라즈 카브라
Original assignee: 삼성전자주식회사
Priority date: 2020-09-23
Filing date: 2021-09-14
Publication date: 2022-03-31

Abstract

Disclosed is a control method of an electronic device. The control method comprises: identifying at least one characteristic of at least one user interface (UI) element included in a screen of the electronic device; obtaining a database that includes natural language utterances obtained based on the at least one characteristic of the identified UI element; when a speech input is received, identifying whether an utterance of the received speech input matches a natural language utterance included in the obtained database; and automatically accessing the at least one UI element when the utterance of the speech input is identified as matching the natural language utterance. The method is thereby capable of improving the user experience.

Description

Electronic apparatus and control method thereof

The present disclosure relates to an electronic device and a control method thereof, and more particularly, to an electronic device providing a UI and a control method thereof. The present disclosure claims priority to Indian Provisional Application No. 202041041137, filed on September 23, 2020, the contents of which are incorporated herein by reference.

In general, electronic devices (e.g., smart TVs, smartphones, etc.) are becoming more powerful and sophisticated with the advent of new applications (e.g., streaming video applications such as Netflix and YouTube). Instead of accessing the channel list of a video application, the user can browse through several applications, each with its own layout and characteristics/functions. Despite these advances, the use of electronic devices with touch screens, keyboards, remote controls, or mice is still unusual. The dominant mode of access remains interaction with electronic devices that offer a limited input range and inadequate support for text entry and fine-grained navigation or selection.

Some existing methods/electronic devices provide several viable ways to bridge the gap between interface complexity and remote control capabilities. Speech recognition technology is one of them. Intelligent voice agents/assistants such as Alexa, Siri, and Google Assistant are integrated into electronic devices, allowing users to manage an increasingly sophisticated suite of applications accessible on the electronic device via voice input. For example, the phrase/command/voice input "Jarvis, good morning" starts the user's morning routine. Upon receiving the voice input, the intelligent voice agent/assistant turns on the lights, provides weather information, gives a news briefing, and starts the coffee machine. However, the user has to memorize specific phrases/commands to control the electronic device, and because conversing with a home appliance/electronic device/IoT device through such specific phrases/commands is somewhat unnatural (e.g., "schedule a meeting with [name] at [location] tomorrow at [time]"), this is rather difficult for non-technical users.

As shown in FIG. 1, an existing method 2/electronic device 10 provides a way to reduce the gap between interface complexity and remote control capability. For example, the user 1 may access a UI element (e.g., a video) of a video application by uttering a specific number displayed on the UI. Consider a scenario in which the user 1 wants to access the third video displayed on the screen of the electronic device 10. To play/access the video, the user 1 must utter a specific voice command, such as "play video 3". However, in order to access the UI elements displayed on the screen of the electronic device 10, the user 1 must be close enough to the electronic device 10 to read the name or number of a UI element, which is not always feasible. Moreover, uttering a specific voice command is not a natural way to access/invoke a UI element. In addition, the user 1 cannot access sub-functions or sub-pages of the video application that are not displayed on the screen of the electronic device 10.

As shown in FIG. 1, another existing method 3/electronic device 10 provides a way to reduce the gap between interface complexity and remote control capability. For example, the user 1 can utter "show names" to display an overlay with UI element names and thereby access UI elements of a social media application (e.g., the search bar, home icon, like button, share button, etc.). If a UI element does not have a name, or if the user 1 prefers to use numbers instead, the user 1 can say "show numbers" to display a numeric tag for each UI element on the screen of the electronic device 10. However, with this method 3 the user 1 can only access UI elements displayed on the screen of the electronic device 10. The user 1 cannot access sub-functions or sub-pages of the social media application that are not displayed on the screen of the electronic device 10. In addition, in order to access a UI element displayed on the screen of the electronic device 10, the user 1 must be close enough to the electronic device 10 to read the name or number of the UI element, which is not always feasible.

As another example, consider a scenario in which the user 1 wants to access a search page, as shown in FIG. 1. Some existing methods 4/electronic devices 10 generally provide a limited set of control and search instructions and do not support deeper recommendation or exploratory queries. Table 1 shows an example of the limited control set.

Accordingly, there is a need for a useful alternative for automatically accessing UI elements of an electronic device using voice input/voice-based interaction.

The present disclosure addresses the above-described need. According to an embodiment of the present disclosure, when a user's voice input matches a predicted natural language (NL) utterance of at least one identified UI element, the at least one identified UI element (e.g., an executable UI element, a non-executable UI element, a text UI element, or a non-text UI element) displayed on the screen of the electronic device can be accessed automatically. As a result, the user can use natural language to access the various identified UI elements, which improves the user experience. The natural language utterance may be predicted by obtaining a database/knowledge graph based on characteristics (e.g., relative position, capability, function, type, shape, etc.) of the at least one identified UI element. The user's voice input includes an utterance about a characteristic of the at least one identified UI element present in the database.

According to an embodiment of the present disclosure, the knowledge graph may be obtained by determining a degree of similarity between each UI element of the at least one UI element and the other UI elements. The degree of similarity between UI elements may be determined based on the position of each UI element, the relative position of each UI element among the other UI elements, the capability of each UI element, the function of each UI element, and the shape of each UI element displayed on the screen of the electronic device.

According to an embodiment of the present disclosure, each UI element of the at least one UI element may be clustered based on the degree of similarity. Furthermore, the electronic device may determine a textual representation of a non-text UI element from its visible characteristic (e.g., a heart-shaped icon) and the relative position of each UI element among the other UI elements. In addition, the electronic device may map predefined information to each UI element of the at least one UI element, such as its capability (e.g., press, copy), verbs (e.g., tap, punch, click, copy, and duplicate), and type (e.g., button, text), and may determine a predefined sequence of screens traversed through the executable UI elements.
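The similarity and clustering steps described above can be sketched as follows. This is only an illustrative sketch: the `UIElement` fields, the equal weighting of cues, the 0.05 row tolerance, and the 0.75 threshold are all assumptions introduced here, not values taken from the disclosure. The `capability` field stands for what an element can do (e.g., press or copy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    name: str
    x: float          # normalized horizontal position, 0..1 (assumed convention)
    y: float          # normalized vertical position, 0..1 (assumed convention)
    capability: str   # e.g. "press", "copy"
    ui_type: str      # e.g. "button", "text"
    shape: str        # e.g. "heart", "rectangle"


def similarity(a: UIElement, b: UIElement) -> float:
    """Illustrative similarity in [0, 1]: equal-weighted mix of four cues."""
    same_row = abs(a.y - b.y) < 0.05  # crude relative-position cue
    cues = [
        same_row,
        a.capability == b.capability,
        a.ui_type == b.ui_type,
        a.shape == b.shape,
    ]
    return sum(float(c) for c in cues) / len(cues)


def cluster(elements, threshold=0.75):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for el in elements:
        for group in clusters:
            if similarity(group[0], el) >= threshold:
                group.append(el)
                break
        else:
            clusters.append([el])
    return clusters


elements = [
    UIElement("like", 0.10, 0.90, "press", "button", "heart"),
    UIElement("share", 0.30, 0.90, "press", "button", "arrow"),
    UIElement("caption", 0.10, 0.50, "copy", "text", "rectangle"),
]
groups = cluster(elements)
```

In this toy input the two buttons on the same row fall into one cluster and the caption text into another; a real implementation would likely use a richer similarity measure.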

According to an embodiment of the present disclosure, a semantic translation is performed on the obtained knowledge graph to obtain natural language variants for a single-step intent and/or a multi-step intent. In addition, the electronic device uses the obtained knowledge graph to identify actions and action sequences for the single-step intent and/or the multi-step intent, and dynamically obtains a natural language model for predicting the natural language utterance corresponding to the at least one identified UI element by mapping the obtained natural language variants to the identified actions and the identified action sequences. As a result, the user can access sub-functions or sub-pages of an application/UI element that are not displayed on the screen of the electronic device, which improves the user experience.
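One way to picture how an action sequence for a multi-step intent might be derived from screen transitions is a shortest-path search. The sketch below assumes a hypothetical transition table (`TRANSITIONS`) mapping each screen to the executable UI elements that lead to other screens; the screen names, element names, and the use of breadth-first search are illustrative assumptions, not the disclosed method.

```python
from collections import deque

# Hypothetical screen-transition data: screen -> {executable UI element: next screen}
TRANSITIONS = {
    "home": {"search_icon": "search", "profile_icon": "profile"},
    "search": {"filter_button": "filters"},
    "profile": {"settings_button": "settings"},
    "filters": {},
    "settings": {},
}


def action_sequence(start, target):
    """Breadth-first search; returns the shortest list of elements to press, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if screen == target:
            return path
        for element, nxt in TRANSITIONS.get(screen, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [element]))
    return None  # target screen is not reachable from the start screen
```

For example, `action_sequence("home", "settings")` yields a two-step sequence through the profile screen, which is how a sub-page that is not currently displayed could still be reached from a single utterance.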

A voice-based interaction method of an electronic device according to an embodiment of the present disclosure for achieving the above object includes: identifying at least one user interface (UI) element displayed on a screen of the electronic device; determining a characteristic of the at least one identified UI element; obtaining a database based on at least one characteristic of the at least one identified UI element, wherein the database includes natural language (NL) utterances corresponding to the at least one identified UI element, and the natural language utterances are predicted based on the at least one characteristic of the at least one identified UI element; receiving a voice input from a user of the electronic device, wherein the voice input includes an utterance indicating at least one characteristic of the at least one identified UI element displayed on the screen of the electronic device; determining whether the utterance of the received voice input matches at least one natural language utterance in the obtained database; and automatically accessing, by the electronic device, at least one UI element of the at least one UI element when it is determined that the utterance of the voice input received from the user matches the predicted natural language utterance of the at least one identified UI element.

According to an embodiment, the at least one UI element may include at least one of an executable UI element, a non-executable UI element, a text UI element, and a non-text UI element.

According to an embodiment, obtaining the database further includes obtaining a knowledge graph and storing the obtained knowledge graph in the database. The knowledge graph is obtained based on the position of each UI element, the relative position of each UI element among the other UI elements, the function of each UI element, the type of each UI element, and the shape of each UI element displayed on the screen of the electronic device. The method also includes determining a degree of similarity between each UI element of the at least one UI element and the other UI elements based on the position of each UI element, the relative position of each UI element among the other UI elements, the function of each UI element, the capability of each UI element, and the appearance of each UI element displayed on the screen of the electronic device. The method also includes clustering each UI element of the at least one UI element based on the degree of similarity. The method also includes determining a textual representation of a non-text UI element from its visible characteristic and the relative position of each UI element among the other UI elements. The method also includes mapping each UI element of the at least one UI element to predefined information corresponding to the capability of that UI element. The method also includes determining a predefined screen sequence traversed through the executable UI elements.

Predicting the natural language utterance of the at least one identified UI element may include: performing a semantic translation on the obtained knowledge graph to obtain natural language variants for at least one of a single-step intent and a multi-step intent; identifying, using the obtained knowledge graph, at least one of an action and an action sequence for the at least one of the single-step intent and the multi-step intent; and dynamically obtaining a natural language model for predicting the natural language utterance of the at least one identified UI element by mapping the obtained natural language variants to the at least one identified action and the at least one identified action sequence.

According to an embodiment, performing the semantic translation may include: receiving the obtained knowledge graph; categorizing each UI element of the received knowledge graph into at least one of a domain, a verb, a synonym, a slot, a slot type, a textually represented slot, a capability, and a relative position on the screen of the electronic device; and obtaining the natural language variants based on the categorization.
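As a rough illustration of how categorized fields could be expanded into natural language variants, the sketch below combines verbs (with synonyms) and slot phrasings. The `VERBS` table, the dictionary keys, and the "verb the slot" template are hypothetical choices made for this example only.

```python
from itertools import product

# Hypothetical predefined mapping from a capability to verbs and their synonyms.
VERBS = {"press": ["tap", "click"], "copy": ["copy", "duplicate"]}


def nl_variants(node):
    """Expand one categorized knowledge-graph node into candidate utterances.

    node: dict with 'capability', 'slot' (textual representation) and an
    optional 'relative_position' phrase.
    """
    verbs = VERBS.get(node["capability"], [])
    slots = [node["slot"]]
    if node.get("relative_position"):
        # Also allow the user to refer to the element by where it sits on screen.
        slots.append(f"{node['slot']} {node['relative_position']}")
    return [f"{verb} the {slot}" for verb, slot in product(verbs, slots)]


node = {
    "capability": "press",
    "slot": "heart icon",
    "relative_position": "below the video",
}
variants = nl_variants(node)
```

Here a non-text element (a heart-shaped icon) is addressed through its textual representation, with and without its relative position.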

According to an embodiment, identifying at least one of the action and the action sequence for the at least one of the single-step intent and the multi-step intent may further include receiving the obtained knowledge graph and determining action routines based on the capability of each UI element.

According to an embodiment, dynamically obtaining the natural language model to predict the natural language utterance of the at least one identified UI element by mapping the obtained natural language variants to the at least one identified action and the at least one identified action sequence may further include: clustering similar natural language variants; assigning a dynamic intent to the similar natural language variants; associating the dynamic intent with the at least one identified action and the at least one identified action sequence; dynamically obtaining the natural language model based on the clustered natural language variants, the dynamic intent, and the action routines; and storing the dynamically obtained natural language model in the database.
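The clustering, intent assignment, and association steps above can be sketched as a small grouping routine. This sketch makes a simplifying assumption that is not in the disclosure: two variants count as "similar" when they resolve to the same action sequence. The intent naming scheme (`dynamic_intent_N`) and the dictionary layout are likewise illustrative.

```python
def build_nl_model(variant_actions):
    """Group NL variants into dynamically named intents.

    variant_actions: list of (nl_variant, action_sequence) pairs.
    Returns: intent id -> {"actions": [...], "variants": [...]}.
    """
    model = {}
    intent_of = {}  # action sequence (as a tuple) -> generated intent id
    for variant, actions in variant_actions:
        key = tuple(actions)
        if key not in intent_of:
            # First time this action sequence is seen: assign a new dynamic intent.
            intent_of[key] = f"dynamic_intent_{len(intent_of) + 1}"
            model[intent_of[key]] = {"actions": list(actions), "variants": []}
        model[intent_of[key]]["variants"].append(variant)
    return model


pairs = [
    ("tap the heart icon", ["press:like"]),
    ("click the heart icon", ["press:like"]),
    ("open settings", ["press:profile_icon", "press:settings_button"]),
]
model = build_nl_model(pairs)
```

The resulting dictionary could then be stored alongside the knowledge graph so that a recognized variant leads directly to its action routine.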

According to an embodiment, determining whether the utterance of the received voice input matches the predicted natural language utterance of the at least one identified UI element may include: determining a text representation score by reading screen information in the form of text representations/text UI elements displayed on the screen 140; extracting synonyms for verbs and nouns from the text representations/text UI elements and assigning a synonym score; determining a variance score associating the natural language variants used for weighted training of a dynamic language generator; determining a relevance score by comparing reference objects mentioned in the utterance of the received voice input with information on nearby elements; determining a matching score as the final score for matching the utterance of the received voice input against the dynamic language generator; and combining the determined score with the relevance score. The matching score determines the target element (i.e., the UI element) on which the action will be executed. When scores are similar, the user is presented with an option to choose among the conflicting elements.
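The score combination and the tie-handling behavior can be pictured with the sketch below. The disclosure does not give a formula, so the averaging in `final_score` and the `tie_margin` used to decide that two scores are "similar" are assumptions chosen purely for illustration.

```python
def final_score(text_score, synonym_score, variance_score, relevance_score):
    """Assumed combination: mean of the three lexical scores, averaged with relevance."""
    matching = (text_score + synonym_score + variance_score) / 3
    return (matching + relevance_score) / 2


def pick_target(candidates, tie_margin=0.05):
    """Choose the target UI element, or report a conflict.

    candidates: dict of element name -> final score.
    Returns (winner, conflicts). When the top scores are within tie_margin,
    winner is None and conflicts lists the elements the user should choose from.
    """
    if not candidates:
        return None, []
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best = ranked[0]
    conflicts = [name for name, score in ranked[1:] if best - score <= tie_margin]
    if conflicts:
        return None, [best_name] + conflicts
    return best_name, []
```

Returning the conflicting elements rather than silently picking one mirrors the described behavior of presenting the user with a choice when scores are close.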

According to an embodiment, the at least one characteristic of the at least one identified UI element may include the position of each UI element, the relative position of each UI element among the other UI elements, the function of each UI element, the capability of each UI element, the type of each UI element, and the appearance of each UI element displayed on the screen of the electronic device.

The voice-based interaction electronic device provided in embodiments of the present disclosure includes an interaction engine operatively connected to a memory and a processor. The interaction engine is configured to: identify at least one user interface (UI) element displayed on the screen of the electronic device; identify a characteristic of the at least one identified UI element; obtain a database based on at least one characteristic of the at least one identified UI element, wherein the database includes natural language (NL) utterances corresponding to the at least one identified UI element, and the natural language utterances are predicted based on the at least one characteristic of the at least one identified UI element; receive a voice input from a user of the electronic device, wherein the voice input includes an utterance indicating at least one characteristic of the at least one identified UI element displayed on the screen of the electronic device; determine whether the utterance of the received voice input matches at least one natural language utterance in the obtained database; and automatically access at least one UI element of the at least one UI element when it is determined that the utterance of the voice input received from the user matches the predicted natural language utterance of the at least one identified UI element.

In an embodiment, a priority among conflicting commands (e.g., "Go to Home") is determined based on specific priority rules supplied to the electronic device. Here, "home" may mean either the application home or the system home.
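A priority rule of this kind can be sketched as a simple lookup. The rule table below, including the choice that the application scope outranks the system scope, is a hypothetical example; the disclosure only states that such rules are supplied to the device, not what they are.

```python
# Hypothetical priority rules supplied to the device: a lower number wins.
PRIORITY = {"application": 0, "system": 1}


def resolve(candidates):
    """Pick one interpretation of an ambiguous command.

    candidates: list of (scope, target) pairs that all match the same utterance.
    Unknown scopes rank after every scope listed in PRIORITY.
    """
    return min(candidates, key=lambda c: PRIORITY.get(c[0], len(PRIORITY)))


# "Go to Home" matches both the system home and the application home.
chosen = resolve([("system", "system_home"), ("application", "app_home")])
```

With these assumed rules, the ambiguous command resolves to the application home; swapping the numbers in `PRIORITY` would reverse the outcome.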

FIG. 1 is a view for explaining the related art.
FIGS. 2A and 2B are diagrams illustrating a configuration of an electronic device according to an embodiment.
FIG. 3 is a flowchart illustrating a method of controlling an electronic device according to an embodiment.
FIG. 4 is a flowchart illustrating a method of dynamically building a model according to an embodiment.
FIG. 5 is a flowchart illustrating a method of dynamically building a model according to an embodiment.
FIG. 6 is a diagram for describing a score system according to an embodiment.
FIG. 7 is a diagram for describing an example scenario of dynamically building a model according to an embodiment.
FIG. 8 is a diagram for explaining a difference between the related art and the proposed method according to an embodiment.
FIGS. 9A to 9C are diagrams for explaining functions related to a natural language synthesizer according to an embodiment.

Terms used in this specification will be briefly described, and the present disclosure will then be described in detail.

In this specification, expressions such as "have," "may have," "include," or "may include" indicate the presence of a corresponding feature (e.g., a numerical value, a function, an operation, or a constituent element such as a component) and do not exclude the presence of additional features.

The expression "at least one of A and/or B" should be understood as indicating any one of "A", "B", or "A and B".

As used herein, expressions such as "first," "second," "1st," or "2nd" may modify various elements regardless of order and/or importance, and are used only to distinguish one element from another without limiting the elements.

When an element (e.g., a first element) is referred to as being "(operatively or communicatively) coupled with/to" or "connected to" another element (e.g., a second element), it should be understood that the element may be connected to the other element directly or through yet another element (e.g., a third element).

A singular expression includes the plural expression unless the context clearly dictates otherwise. In this application, terms such as "comprise" or "consist of" are intended to designate the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, and should be understood as not precluding the presence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

In the present disclosure, a "module" or "unit" performs at least one function or operation, and may be implemented as hardware, software, or a combination of hardware and software. In addition, a plurality of "modules" or a plurality of "units" may be integrated into at least one module and implemented by at least one processor (not shown), except for "modules" or "units" that need to be implemented with specific hardware.

The embodiments described below, and their various features and advantages, are explained in more detail in the following detailed description with reference to the non-limiting embodiments illustrated in the accompanying drawings. Detailed descriptions of well-known components and processing techniques are omitted so as not to obscure the gist of the embodiments. The various embodiments described below are not necessarily mutually exclusive, and some embodiments may be combined with at least one other embodiment to form a new embodiment. The term "or" as used below may be non-exclusive or exclusive, unless the context clearly dictates otherwise. The examples used below are intended merely to aid understanding of how the embodiments may be practiced and to enable those skilled in the art to practice them. Accordingly, the examples should not be construed as limiting the embodiments below.

As is customary in the art, embodiments may be described and illustrated in terms of blocks that perform the described function or functions. These blocks, which may be referred to as managers, units, modules, hardware components, and the like, are physically implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, and hardwired circuits, and may optionally be driven by firmware. The circuits may be embodied, for example, in at least one semiconductor chip or on a substrate support such as a printed circuit board. The circuits constituting a block may be implemented by dedicated hardware, by a processor (e.g., at least one programmed microprocessor and associated circuitry), or by a combination of dedicated hardware that performs some functions of the block and a processor that performs other functions of the block. Each block of an embodiment may be physically separated into two or more interacting and discrete blocks without departing from the scope of the present disclosure. Likewise, the blocks of an embodiment may be physically combined into more complex blocks without departing from the scope of the present disclosure.

Accordingly, the embodiments described below provide a method for voice-based interaction with an electronic device. The method includes identifying at least one UI element displayed on a screen of the electronic device, identifying the characteristics of the identified at least one UI element, and obtaining a database based on those characteristics. The database includes natural language utterances of the identified at least one UI element, where the natural language utterances are predicted based on at least one characteristic of the identified at least one UI element.

The method further includes receiving a voice input from a user of the electronic device, where the voice input may indicate characteristics of the at least one UI element stored in the database. The method further includes determining whether the utterance of the received voice input matches the predicted natural language utterances of the identified at least one UI element and, when it is identified as matching, automatically accessing (or executing) a UI element among the at least one UI element.

Accordingly, the embodiments described below provide an electronic device for voice-based interaction. The electronic device includes an interaction engine coupled with a processor and a memory. The interaction engine identifies at least one UI element displayed on the screen of the electronic device and determines the characteristics of the identified at least one UI element. The interaction engine obtains a database based on those characteristics in order to predict the natural language utterances of the identified at least one UI element. The interaction engine receives a voice input from a user of the electronic device, where the voice input includes an utterance indicating a characteristic of the identified at least one UI element present in the database. The interaction engine determines whether the utterance of the received voice input matches the predicted natural language utterances of the identified at least one UI element and, if it does, automatically accesses (or executes) at least one of the UI elements.

Unlike conventional approaches, the proposed method enables automatic access to the identified at least one UI element displayed on the screen (e.g., an actionable UI element, a non-actionable UI element, a textual UI element, or a non-textual UI element) when the user's utterance matches the predicted natural language utterance of that UI element. As a result, the user can access the various identified UI components using natural language, which improves the user experience. The natural language utterances are predicted by obtaining a database/knowledge graph based on the characteristics of the identified at least one UI element (e.g., relative position, capability, function, type, shape, and the like). The user's voice input may include an utterance of a characteristic of the identified at least one UI element present in the database.

Unlike conventional approaches, the proposed method may obtain a knowledge graph by determining the similarity between each UI element and the other UI elements, clustering the UI elements based on that similarity, determining textual representations of non-textual UI elements from their visible features and from the relative position of each UI element among the other UI elements, mapping predefined information to each UI element and its corresponding capability, and determining a predefined sequence of screens traversed through actionable UI elements. The similarity between UI elements is determined based on the position of each UI element, its relative position among the other UI elements, its function, its capability, and its appearance (or shape) as displayed on the screen.

Unlike conventional approaches, the proposed method performs semantic translation on the obtained knowledge graph to obtain natural language variations for single-step intents and/or multi-step intents. Moreover, the electronic device uses the obtained knowledge graph to identify actions and action sequences for the single-step and/or multi-step intents, and dynamically obtains a natural language model by mapping the natural language variations to the identified actions and action sequences, so as to predict the natural language utterances of the identified at least one UI element. As a result, the user can access sub-functions or sub-pages of a displayed application or UI element even though they are not themselves displayed on the screen of the electronic device, which improves the user experience.

Referring now to FIGS. 2A through 9C, in which like reference numerals consistently denote corresponding features throughout the drawings, embodiments are shown.

FIG. 2A is a block diagram illustrating a configuration of an electronic device according to an embodiment.

The electronic device 100 may be, for example, a TV, a smartphone, a tablet, a laptop, an Internet of Things (IoT) device, smart glasses, or a smart watch, but is not limited thereto. According to an embodiment, the electronic device 100 includes a memory 110 and a processor 120.

The memory 110 stores the characteristics of the identified at least one UI element and the database/knowledge graph. The memory 110 also stores instructions to be executed by the processor 120. The memory 110 may include non-volatile storage elements. Examples of such non-volatile storage elements include magnetic hard disks, optical disks, floppy disks, flash memories, and forms of electrically programmable memory (EPROM) or electrically erasable and programmable memory (EEPROM). In some examples, the memory 110 may be a non-transitory storage medium. The term "non-transitory" indicates that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory 110 is non-movable. In one example, the memory 110 may be configured to store larger amounts of information. In one example, a non-transitory storage medium may store data that can change over time (e.g., in random access memory (RAM) or cache). The memory 110 may be internal storage of the electronic device 100, or it may be storage external to the electronic device 100, cloud storage, or any other type of external storage.

The processor 120 executes the instructions stored in the memory 110 and performs various processes. The processor 120 may include one or more processors, which may be a general-purpose processor such as a central processing unit (CPU) or an application processor (AP), a graphics-dedicated processor such as a graphics processing unit (GPU) or a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).

According to an embodiment, the processor 120 may identify at least one characteristic of at least one user interface (UI) element included in the screen 140 of the electronic device 100, and may obtain a database based on the at least one characteristic of the identified UI element. Here, the database may include natural language utterances obtained based on the at least one characteristic of the identified UI element. The at least one characteristic may include at least one of the relative position of each UI element, the function of each UI element, the capability of each UI element, the type of each UI element, and the shape of each UI element. In this case, the database may be generated in real time based on the current screen 140 of the electronic device 100.
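
As a rough illustration of a database generated from the characteristics of on-screen UI elements, the following Python sketch pairs each element's capability verbs with a name derived from its label or, for a non-textual element, from its shape and type. The `UIElement` fields, the `build_utterance_db` function, and the "verb + name" template are hypothetical choices made for this example and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """Hypothetical record of the characteristics identified for one element."""
    element_id: str
    label: str         # visible text; empty for non-textual elements
    element_type: str  # e.g. "tab", "icon"
    shape: str         # e.g. "heart"; empty when not applicable
    verbs: tuple       # verbs associated with the element's capability


def build_utterance_db(elements):
    """Map each predicted utterance to the id of the element it refers to."""
    db = {}
    for el in elements:
        # Non-textual elements are named after their shape and type (assumption).
        name = el.label or f"{el.shape} {el.element_type}".strip()
        for verb in el.verbs:
            db[f"{verb} {name}".lower()] = el.element_id
    return db


screen = [
    UIElement("e1", "Notifications", "tab", "", ("open", "click")),
    UIElement("e2", "", "icon", "heart", ("click", "tap")),
]
utterance_db = build_utterance_db(screen)
```

Rebuilding the dictionary whenever the screen changes would correspond loosely to generating the database in real time from the current screen.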

Thereafter, when a voice input is received from the user, the processor 120 identifies whether the utterance of the received voice input matches a natural language utterance included in the database and, if a match is identified, may automatically access the at least one UI element. Here, the at least one UI element may include at least one of an actionable UI element, a non-actionable UI element, a textual UI element, or a non-textual UI element.

Meanwhile, the processor 120 may identify whether the utterance of the received voice input matches a natural language utterance included in the database through a plurality of matching stages. In the following description, textual UI elements, synonyms of textual UI elements, and non-textual UI elements are compared in that order, but the comparison is not necessarily limited to this order.

According to an example, the processor 120 may perform a first comparison of the utterance of the voice input against the natural language utterances corresponding to textual UI elements included in the database. Here, the textual UI element may be an actionable UI element. For example, if the first comparison identifies a natural language utterance in the database that corresponds to a textual UI element (e.g., a natural language utterance whose similarity is equal to or greater than a threshold value), the processor 120 may automatically access the identified textual UI element.

According to an example, if a specific textual UI element is identified as a result of the first comparison, the second comparison may be skipped. However, the disclosure is not limited thereto, and the second comparison may still be performed for more accurate matching even when the first comparison identifies a specific textual UI element.

In addition, depending on the result of the first comparison, the processor 120 may perform a second comparison of the utterance of the voice input against the natural language utterances corresponding to synonyms of the textual UI elements included in the database. For example, if the second comparison identifies a natural language utterance in the database that corresponds to a synonym of a textual UI element (e.g., a natural language utterance whose similarity is equal to or greater than a threshold value), the processor 120 may automatically access that textual UI element. In this case, the processor 120 may identify the textual UI element based on the first comparison result as well as the second comparison result.

According to an example, if a specific textual UI element is identified as a result of the second comparison, the third comparison may be skipped. However, the disclosure is not limited thereto, and the third comparison may still be performed for more accurate matching even when the second comparison identifies a specific textual UI element.

In addition, depending on the result of the second comparison, the processor 120 may perform a third comparison of the utterance of the voice input against the natural language utterances corresponding to the shapes of non-textual UI elements included in the database. Here, the non-textual UI element may be an actionable UI element. For example, if the third comparison identifies a natural language utterance in the database that corresponds to the shape of a non-textual UI element (e.g., a natural language utterance whose similarity is equal to or greater than a threshold value), the processor 120 may automatically access the identified non-textual UI element. In this case, the processor 120 may identify the non-textual UI element based on at least one of the first comparison result or the second comparison result, as well as the third comparison result.
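
The three comparison stages can be sketched as a simple cascade. The example below is only a minimal illustration under stated assumptions: the similarity measure is the character-level ratio from Python's standard `difflib` module (the disclosure does not name a particular measure), the three dictionaries and the 0.8 threshold are hypothetical, and the cascade stops at the first stage that yields a match, which is only one of the variants described above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; stands in for an unspecified measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cascade_match(utterance, text_db, synonym_db, shape_db, threshold=0.8):
    """Try the text, synonym, and shape stages in order.

    Each *_db maps a predicted utterance to an element id. Returns
    (stage_name, element_id) for the first stage whose best candidate
    reaches the threshold, or None if no stage matches.
    """
    stages = (("text", text_db), ("synonym", synonym_db), ("shape", shape_db))
    for stage, db in stages:
        best = max(db.items(),
                   key=lambda kv: similarity(utterance, kv[0]),
                   default=None)
        if best is not None and similarity(utterance, best[0]) >= threshold:
            return stage, best[1]
    return None


text_db = {"open notifications": "e1"}
synonym_db = {"open alerts": "e1"}
shape_db = {"click heart icon": "e2"}
result = cascade_match("open alerts", text_db, synonym_db, shape_db)
```

A variant that always runs all three stages and merges their results would match the "more accurate matching" alternative in the text; the early exit here simply keeps the sketch short.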

In addition, the processor 120 may obtain a training result of a neural network model stored in the database, and may identify whether there is a match based on the training result together with at least one of the first comparison result, the second comparison result, or the third comparison result. Here, the neural network model is a model trained on various training data to output a matching result, and is described in detail later.

In addition, the processor 120 may obtain a matching score for at least one UI element included in the screen. Here, the matching score may be obtained based on the similarity found by comparing the utterance of the voice input with the natural language utterance corresponding to the UI element in the database.

In this case, if the matching score of each of a plurality of UI elements included on the screen is equal to or greater than a first threshold value, and the difference between the matching scores of those UI elements is within a second threshold value, the processor 120 may provide a guide UI for selecting one of the plurality of UI elements. Here, the first threshold value is a value for identifying whether the utterance of the voice input is identical or similar to the natural language utterance corresponding to a UI element in the database, and may be preset or set by the user. The second threshold value is a value for identifying whether the matching degrees of the plurality of UI elements can be regarded as identical or similar, and may likewise be preset or set by the user.

In addition, if the matching score of one UI element is equal to or greater than the first threshold value and exceeds the matching scores of the remaining UI elements by more than the second threshold value, the processor 120 may access that UI element, that is, execute it. Even in this case, however, the processor 120 may provide a guide UI requesting the user to confirm execution of the UI element.

In addition, if the matching scores of all UI elements included in the screen are less than the first threshold value, the processor 120 may provide a guide UI that informs the user accordingly. For example, the processor 120 may provide information indicating that no UI element corresponding to the user's voice input could be identified. Even in this case, however, the processor 120 may present the UI element with the highest matching score as a recommended UI element.
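
The threshold logic described above can be summarized in a short decision function. This is a sketch only: the function name, the default threshold values, and the returned labels are hypothetical, and the handling of elements that fall below the first threshold while lying within the second threshold of the best element (they are ignored here) is an assumption, since the text does not address that case.

```python
def decide(scores, first_threshold=0.7, second_threshold=0.1):
    """Choose between accessing an element, asking the user, or reporting no match.

    scores maps element ids to matching scores. Returns (decision, element_ids).
    """
    candidates = {k: v for k, v in scores.items() if v >= first_threshold}
    if not candidates:
        # Nothing reaches the first threshold; optionally recommend the best element.
        best = max(scores, key=scores.get, default=None)
        return ("no_match", [best] if best is not None else [])
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    top = ranked[0]
    # Candidates whose score is within the second threshold of the best score.
    close = [e for e in ranked if candidates[top] - candidates[e] <= second_threshold]
    if len(close) > 1:
        return ("disambiguate", close)  # guide UI for choosing among them
    return ("access", [top])


decision = decide({"search_btn": 0.9, "search_tab": 0.85, "home": 0.3})
```

In the usage line, two elements score above the first threshold and within the second threshold of each other, so the function returns the "disambiguate" decision with both ids.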

In addition, the processor 120 may obtain a knowledge graph by identifying at least one of the position of each UI element, the relative position of each UI element with respect to the other UI elements, the function of each UI element, the capability of each UI element, the type of each UI element, and the shape of each UI element, and may store the obtained knowledge graph in the database.

In addition, the processor 120 may identify the similarity between UI elements based on at least one of the position of each UI element, the relative position of each UI element, the function of each UI element, the capability of each UI element, and the shape of each UI element, may obtain a knowledge graph by clustering the UI elements based on the identified similarity, and may store the obtained knowledge graph in the database.
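
To make the clustering step concrete, the following sketch groups elements whose characteristics largely agree. It is an illustration under assumptions: the similarity is simply the fraction of four hypothetical attributes (`type`, `function`, `shape`, `row`) that are equal, and the greedy comparison against the first member of each cluster is a stand-in for whatever clustering procedure an actual implementation would use.

```python
def element_similarity(a, b):
    """Fraction of the compared characteristics on which two elements agree."""
    keys = ("type", "function", "shape", "row")
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


def cluster_elements(elements, threshold=0.75):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []
    for el in elements:
        for cluster in clusters:
            if element_similarity(el, cluster[0]) >= threshold:
                cluster.append(el)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([el])
    return clusters


elements = [
    {"id": "home", "type": "tab", "function": "navigate", "shape": None, "row": 0},
    {"id": "notifications", "type": "tab", "function": "navigate", "shape": None, "row": 0},
    {"id": "like", "type": "icon", "function": "toggle", "shape": "heart", "row": 3},
]
clusters = cluster_elements(elements)
```

Here the two navigation tabs end up in one cluster and the heart icon in another, which is the kind of grouping a knowledge graph could record.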

In addition, the processor 120 may determine a textual representation of a non-textual UI element from the visible characteristics and relative position of each UI element.

In addition, the processor 120 may obtain a knowledge graph by mapping predefined information to each UI element and the corresponding capability of each UI element, and may store the obtained knowledge graph in the database. In this case, the processor 120 may obtain the knowledge graph by determining a predefined sequence of screens traversed through actionable UI elements.

In addition, the processor 120 may perform semantic translation on the knowledge graph to obtain natural language variations for at least one of a single-step intent and a multi-step intent, may use the knowledge graph to identify at least one action and at least one action sequence for at least one of the single-step intent and the multi-step intent, and may dynamically generate a natural language model for predicting the natural language utterances of the identified UI elements by mapping the obtained natural language variations to the identified actions and action sequences.
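
One way to picture how an action sequence for a multi-step intent could be derived from screen transitions recorded in a knowledge graph is a breadth-first search over those transitions. The sketch below is an assumption-laden illustration: the `transitions` structure (screen name to a mapping of element id to next screen), the function name, and the use of breadth-first search are all choices made for this example rather than details given in the disclosure.

```python
from collections import deque


def plan_action_sequence(transitions, start, goal):
    """Return the shortest list of element ids to activate to get from start to goal.

    transitions: {screen: {element_id: next_screen}} (hypothetical structure).
    Returns [] when start == goal and None when the goal cannot be reached.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if screen == goal:
            return path
        for element, next_screen in transitions.get(screen, {}).items():
            if next_screen not in seen:
                seen.add(next_screen)
                queue.append((next_screen, path + [element]))
    return None


transitions = {
    "home": {"settings_btn": "settings", "profile_btn": "profile"},
    "settings": {"account_row": "account"},
}
sequence = plan_action_sequence(transitions, "home", "account")
```

A single-step intent corresponds to a path of length one; a request for a sub-page that is not currently displayed corresponds to a longer path such as the two-element sequence in the usage line.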

In addition, the processor 120 may categorize each UI element of the knowledge graph into at least one of a domain, a verb, a synonym, a slot, a slot type, a textually represented slot, a capability, and a relative position on the screen 140, and may obtain the natural language variations based on the categorization.

In addition, the processor 120 may determine a predefined table set that includes information on the types of UI elements included on the screen 140 and their corresponding capabilities, verbs associated with the capabilities, capabilities for action information, a sequence graph of the elements actionable on the screen and the corresponding action sequences, and predefined actions and action sequences, and may determine an action routine based on the capability of each UI element. In this case, a unique identity may be assigned to each predefined action and action sequence.

In addition, the processor 120 may cluster similar natural language variations, assign a dynamic intent to the similar natural language variations, associate the at least one identified action and the at least one identified action sequence with the dynamic intent, dynamically generate a natural language model based on the clustered natural language variations, the dynamic intents, and the action routines, and store the natural language model in the database.
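
The association between natural language variations, dynamic intents, and action sequences can be pictured as a small lookup structure. In the sketch below, variations are produced by combining verbs with slot names, and each group is given a generated intent name; the function names, the `dynamic_intent_N` naming scheme, the action strings, and the assumption that the variations arrive already grouped (rather than being clustered by similarity) are all hypothetical simplifications.

```python
from itertools import product


def generate_variations(verbs, slot_names):
    """Combine every verb (or synonym) with every slot name."""
    return [f"{verb} {slot}" for verb, slot in product(verbs, slot_names)]


def build_nl_model(entries):
    """Build {intent: {"variations": [...], "actions": [...]}}.

    entries: iterable of (verbs, slot_names, action_sequence). Each entry is
    treated as one pre-grouped cluster of similar variations (assumption).
    """
    model = {}
    for index, (verbs, slots, actions) in enumerate(entries):
        intent = f"dynamic_intent_{index}"
        model[intent] = {
            "variations": generate_variations(verbs, slots),
            "actions": list(actions),
        }
    return model


model = build_nl_model([
    (("open", "show"), ("notifications", "alerts"), ("click:notifications_tab",)),
    (("like",), ("post",), ("click:heart_icon",)),
])
```

Looking up which intent contains a recognized utterance then yields the action sequence to run, which is the role the text assigns to the dynamically generated natural language model.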

In addition, the processor 120 may determine a textual representation score by reading the screen information included on the screen 140, extract synonyms for verbs and nouns from the textual representations and textual UI elements included in the screen information and assign a synonym score, determine a variance score associating the natural language variations used for weighted training of the dynamic language generator 170e, and determine a relevance score by comparing the reference objects included in the utterance of the voice input with nearby element information.

In addition, the processor 120 may determine a matching score as the final score for matching the utterance of the voice input against the dynamic language generator 170e, and may combine the determined matching score with the relevance score.

FIG. 2B is a block diagram illustrating the electronic device 100 for automatically accessing a UI element displayed on the screen 140 of the electronic device 100 based on a user's voice interaction with the electronic device 100, according to an embodiment.

In an embodiment, the electronic device 100 may include the memory 110, the processor 120, a communicator 130, a display 140, a sensor 150, an application controller 160, and an interaction engine 170. Detailed descriptions of the components in FIG. 2B that overlap with those shown in FIG. 2A are omitted.

The communicator 130 includes standards-based electronic circuitry that enables wired or wireless communication. The communicator 130 is configured to communicate internally between internal hardware components and with external devices via one or more networks.

The at least one sensor 150 (or 150a-n) may include, for example, an ambient light sensor, a 3-axis accelerometer, an altimeter, an optical heart rate sensor, an oxygen saturation (SpO2) monitor, a bioimpedance sensor, a proximity sensor, a compass, an electrocardiogram (ECG) sensor, a Global Positioning System (GPS) receiver, a gyroscope, a gesture sensor, an ultraviolet (UV) sensor, a magnetometer, an electrodermal activity sensor, a skin temperature sensor, and the like, but is not limited thereto.

The application controller 160 is configured to control at least one application 160a-160n of the electronic device 100. Examples of applications include, but are not limited to, a web application, a video player application, a camera application, a business application, an education application, a health application, a lifestyle application, an entertainment application, a utility application, and a travel application.

In an embodiment, the interaction engine 170 is implemented by processing circuitry such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits, and the like, and may optionally be driven by firmware. The circuits may, for example, be embodied in one or more semiconductors.

In an embodiment, the interaction engine 170 includes a context orchestrator 170a, a visual semantic natural language (NL) estimator 170b, an action sequence planner 170c, an NL synthesizer 170d, a dynamic language generator 170e (e.g., a voice assistant), and an artificial intelligence (AI) engine 170f.

In an embodiment, the context orchestrator 170a identifies at least one UI element displayed on the screen 140 of the electronic device 100 using the AI engine 170f. The at least one UI element (e.g., a click button, a home icon, a text bar, and the like) includes actionable UI elements, non-actionable UI elements, textual UI elements, and non-textual UI elements. The context orchestrator 170a also determines the characteristic(s) of the identified at least one UI element using the AI engine 170f. The characteristics of the identified at least one UI element include the position of each UI element, the relative position of each UI element among the other UI elements, the function of each UI element, the capability of each UI element, the type of each UI element, and the appearance of each UI element displayed on the screen 140 of the electronic device 100.

The context orchestrator 170a further uses the AI engine 170f to obtain a database/knowledge graph based on the characteristic(s) of the identified UI elements. To do so, the context orchestrator 170a determines the position of each UI element, its position relative to the other UI elements, its function, its capability, its type, and its appearance as displayed on the screen 140 of the electronic device 100. Based on these characteristics, it determines the similarity between each UI element and the other UI elements, and clusters the UI elements according to that similarity. It also determines, from visual features, a textual representation of each non-textual UI element and the relative position of each UI element among the other UI elements. It then maps predefined information between the UI elements to the capability of each corresponding UI element, and determines the predefined screen sequences reached through actionable UI elements. Finally, the context orchestrator 170a stores the obtained knowledge graph in the database.
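The similarity and clustering step described above can be sketched as follows. This is only an illustrative sketch: the `UIElement` fields, the three-term `similarity` score, the 20-pixel row tolerance, and the greedy `cluster` routine are all assumptions made for the example, since the disclosure does not specify a similarity measure or a clustering algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    # Hypothetical record of the characteristics named in the text.
    name: str
    x: int
    y: int
    kind: str                 # e.g. "Button", "Textbox"
    capabilities: frozenset   # e.g. {"Click", "Focus"}


def similarity(a: UIElement, b: UIElement) -> float:
    """Assumed score in [0, 1]: same type, shared capabilities, same row."""
    same_kind = 1.0 if a.kind == b.kind else 0.0
    union = a.capabilities | b.capabilities
    shared = len(a.capabilities & b.capabilities) / len(union) if union else 0.0
    same_row = 1.0 if abs(a.y - b.y) <= 20 else 0.0
    return (same_kind + shared + same_row) / 3.0


def cluster(elements, threshold=0.6):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for el in elements:
        for group in clusters:
            if similarity(el, group[0]) >= threshold:
                group.append(el)
                break
        else:
            clusters.append([el])
    return clusters


elements = [
    UIElement("Connect Device", 50, 1180, "Button", frozenset({"Click", "Focus"})),
    UIElement("Notification", 150, 1180, "Button", frozenset({"Click", "Focus"})),
    UIElement("Search", 400, 40, "Textbox", frozenset({"Write", "Focus"})),
]
groups = cluster(elements)
print([[el.name for el in group] for group in groups])
```

With these assumed inputs, the two buttons on the same row end up in one cluster and the text box in another.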

In one embodiment, the visual semantic NL estimator 170b performs a semantic translation of the obtained knowledge graph to obtain natural language variations for single-step and/or multi-step intents. To this end, it receives the knowledge graph from the context orchestrator 170a and categorizes each UI element of the received knowledge graph by domain, verb, synonym, slot, slot type, textually represented slot, capability, and relative position on the screen 140 of the electronic device 100. It then obtains the natural language variations based on this categorization.

In one embodiment, the action sequence planner 170c uses the obtained knowledge graph to identify actions and action sequences for single-step and/or multi-step intents. It receives the knowledge graph from the context orchestrator 170a and determines action routines based on the capability of each UI element.

In one embodiment, the natural language synthesizer 170d dynamically obtains a natural language model that predicts the natural language utterances of the identified UI elements by mapping the obtained natural language variations to the identified actions and action sequences. To do so, it clusters similar natural language variations, assigns a dynamic intent to each group of similar variations, and associates the dynamic intent with the identified action and the identified action sequence. It then dynamically obtains the natural language model based on the clustered natural language variations, the dynamic intents, and the action routines, and stores the model in the database.
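As a rough illustration of how variations, dynamic intents, and action sequences might be tied together, the sketch below groups variations that map to the same action sequence and gives each group an intent identifier. Grouping by shared action sequence stands in for the similarity clustering, which the text does not detail; the function name `build_nl_model` and the dictionary layout are hypothetical.

```python
def build_nl_model(variation_to_sequence):
    """Group variations that share an action sequence under one dynamic intent.

    Assumption: "similar" variations are those mapped to the same action
    sequence; the disclosure does not define the clustering criterion.
    """
    model = {}               # intent id -> {"variations": [...], "action_sequence": ...}
    intent_of_sequence = {}  # action sequence -> intent id
    for variation, sequence in variation_to_sequence.items():
        if sequence not in intent_of_sequence:
            intent_id = f"ID-{len(intent_of_sequence) + 1}"
            intent_of_sequence[sequence] = intent_id
            model[intent_id] = {"variations": [], "action_sequence": sequence}
        model[intent_of_sequence[sequence]]["variations"].append(variation)
    return model


nl_model = build_nl_model({
    "Click on the bell-shaped button": "AS1",
    "Click search button": "AS1",
    "Tap plus image": "AS1",
    "I like the photo": "AS2",
})
print(nl_model["ID-1"])
```

Storing the result keyed by intent keeps the lookup from a matched utterance to its action sequence a single dictionary access.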

In one embodiment, the dynamic language generator 170e receives input from the natural language synthesizer 170d. It also receives a voice input from the user of the electronic device 100, where the voice input includes an utterance indicating a characteristic of an identified UI element present in the database. The dynamic language generator 170e determines whether the utterance of the received voice input matches a predicted natural language utterance of the identified UI elements. If the utterance matches, the dynamic language generator 170e automatically accesses the corresponding UI element among the identified UI elements.

At least one of the modules/components described above may be implemented through the AI engine 170f. Functions related to the AI engine 170f may be performed through the memory 110 and the processor 120. One or more processors may control the processing of input information according to a predefined operation rule or the AI engine 170f stored in non-volatile memory or volatile memory. The predefined operation rule or artificial intelligence model is provided through training or learning.

Here, being provided through learning means that a predefined operation rule or an AI engine 170f with the desired characteristics is created by applying a learning process to a plurality of training data. The learning may be performed on the electronic device 100 itself, which performs the AI according to this embodiment, or may be implemented through a separate server/system.

The AI engine 170f may consist of a plurality of neural network layers. Each layer has a plurality of weights and performs its layer operation by combining the computation result of the previous layer with those weights. Examples of neural networks include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
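The layer operation mentioned above can be illustrated with a minimal fully connected layer, in which each output is a weighted sum of the previous layer's outputs. The bias term and the ReLU activation are assumptions added for the example; the text only states that a layer combines the previous layer's result with its weights.

```python
def relu(values):
    """Element-wise rectified linear unit (assumed activation)."""
    return [max(0.0, v) for v in values]


def dense_layer(inputs, weights, bias):
    """One layer operation: weighted sum of the previous layer's output, plus bias."""
    sums = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]
    return relu(sums)


# Two stacked layers; the output of the first feeds the second.
hidden = dense_layer([1.0, 2.0], weights=[[0.5, -1.0], [1.0, 1.0]], bias=[0.0, -1.0])
output = dense_layer(hidden, weights=[[1.0, 0.5]], bias=[0.5])
print(hidden, output)  # [0.0, 2.0] [1.5]
```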

The learning process is a method of training a given target device (e.g., a robot or an IoT device) using a plurality of training data so as to enable, allow, or control the target device to make decisions or predictions on its own. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Although FIGS. 2A and 2B show various hardware components of the electronic device 100, it should be understood that other embodiments are not limited thereto. In other embodiments, the electronic device 100 may include fewer or more components. In addition, the labels or names of the components are used only for illustrative purposes and do not limit the scope of the invention. One or more components may be combined to perform the same or substantially similar functions for voice-based interaction.

FIG. 3 is a flowchart 300 illustrating a method of automatically accessing a UI element displayed on the screen 140 of the electronic device 100 based on voice interaction between the user and the electronic device 100, according to an embodiment. Operations 302 to 312 may be performed by the electronic device 100.

According to the method illustrated in FIG. 3, at 302, at least one UI element displayed on the screen 140 of the electronic device 100 is identified.

At 304, the characteristic(s) of the identified at least one UI element are determined.

At 306, a database is obtained based on the characteristic(s) of the identified at least one UI element. The database includes natural language utterances of the identified at least one UI element, and the natural language utterances are predicted based on at least one characteristic of the identified at least one UI element.

At 308, a voice input of the user of the electronic device 100 is received. The voice input includes an utterance of a characteristic of the identified at least one UI element present in the database.

At 310, it is determined whether the utterance of the received voice input matches a predicted natural language utterance of the identified at least one UI element.

At 312, if the utterance of the voice input received from the user is determined to match a predicted natural language utterance of the identified at least one UI element, the corresponding UI element among the at least one UI element is automatically accessed.
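Operations 302 to 312 can be read as a small pipeline. The sketch below is a simplified illustration under several assumptions: the screen is a plain dictionary, predicted utterances come from two hard-coded phrase templates, and matching at 310 is an exact comparison after lowercasing. None of the function names appear in the disclosure.

```python
def identify_ui_elements(screen):
    """302: return the UI elements shown on the (hypothetical) screen dictionary."""
    return screen["elements"]


def determine_characteristics(element):
    """304: pick out the characteristics used in this sketch."""
    return {"name": element["name"], "shape": element.get("shape"), "type": element["type"]}


def build_database(elements):
    """306: map predicted utterances (assumed templates) to UI element names."""
    database = {}
    for element in elements:
        c = determine_characteristics(element)
        phrases = [f"click {c['name']}".lower()]
        if c["shape"]:
            phrases.append(f"click on the {c['shape']}-shaped {c['type']}".lower())
        for phrase in phrases:
            database[phrase] = element["name"]
    return database


def handle_voice_input(utterance, database):
    """308-312: receive the utterance, match it, and 'access' the element."""
    target = database.get(utterance.strip().lower())   # 310: exact match (assumption)
    return f"accessed:{target}" if target else None    # 312


screen = {"elements": [
    {"name": "Notification", "shape": "bell", "type": "button"},
    {"name": "Search", "type": "button"},
]}
database = build_database(identify_ui_elements(screen))
print(handle_voice_input("Click on the bell-shaped button", database))
```

A real implementation would replace the exact lookup with the matching performed by the dynamic language generator 170e; the dictionary only makes the data flow between the steps visible.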

The various actions, acts, blocks, steps, and the like in the flowchart 300 may be performed in the order presented, in a different order, or simultaneously. Further, in some embodiments, some of the actions, acts, blocks, steps, and the like may be omitted, added, modified, or skipped without departing from the scope of the invention.

FIGS. 4A and 4B are example flowcharts illustrating a method of predicting natural language utterances of UI elements of a video application displayed on the screen 140 of the electronic device 100 and dynamically building a model for simulating actions related to the displayed UI elements, according to an embodiment.

Referring to FIG. 4A, at 401, consider a scenario in which the user of the electronic device 100 is viewing the screen 140 and wants to check unread notifications of the video application (e.g., through a bell-shaped notification icon).

At 402 (platform 402), screen-reading APIs (e.g., accessibility 402a, optical character recognition (OCR) 402b, and image classification 402c) read the screen information and the corresponding characteristics. This screen information is supplied to the context orchestrator 170a to build the knowledge graph.

At 403, the context orchestrator 170a analyzes screen information such as the view hierarchy, the content of the UI elements (e.g., textual and non-textual representations), and the positions of the UI elements. It also derives inferences such as relative positions expressed textually and UI element capabilities. In addition, the context orchestrator 170a stores and learns information such as the screen sequences of a specific application (e.g., the video application).

In an embodiment, the context orchestrator 170a includes a relationship predictor 170aa, a textual representer 170ab, a UI element capability identifier 170ac, and a screen sequence detector 170ad. The relationship predictor 170aa groups UI elements (e.g., a bell-shaped icon, a search icon), determines their relative positions, and expresses those positions as text. Table 2 shows an example of the input and output of the relationship predictor 170aa.

Input | Output
Name: Connect Device; Position: 50, 1180; Type: Button | Relation: Left of bell-shaped Button; Group: {Textbox, Button}
Name: Notification; Position: 150, 1180; Type: Button | Relation: Right to connect device button; Group: {Textbox, Button}
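A relation such as "left of" or "right of" can be derived from element positions alone. The sketch below assumes that a position is an (x, y) pair in pixels and that elements within 20 pixels vertically share a row; the function name `predict_relations` and the output wording are illustrative, not taken from the disclosure.

```python
def predict_relations(elements, row_tolerance=20):
    """Describe each element relative to its nearest neighbours in the same row."""
    relations = {}
    for el in elements:
        same_row = [
            other for other in elements
            if other is not el and abs(other["y"] - el["y"]) <= row_tolerance
        ]
        right_side = [o for o in same_row if o["x"] > el["x"]]
        left_side = [o for o in same_row if o["x"] < el["x"]]
        described = []
        if right_side:
            nearest = min(right_side, key=lambda o: o["x"])
            described.append(f"Left of {nearest['name']}")
        if left_side:
            nearest = max(left_side, key=lambda o: o["x"])
            described.append(f"Right of {nearest['name']}")
        relations[el["name"]] = described
    return relations


relations = predict_relations([
    {"name": "Connect Device", "x": 50, "y": 1180},
    {"name": "Notification", "x": 150, "y": 1180},
])
print(relations)
```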

The textual representer 170ab converts every non-textual representation (non-textual content/UI element) into a textual representation (textual content/UI element) and derives inferences among the non-textual representations. Table 3 shows an example of the input and output of the textual representer 170ab.

Input | Output
Name: Home; Position: 50, 1180; Type: Button; Image Buffer<> | Text representation: Shape: Home
Name: Notification; Position: 150, 1180; Type: Button; Image Buffer<> | Text representation: Shape: Bell

The UI element capability identifier 170ac identifies the predefined capabilities of UI elements (e.g., actionable UI elements) and stores those capabilities in the context orchestrator 170a. Table 4 shows an example of the input and output of the UI element capability identifier 170ac.

Input | Output
Name: Home; Position: 50, 1180; Type: Button | Capability: Click, Focus

The screen sequence detector 170ad determines view screen transitions that contain further actionable UI elements, which are recorded as part of the context orchestrator 170a. Table 5 shows an example of the input and output of the screen sequence detector 170ad.

Input | Output
Application: YouTube/video application | Home Screen => Account => Data; Home Screen => Connect a device

The context orchestrator 170a then obtains the knowledge graph 404 using the functions of the relationship predictor 170aa, the textual representer 170ab, the UI element capability identifier 170ac, and the screen sequence detector 170ad. Table 6 shows an example of the obtained knowledge graph 404 for the notification (UI element).

UI element | Knowledge graph information
Notification | Loc: (1020, 10); Name: Notifications; Content shape: Bell; Content-type: Icon; Type: Button; Capability: Press, Focus

Referring to FIG. 4B, at 405, the visual semantic NL estimator 170b receives the obtained knowledge graph from the context orchestrator 170a.

The visual semantic NL estimator 170b categorizes each UI element of the received knowledge graph by domain, verb, synonym, slot, slot type, textually represented slot, capability, and relative position on the screen 140 of the electronic device 100. It then obtains natural language variations based on this categorization. Table 7 shows an example of the natural language categorization and natural language variations produced by the visual semantic NL estimator 170b for the notification (UI element).

Domain | Video
Application | YouTube
NL categorization (notification) | Relation: {Left of Search, Right of Connect}; Verb: Touch, Tap, Click; Slot: {Bell <Shape>, Notification <name>}; Type: {Button, Icon, Image}
NL variation | Click on the bell-shaped button; Tap the icon left to search icon
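One simple way to turn such a categorization into candidate utterances is to combine verbs, slots, and types through phrase templates. The sketch below assumes two templates (one for shape slots, one for name slots) and a hypothetical `generate_variations` helper; the disclosure does not state how the variations are actually produced.

```python
from itertools import product


def generate_variations(verbs, slots, types):
    """Combine verbs, slots, and UI element types into candidate utterances.

    `slots` holds (value, slot_kind) pairs, e.g. ("Bell", "shape").
    The two phrase templates are assumptions made for this sketch.
    """
    variations = []
    for verb, (value, slot_kind), ui_type in product(verbs, slots, types):
        if slot_kind == "shape":
            variations.append(f"{verb} the {value.lower()}-shaped {ui_type.lower()}")
        else:
            variations.append(f"{verb} the {value.lower()} {ui_type.lower()}")
    return variations


variations = generate_variations(
    verbs=["Touch", "Tap", "Click"],
    slots=[("Bell", "shape"), ("Notification", "name")],
    types=["Button", "Icon", "Image"],
)
print(len(variations), variations[0])
```

Relation-based phrases such as "the icon left of the search icon" could be added with a third template that reads the relation entries of the categorization.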

At 406, the action sequence planner 170c receives the obtained knowledge graph from the context orchestrator 170a. Based on the capability of each UI element, it determines the action routines/action sequence graph for single-step and multi-step commands/intents. It also determines a predefined set of action routines/sequences associated with the list of capabilities found in the user interface. The action routines/sequences may be static or dynamic. Static action sequences are tabulated because they do not change. Dynamic action sequences are built at runtime based on the screen sequences in the knowledge graph. Table 8 shows examples of the action routines of the action sequence planner 170c for UI elements.

UI element type | Capability | Action routine
Button | Click | Subroutine: Click; Parameters: UI element position
Text box | Click | Subroutine: Write; Parameters: UI Element, Content
Text box | Focus | Subroutine: Focus; Parameters: UI Element
List box | Scroll up | Subroutine: Scroll Up; Parameters: UI Element
Multi-step Ex: switch account | Click (Account) > Click (Switch Account)
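Because static action routines do not change, they lend themselves to a lookup table keyed by UI element type and capability. The sketch below mirrors the single-step rows of Table 8; the parameter key names (`position`, `element`, `content`) and the `plan_action` helper are hypothetical.

```python
# Static routine table: (UI element type, capability) -> (subroutine, required parameters).
ACTION_ROUTINES = {
    ("Button", "Click"): ("Click", ["position"]),
    ("Text box", "Click"): ("Write", ["element", "content"]),
    ("Text box", "Focus"): ("Focus", ["element"]),
    ("List box", "Scroll up"): ("Scroll Up", ["element"]),
}


def plan_action(element_type, capability, **params):
    """Look up the subroutine for a capability and check its parameters."""
    subroutine, required = ACTION_ROUTINES[(element_type, capability)]
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing parameters for {subroutine}: {missing}")
    return {"subroutine": subroutine, "parameters": {name: params[name] for name in required}}


step = plan_action("Button", "Click", position=(150, 1180))
print(step)
```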

The action sequence planner 170c also determines the action sequence identity based on the graph of actionable UI elements on the screen, which is stored in the knowledge of the context orchestrator 170a. This graph may be a predefined input, or it may be learned by observing the user's screen navigation actions (both touch-based and voice-based).
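A dynamic action sequence can be thought of as a path through the graph of actionable UI elements. The sketch below uses a breadth-first search over a small hand-written graph; the graph contents (including treating "Switch Account" as a destination screen) and the `find_action_sequence` helper are assumptions for illustration, not the planner's actual algorithm.

```python
from collections import deque

# Hypothetical graph: screen -> {actionable UI element: screen reached by clicking it}.
SCREEN_GRAPH = {
    "Home Screen": {"Account": "Account", "Connect a device": "Connect a device"},
    "Account": {"Data": "Data", "Switch Account": "Switch Account"},
}


def find_action_sequence(graph, start, target):
    """Breadth-first search for the shortest list of clicks from start to target."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        screen, clicks = queue.popleft()
        if screen == target:
            return clicks
        for element, next_screen in graph.get(screen, {}).items():
            if next_screen not in visited:
                visited.add(next_screen)
                queue.append((next_screen, clicks + [f"Click({element})"]))
    return None  # target not reachable from start


sequence = find_action_sequence(SCREEN_GRAPH, "Home Screen", "Switch Account")
print(" > ".join(sequence))  # Click(Account) > Click(Switch Account)
```

Building the sequence at runtime means a new screen only needs to be added to the graph, not to a hand-written list of routines.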

At 407 and 408, the natural language synthesizer 170d receives inputs from the visual semantic NL estimator 170b and the action sequence planner 170c. The natural language synthesizer 170d clusters similar natural language variations, which are determined by the visual semantic NL estimator 170b. It then assigns a dynamic intent to the similar natural language variations and associates the dynamic intent with the identified action and the identified action sequence, which are determined by the action sequence planner 170c. Table 9 shows an example of similar natural language variations with their dynamic intent and action sequence in the natural language synthesizer 170d.

Dynamic intent | ID-1
NL variations | Click on the bell-shaped button; Click search button; Tap plus image
Action sequence | AS1

At 409, the dynamic language generator 170e receives input from the natural language synthesizer 170d and dynamically obtains a natural language model based on the clustered natural language variations, the dynamic intents, and the action routines. The dynamic language generator 170e also stores the dynamically obtained natural language model in the database. At 410, the dynamic language generator 170e receives a voice input (e.g., "click the bell-shaped button"). The dynamic language generator 170e determines whether the utterance of the received voice input matches a predicted natural language utterance of the at least one identified UI element. At 411 and 412, if the utterance of the voice input received from the user matches a predicted natural language utterance of the at least one identified UI element, the dynamic language generator 170e automatically accesses the UI element (e.g., the notification) of the video application displayed on the screen 140 of the electronic device 100. As a result, the user can use natural language to access the identified UI element, which improves the user experience. The proposed method simplifies user interaction by letting the user operate a button displayed on the screen of the electronic device 100 naturally, even if the user does not know the actual name of that button. The proposed method expresses the attributes and relationships of non-textual elements in textual form.
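The matching decision at 410 to 412 can be sketched with simple text normalization followed by approximate string matching from the Python standard library. The 0.8 cutoff, the normalization rule, and the use of `difflib.get_close_matches` are assumptions for this example; the disclosure does not describe the matching algorithm at this level of detail.

```python
import re
from difflib import get_close_matches


def normalize(text):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_utterance(utterance, predicted, cutoff=0.8):
    """Return the action sequence of the closest predicted utterance, or None.

    `predicted` maps predicted utterances to action sequence identifiers.
    """
    normalized = {normalize(phrase): sequence for phrase, sequence in predicted.items()}
    hits = get_close_matches(normalize(utterance), list(normalized), n=1, cutoff=cutoff)
    return normalized[hits[0]] if hits else None


predicted = {
    "Click on the bell-shaped button": "AS1",
    "I like the photo": "AS2",
}
print(match_utterance("click on the bell shaped button!", predicted))  # AS1
print(match_utterance("play some music", predicted))                   # None
```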

FIG. 5 shows an example scenario in which the electronic device 100 predicts natural language utterances of UI elements of a social media application displayed on the screen 140 and dynamically builds a model for simulating actions related to the displayed UI elements, according to an embodiment.

At 501, consider a scenario in which the user of the electronic device 100 is browsing a social media application (e.g., F-share) and wants to express, in a very natural way, that they like a picture displayed on the screen 140.

The technical flow is the same as that described with reference to FIGS. 4A and 4B. At 502, the context orchestrator 170a obtains the knowledge graph using the functions of the relationship predictor 170aa, the textual representer 170ab, the UI element capability identifier 170ac, and the screen sequence detector 170ad. Table 10 shows an example of the obtained knowledge graph for the notification (UI element).

UI element | Knowledge graph information
Notification | Loc: (700, 10); Name: Like; Content shape: Heart; Content-type: icon; Type: Button; Capability: Press, Focus

The visual semantic NL estimator 170b then receives the obtained knowledge graph from the context orchestrator 170a. It categorizes each UI element of the received knowledge graph by domain, verb, synonym, slot, slot type, textually represented slot, capability, and relative position on the screen 140 of the electronic device 100, and obtains natural language variations based on this categorization. Table 11 shows an example of the natural language categorization and natural language variations produced by the visual semantic NL estimator 170b for the notification (UI element).

Domain | Image
Application | F-share
NL categorization (notification) | Relation: {Bottom to photo, Above Android}; Verb: Touch, Tap, Click …; Slot: {Heart, Like, Love}; Type: {Button, Icon, Image}
NL variation | Click on the heart-shaped icon; I like the photo

The action sequence planner 170c then receives the obtained knowledge graph from the context orchestrator 170a. Based on the capability of each UI element, it determines the action routines/action sequence graph for single-step and multi-step commands/intents. It also determines a predefined set of action routines/sequences associated with the list of capabilities obtained from the user interface. The action routines/sequences may be static or dynamic. Static action sequences are tabulated because they do not change. Dynamic action sequences are built at runtime based on the screen sequences in the knowledge graph (the dynamic action sequence is obtained in real time while the user controls the screen; this differs from the traditional approach, in which a developer had to design the action sequence explicitly).

The natural language synthesizer 170d receives inputs from the visual semantic NL estimator 170b and the action sequence planner 170c. It clusters similar natural language variations, which are determined by the visual semantic NL estimator 170b. The natural language synthesizer 170d then assigns a dynamic intent to the similar natural language variations and associates the dynamic intent with the identified action and the identified action sequence, which are determined by the action sequence planner 170c. Table 12 shows an example of similar natural language variations with their dynamic intent and action sequence in the natural language synthesizer 170d.

Table 12
Dynamic intent: ID-2
NL variations: "Click on the like icon"; "I like the photo"; "Click on the love button"
Action sequence: AS2
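The grouping performed by the natural language synthesizer can be sketched as follows. This is only an illustration: grouping by a hypothetical target-element key stands in for the similarity clustering, and the function `synthesize`, the "comment" entries, and the sequence name "AS1" are assumptions added for the example.

```python
from collections import defaultdict

def synthesize(variations, action_sequences):
    """Group variations per target element, then attach an intent ID and action sequence."""
    groups = defaultdict(list)
    for text, target in variations:
        groups[target].append(text)
    table = {}
    # Sorting the targets keeps the assigned IDs deterministic.
    for index, (target, texts) in enumerate(sorted(groups.items()), start=1):
        table[f"ID-{index}"] = {
            "nl_variations": texts,
            "action_sequence": action_sequences[target],
        }
    return table

variations = [
    ("Click on the like icon", "like"),
    ("I like the photo", "like"),
    ("Click on the love button", "like"),
    ("Open the comments", "comment"),  # hypothetical extra variation
]
intents = synthesize(variations, {"like": "AS2", "comment": "AS1"})
```

Each dynamic intent then carries both its utterance variations and the action sequence to run when one of them is matched.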

The dynamic language generator 170e receives input from the natural language synthesizer 170d and dynamically generates a natural language model based on the collected natural language variations, the dynamic intent, and the action routines. The dynamic language generator 170e stores the dynamically generated natural language model in the database. The dynamic language generator 170e receives a voice input (e.g., "I like the photo") and determines whether the utterance of the received voice input matches the predicted natural language utterances of the at least one identified UI element. When the utterance matches, the dynamic language generator 170e automatically accesses the UI element (e.g., a heart-shaped icon) of the social media application displayed on the screen 140 of the electronic device 100. As a result, the user can use natural language to access the identified UI element(s), which improves the user experience.

FIG. 6 illustrates a scoring mechanism for automatically accessing a UI element displayed on the screen 140 of the electronic device 100 based on a voice interaction with the electronic device 100, according to an embodiment.

The textual representation score 601 is determined by reading the screen information displayed on the screen 140 of the electronic device 100, where the screen information takes the form of textual representations/text UI elements. The context orchestrator 170a includes the textual representation score 601. The synonym score 602 is assigned while synonyms of verbs and nouns are extracted from the textual representation/text UI element(s). The greater the distance between a synonym and the main (textual) representation, the lower the synonym score 602. The visual semantic natural language predictor 170b includes the synonym score 602 and receives input from the textual representation score 601. The variation score 603 is determined by associating the generated natural language variations, which may be used to perform weight learning, with the dynamic language generator 170e. The relevance score 604 is determined by comparing neighboring element information with the reference objects mentioned in the utterance of the received voice input. The matching score 605 is the final score, obtained by matching the utterance of the received voice input against the dynamic language generator 170e and combining the result with the relevance score 604.
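The statement that the synonym score 602 falls as a synonym moves away from the main representation can be illustrated with a short sketch. The linear decay, the `decay` parameter, and the distances assigned to each word are assumptions chosen for illustration; the disclosure does not specify the scoring function.

```python
def synonym_score(distance, decay=0.1):
    """Return 1.0 for the main expression and a lower score for more distant synonyms."""
    return max(0.0, round(1.0 - decay * distance, 2))

# Hypothetical distances between each verb and the main expression "click".
distances = {"click": 0, "tap": 1, "touch": 3}
scores = {word: synonym_score(d) for word, d in distances.items()}
```

Any monotonically decreasing function of the distance would serve the same purpose here.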

In an embodiment, the interaction engine 170 extracts the best-matching content from the screen information. When multiple matches exist, the scores are used for disambiguation.

FIG. 7 illustrates an example scenario in which the electronic device 100 uses the scoring mechanism to predict natural language utterances of a UI element of a social media application displayed on the screen 140 and to dynamically build models for simulating the action associated with the displayed UI element, according to an embodiment.

At 701, consider a scenario in which the user of the electronic device 100 is browsing a social media application (e.g., F-share) and expresses, in a very natural way, that he or she likes a photo displayed on the screen 140. In this scenario, the interaction engine 170 retrieves the different scores and selects the best match among the matches found on the screen. The screen 140 contains two heart-shaped icons, named Element-1 and Element-2, so it has to be determined which of them best fits the user's utterance.

The context orchestrator 170a generates a knowledge graph using the functions of the relevance measurer 170aa, the text representer 170ab, the UI element capability determiner 170ac, and the screen sequence detector 170ad. Table 13 shows an example of the generated knowledge graph for a notification (UI element).

Table 13
UI element: Notification
Knowledge graph information:
Loc: (700, 10)
Name: Like
Content shape: Heart (0.98) "textual representation score (601)"
Content-type: icon
Type: Button
Capability: Press, Focus
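A knowledge graph entry of this kind could be held in a small record type, as in the following sketch. The class name `UIElementNode`, its field names, and the helper `supports` are hypothetical; the score value 0.98 is an assumed reading of the table entry for the content shape.

```python
from dataclasses import dataclass, field

@dataclass
class UIElementNode:
    """One UI element as stored in a (hypothetical) knowledge graph."""
    name: str
    location: tuple
    content_shape: str
    textual_representation_score: float
    content_type: str
    element_type: str
    capabilities: list = field(default_factory=list)

notification = UIElementNode(
    name="Like",
    location=(700, 10),
    content_shape="Heart",
    textual_representation_score=0.98,  # assumed reading of the table entry
    content_type="icon",
    element_type="Button",
    capabilities=["Press", "Focus"],
)

def supports(node, capability):
    """Check whether a UI element offers the given capability."""
    return capability in node.capabilities
```

A planner could call `supports(notification, "Press")` before scheduling a press action on the element.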

In addition, the visual semantic natural language predictor 170b receives the knowledge graph generated by the context orchestrator 170a. The visual semantic natural language predictor 170b categorizes each UI element of the received knowledge graph by domain, verb, synonym, slot, slot type, textually represented slot, capability, and relative position on the screen 140 of the electronic device 100. The visual semantic natural language predictor 170b then generates natural language variations based on this categorization. Table 14 shows an example of the natural language categorization and natural language variations for a notification (UI element) in the visual semantic natural language predictor 170b.

Table 14
Domain: Image
Application: F-share
Relevance: {below photo, above Android}
NL categorization (Notification):
Verb: touch (0.7), tap (0.9), click (1.0) "synonym score (602)"
Slot: {heart (1.0), like (0.8), love (0.8)} "synonym score (602)"
Type: {button, icon, image}
NL variations:
"Click on the heart shape icon" (0.98) "variation score (603)"
"I like the photo" (0.8) "variation score (603)"
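One way to picture how variations might be produced from the categorized verbs and slots is sketched below. The template "verb the slot type" and the use of a product of synonym scores as the variation score are assumptions for illustration only; they are not stated in the disclosure and do not reproduce the variation scores listed in the table.

```python
from itertools import product

verbs = {"touch": 0.7, "tap": 0.9, "click": 1.0}
slots = {"heart": 1.0, "like": 0.8, "love": 0.8}

def generate_variations(verbs, slots, element_type="icon"):
    """Combine every verb with every slot and rank the results by score."""
    variations = []
    for (verb, verb_score), (slot, slot_score) in product(verbs.items(), slots.items()):
        text = f"{verb} the {slot} {element_type}"
        # Assumed scoring rule: product of the two synonym scores.
        variations.append((text, round(verb_score * slot_score, 2)))
    return sorted(variations, key=lambda item: item[1], reverse=True)

ranked = generate_variations(verbs, slots)
```

The highest-ranked entries would be the most literal descriptions of the element, with synonym-based phrasings ranked lower.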

In addition, the action sequence planner 170c receives the knowledge graph generated by the context orchestrator 170a. The action sequence planner 170c determines an action routine/action sequence graph for single-step and multi-step commands/intents based on the capability of each UI element. The action sequence planner 170c also determines a predefined set of action routines/sequences associated with the capability list found in the user interface. An action routine/sequence may be static or dynamic. A static action sequence is tabulated because the sequence does not change. A dynamic action sequence is built on the fly based on the screen sequence of the knowledge graph.

In addition, the natural language synthesizer 170d receives inputs from the visual semantic natural language predictor 170b and the action sequence planner 170c. The natural language synthesizer 170d clusters similar natural language variations determined by the visual semantic natural language predictor 170b. The natural language synthesizer 170d then assigns a dynamic intent to the similar natural language variations and associates the dynamic intent with the identified action and the identified action sequence determined by the action sequence planner 170c. Table 15 shows an example of similar natural language variations together with the dynamic intent and action sequence in the natural language synthesizer 170d.

In addition, the dynamic language generator 170e receives input from the natural language synthesizer 170d and dynamically generates a natural language model based on the clustered natural language variations, the dynamic intent, and the action routines. The dynamic language generator 170e stores the dynamically generated natural language model in the database. The dynamic language generator 170e receives a voice input ("I like the photo") and determines whether the utterance of the received voice input matches the predicted natural language utterances of the at least one identified UI element. Table 16 shows an example of candidate estimation in the dynamic language generator 170e.

Table 16
Candidate estimation
Element-1: Heart, Loc (700, 10), (0.8)
Element-2: Heart, Loc (980, 1200), (0.8)

In addition, the dynamic language generator 170e determines the relevance score 604 by comparing neighboring element information with the reference objects mentioned in the utterance of the received voice input (i.e., "I like the photo" (0.9)).

In addition, the dynamic language generator 170e determines the matching score 605 as the final score. Table 17 shows an example of the matching score 605 in the dynamic language generator 170e.

Table 17
Final output
Element-1: Heart, Loc (700, 10), (0.9) (winner)
Element-2: Heart, Loc (980, 1200), (0.8)

When the utterance of the received voice input matches the predicted natural language utterances of the identified UI element, the dynamic language generator 170e automatically accesses the UI element (e.g., the heart icon (Element-1)) of the social media application displayed on the screen 140 of the electronic device 100. In this scenario, the user mentions "like" and "photo", so the interaction engine 170 operates the Element-1 heart icon based on Table 17.

At 702, consider a scenario in which the dynamic language generator 170e receives a voice input (e.g., "Click the heart icon"). The dynamic language generator 170e determines whether the utterance of the received voice input matches the predicted natural language utterances of the identified UI element. Table 18 shows an example of candidate estimation in the dynamic language generator 170e.

In addition, the dynamic language generator 170e determines the relevance score 604 by comparing neighboring element information with the reference objects mentioned in the utterance of the received voice input (i.e., "Click the heart icon" (0)). The dynamic language generator 170e then determines the matching score 605 as the final score. Table 19 shows an example of the matching score 605 in the dynamic language generator 170e.

Table 19
Final output
Element-1: Heart, Loc (700, 10), (0.8)
Element-2: Heart, Loc (980, 1200), (0.8)

In this scenario there are multiple matches with the same score, so a disambiguation flow needs to be started to let the user select one of them. Because the utterance indicates the target element without any supporting information, the relevance score 604 is zero. Therefore, this flow presents a disambiguation user interface to the user.
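The two scenarios (701 and 702) can be contrasted in a small sketch. The combination rule (taking the larger of the candidate score and the relevance score when an element neighbors a referenced object), the `margin` parameter, the neighbor sets, and the score values 0.8 and 0.9 as readings of the tables are all assumptions made for illustration; the disclosure does not state how the scores are combined.

```python
def final_scores(candidates, neighbors, referenced, relevance):
    """Combine candidate scores with the relevance score (assumed rule: max)."""
    result = {}
    for name, score in candidates.items():
        near_reference = bool(neighbors[name] & referenced)
        result[name] = max(score, relevance) if near_reference else score
    return result

def pick(scores, margin=0.05):
    """Return the winning element, or the list of tied candidates to disambiguate."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    tied = [name for name, score in ranked if best_score - score <= margin]
    return best_name if len(tied) == 1 else tied

candidates = {"Element-1": 0.8, "Element-2": 0.8}
# Hypothetical neighboring objects of each heart icon.
neighbors = {"Element-1": {"photo", "Android"}, "Element-2": {"profile"}}

# "I like the photo": the photo is referenced, relevance score 0.9.
like_photo = pick(final_scores(candidates, neighbors, {"photo"}, 0.9))
# "Click the heart icon": no reference object, relevance score 0.
click_heart = pick(final_scores(candidates, neighbors, set(), 0.0))
```

When `pick` returns a list rather than a single name, a caller would start the disambiguation flow and ask the user to choose among the listed elements.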

FIG. 8 illustrates a comparison between existing methods, which have a limited set of control and navigation commands, and the proposed method of predicting natural language utterances of UI elements for a multi-step intent, according to an embodiment of the present disclosure.

801 to 803 show the existing methods. Consider a scenario in which the user wants to save/bookmark a page of a browsing application displayed on the screen of an existing electronic device 10.

In this scenario, the existing methods cannot access a sub-function or sub-page of the browsing application that is not displayed on the screen of the electronic device 10, so the user has to carry out the step-by-step procedure manually. With the existing methods, the user can access only the UI elements displayed on the screen of the electronic device 10.

804 to 806 show the method proposed in the present disclosure, in which the user can access a sub-function or sub-page of the displayed application/UI element that is not displayed on the screen of the electronic device by using natural language (i.e., by providing "save as bookmark" as a voice input to the electronic device 100), which improves the user experience.

FIGS. 9A to 9C illustrate functions related to the natural language synthesizer 170d of the electronic device 100 for automatically accessing a UI element displayed on the screen 140, according to an embodiment.

The natural language synthesizer 170d performs various functions, such as an NL generation task 901, a dynamic intent label 902, a dynamic language model 903, and a dynamic pseudo capsule 904. The NL generation task 901 includes the following modules. A semantic feature builder accumulates all screen information and the corresponding grammar tags. A command encoder uses the grammar tags to obtain variations with slot placeholders. A command decoder applies real-time screen information, i.e., tag values and their alternatives, to the variations output by the command encoder and obtains the final utterance variations. The dynamic intent label 902 includes the following modules. An unsupervised similarity classifier identifies similar utterances and sorts them into buckets. A dynamic label generator assigns a dynamic ID to the group of similar utterances in each bucket. The dynamic language model 903 includes the following modules. A slot expansion module substitutes actual slot values into the variations to obtain complete utterances. A language model builder uses the obtained variations to build a language model in real time. In the dynamic pseudo capsule 904, an intent and action determination module receives inputs from the dynamic intent label 902 and the action sequence planner 170c, and stores a prediction based on the received inputs in the memory 110 (e.g., a Bixby pseudo dynamic capsule).
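The encoder/decoder split of the NL generation task 901 can be sketched as follows. The function names `encode_commands` and `decode_commands`, the template wording, and the dictionary layout of the grammar tags and screen values are hypothetical and serve only to illustrate the idea of slot placeholders that are filled in from real-time screen information.

```python
def encode_commands(grammar_tags):
    """Produce variation templates with slot placeholders from grammar tags."""
    return [f"{verb} the {{slot}} {{type}}" for verb in grammar_tags["verbs"]]

def decode_commands(templates, screen_values):
    """Fill the placeholders with real-time screen values and their alternatives."""
    return [
        template.format(slot=slot, type=screen_values["type"])
        for template in templates
        for slot in screen_values["slots"]
    ]

templates = encode_commands({"verbs": ["click", "tap"]})
utterances = decode_commands(templates, {"slots": ["heart", "like"], "type": "icon"})
```

Because the templates are independent of any particular screen, only the decoding step needs to be repeated when the screen content changes.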

In an embodiment, the interaction engine 170 supports the generation of single-intent and multi-intent utterances based on single-step or multi-step actions. The interaction engine 170 determines named entities, together with domain and category information, from the screen text and uses this information to generate open-domain natural language utterances. The generated natural language utterances are categorically trained into a dynamic language model, which further helps improve the accuracy of the Audio Speech Recognition (ASR) and Natural Language based End Point Detection (NLEPD) of the Voice Assistance (VA) (e.g., Bixby in this case). In addition, the interaction engine 170 dynamically plugs natural language intent and action pairs into the VA NLU system for intent prediction, action planning, and execution. Context information obtained in structures such as natural language (NL), NL categorical intent, and the dynamic language model (dynamic LM) may be refreshed on every change of the screen 140. Multi-modal information is fused to disambiguate the selection of on-screen content by voice, and to associate and distinguish names for similar control types across different applications.

The interaction engine 170 categorizes natural language (NL) into intents and associates the generated NL with action routines, which removes the need to explicitly train the user's voice interaction for any application. In addition, by associating screen action sequences, the interaction engine 170 helps establish single-intent/multi-intent NL for single-action or multi-action operations. Using the generated categorical NL, a dynamic language model is obtained that assists ASR in more accurate speech-to-text (STT) recognition and assists NLEPD in determining the end point detection (EPD) for untrained utterances.

In addition, the interaction engine 170 uses improved named entity recognition in a speech recognition module, for example, the Bixby ASR module, and uses the dynamically obtained language model to make the EPD contextually accurate. Because the interaction engine 170 uses dynamic NL capsule development from screen context information, no explicit NL training is required. The interaction engine 170 can also disambiguate screen control contents using multi-modal fusion with screen context information and dynamic action sequences based on screen transition sequences.

Embodiments of the present disclosure may be implemented using at least one hardware device that performs network management functions to control the components.

In addition, the methods according to the various embodiments described above may be implemented merely by a software upgrade or a hardware upgrade of an existing electronic device.

In addition, the various embodiments described above may be performed through an embedded server provided in the electronic device or through an external server of the electronic device.

Meanwhile, according to an embodiment, the various embodiments described above may be implemented as software including instructions stored in a machine-readable storage medium (e.g., readable by a computer). The machine is a device that can call a stored instruction from the storage medium and operate according to the called instruction, and may include an electronic device (e.g., the electronic device A) according to the disclosed embodiments. When an instruction is executed by a processor, the processor may perform the function corresponding to the instruction directly, or by using other components under the control of the processor. An instruction may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" only means that the storage medium does not include a signal and is tangible; it does not distinguish whether data is stored in the storage medium semi-permanently or temporarily.

In addition, according to an embodiment, the methods according to the various embodiments described above may be provided as part of a computer program product. The computer program product may be traded between a seller and a buyer as a commodity. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., compact disc read only memory (CD-ROM)) or online through an application store (e.g., Play Store™). In the case of online distribution, at least a portion of the computer program product may be at least temporarily stored, or temporarily generated, in a storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

In addition, each of the components (e.g., a module or a program) according to the various embodiments described above may consist of a single entity or a plurality of entities, and some of the sub-components described above may be omitted, or other sub-components may be further included in the various embodiments. Alternatively or additionally, some components (e.g., modules or programs) may be integrated into a single entity that performs, identically or similarly, the functions performed by each of the corresponding components before integration. Operations performed by a module, a program, or another component according to the various embodiments may be executed sequentially, in parallel, repetitively, or heuristically; at least some operations may be executed in a different order or omitted; or other operations may be added.

Although preferred embodiments have been illustrated and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the disclosure pertains without departing from the gist of the claims, and such modifications should not be understood separately from the technical idea or perspective of the present disclosure.

100: electronic device 110: memory
120: processor

Claims

A control method of an electronic device 100, the method comprising:
identifying at least one user interface (UI) element included in a screen 140 of the electronic device 100;
identifying at least one characteristic of the identified UI element;
obtaining a database including natural language utterances obtained based on the at least one characteristic of the identified UI element;
when a voice input is received, identifying whether an utterance of the received voice input matches a natural language utterance included in the obtained database; and
automatically accessing the at least one UI element when the utterance of the voice input is identified as matching the natural language utterance,
wherein the at least one characteristic of the identified UI element comprises at least one of:
a position of each UI element, a relative position of each UI element with respect to other UI elements, a function of each UI element, a capability of each UI element, a type of each UI element, or an appearance of each UI element.

The method of claim 1,
wherein the at least one UI element comprises
at least one of an executable UI element, a non-executable UI element, a text UI element, or a non-text UI element.

The method of claim 1,
wherein the identifying whether the utterance matches the natural language utterance comprises:
a first comparison of comparing the utterance of the voice input with natural language utterances corresponding to text UI elements included in the database;
a second comparison of comparing, according to a result of the first comparison, the utterance of the voice input with natural language utterances corresponding to synonyms of the text UI elements included in the database;
a third comparison of comparing, according to a result of the second comparison, the utterance of the voice input with natural language utterances corresponding to shapes of non-text UI elements included in the database; and
identifying whether the utterance matches based on at least one of the first comparison result, the second comparison result, or the third comparison result.

The method of claim 1,
wherein the identifying whether the utterance matches the natural language utterance comprises:
obtaining a learning result of a neural network model stored in the database; and
identifying whether the utterance matches based on the learning result and at least one of the first comparison result, the second comparison result, or the third comparison result.

The method of claim 1,
wherein the identifying whether the utterance matches the natural language utterance comprises
obtaining a matching score for at least one UI element included in the screen, and
wherein the automatically accessing the at least one UI element comprises:
obtaining a matching score for at least one UI element included in the screen;
providing a guide UI for selecting any one of a plurality of UI elements when the matching score of each of the plurality of UI elements is equal to or greater than a first threshold value and the difference between the matching scores of the plurality of UI elements is within a second threshold value; and
when the matching score of one UI element is equal to or greater than the first threshold value and exceeds the matching scores of the other UI elements by more than the second threshold value, executing the one UI element or providing a guide UI requesting the user's confirmation of execution of the one UI element.

6. The method of claim 5,
The step of automatically accessing the at least one UI element comprises:
when the matching scores of all UI elements included in the screen are less than the first threshold value, providing at least one of a guide UI notifying the user of the corresponding information or a guide UI including at least one recommended UI element.
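
The two-threshold decision described in the two preceding claims can be sketched as below. This is an illustrative sketch only: the function name `decide_action`, the action labels, and the example threshold values are hypothetical. It also assumes that only elements at or above the first threshold compete with the best-scoring element, a case the claims do not spell out.

```python
def decide_action(scores, first_threshold, second_threshold):
    """Map element scores to a hypothetical action label and the elements involved.

    scores: dict of element_id -> matching score.
    """
    # Keep only elements that clear the first threshold, best score first.
    candidates = sorted(
        ((score, elem) for elem, score in scores.items() if score >= first_threshold),
        reverse=True,
    )
    if not candidates:
        # No element is a plausible match: notify the user or recommend elements.
        return "notify_or_recommend", []

    best_score, best_elem = candidates[0]
    # Elements whose score is within the second threshold of the best one.
    close = [elem for score, elem in candidates if best_score - score <= second_threshold]
    if len(close) > 1:
        # Several near-equal matches: show a guide UI so the user can pick one.
        return "disambiguate", close
    # One clear winner: execute it, or ask the user to confirm execution.
    return "execute_or_confirm", [best_elem]


print(decide_action({"play": 0.90, "playlist": 0.88, "pause": 0.30}, 0.7, 0.05))
```

Keeping the decision separate from the scoring lets the thresholds be tuned without touching the matching logic.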

According to claim 1,
Acquiring the database includes:
identifying a degree of similarity between the UI elements based on at least one of the position of each UI element, a relative position of each UI element, a function of each UI element, a performance of each UI element, and a shape of each UI element;
obtaining a knowledge graph by clustering each UI element based on the identified similarity; and
storing the acquired knowledge graph in the database.

According to claim 1,
Acquiring the database includes:
determining a textual representation of a non-text UI element from the visual characteristics and relative position of each UI element.

According to claim 1,
Acquiring the database includes:
obtaining a knowledge graph by mapping each UI element and predefined information corresponding to the performance of each UI element; and
storing the acquired knowledge graph in the database.

According to claim 1,
Acquiring the database includes:
obtaining a knowledge graph by determining a predefined screen sequence transferred through an operable UI element on the screen; and
storing the acquired knowledge graph in the database.

According to claim 1,
performing semantic translation on the knowledge graph to obtain natural language variants for at least one of a single-step intent and a multi-step intent;
identifying at least one action and at least one action sequence for at least one of the single-step intent and the multi-step intent using the obtained knowledge graph; and
dynamically generating a natural language model for predicting a natural language utterance of the identified UI element by mapping the obtained natural language variants to the identified action and the identified action sequence.

12. The method of claim 11,
The step of performing the semantic translation comprises:
categorizing each UI element of the knowledge graph into at least one of a domain, a verb, a synonym, a slot, a slot type, a slot expressed as text, a performance, and a relative position on the screen 140; and
obtaining the natural language variant based on the categorization.

The method of claim 11,
The step of identifying the action and the action sequence comprises:
determining a predefined table set including information on the types of UI elements included on the screen 140 and the performance corresponding thereto, performance-related verbs, performance for action information, an element sequence graph operable on the screen and corresponding action sequences, and predefined actions and action sequences; and
determining an action routine based on the performance of each UI element,
wherein a unique identity is assigned to the predefined actions and action sequences.

The method of claim 11,
The step of dynamically generating the natural language model comprises:
clustering the similar natural language variants;
assigning dynamic intent to the similar natural language variants;
associating the at least one identified action and the at least one identified sequence of actions with the dynamic intent;
dynamically generating the natural language model based on the clustered natural language variant, the dynamic intent, and the action routines; and
storing the dynamically generated natural language model in the database.
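
The clustering, intent assignment, and action association steps above can be sketched as follows. This is a hedged illustration, not the disclosed model: `build_model`, the `intent_N` naming, the action strings, and the use of greedy token-overlap (Jaccard) clustering with a 0.5 threshold are all assumptions chosen to keep the example small.

```python
def jaccard(a, b):
    """Token-set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_model(variants, threshold=0.5):
    """Cluster similar variants and attach a dynamic intent and actions to each cluster.

    variants: list of (utterance_text, action_sequence) pairs.
    Returns: dict of intent name -> {"utterances": [...], "actions": [...]}.
    """
    clusters = []
    for text, actions in variants:
        tokens = set(text.lower().split())
        for cluster in clusters:
            if jaccard(tokens, cluster["tokens"]) >= threshold:
                # Similar enough: merge into the existing cluster.
                cluster["utterances"].append(text)
                cluster["tokens"] |= tokens
                for action in actions:
                    if action not in cluster["actions"]:
                        cluster["actions"].append(action)
                break
        else:
            # No similar cluster found: start a new one.
            clusters.append(
                {"tokens": tokens, "utterances": [text], "actions": list(actions)}
            )
    # Assign a generated (dynamic) intent name to every cluster.
    return {
        f"intent_{i}": {"utterances": c["utterances"], "actions": c["actions"]}
        for i, c in enumerate(clusters)
    }


model = build_model([
    ("play the first video", ["click:video_1"]),
    ("play first video", ["click:video_1"]),
    ("open settings", ["click:settings"]),
])
print(model)
```

The resulting dictionary is the kind of structure that could then be stored in the database and rebuilt whenever the screen changes.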

According to claim 1,
The step of identifying whether the utterance matches the natural language utterance comprises:
determining a text expression score by reading screen information included on the screen (140);
extracting synonyms for verbs and nouns from text expressions and text UI elements included in the screen information and assigning a synonym score;
determining a variance score correlating the natural language variants used for weighted learning for a dynamic language generator (170E);
determining a relevance score by comparing reference objects included in the utterance of the received speech input and proximity element information; and
determining, as a final score, a matching score that matches the utterance of the received speech input against the dynamic language generator (170E), and combining the determined matching score with the relevance score.
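
One way the individual scores above might be combined is sketched below. The claim does not state a formula, so everything here is an assumption made for illustration: the function name `final_score`, the weighted sum, the default weights, and the convention that every score lies in the range 0 to 1.

```python
def final_score(text_score, synonym_score, variance_score, relevance_score,
                weights=(0.4, 0.3, 0.3), relevance_weight=0.5):
    """Blend the component scores into a single final score (hypothetical formula).

    The matching score is a weighted sum of the text, synonym, and variance
    scores; it is then averaged with the relevance score using relevance_weight.
    """
    components = (text_score, synonym_score, variance_score)
    matching_score = sum(w * s for w, s in zip(weights, components))
    return (1.0 - relevance_weight) * matching_score + relevance_weight * relevance_score


# Example: strong text match, weak synonym match, moderate relevance.
print(final_score(0.9, 0.2, 0.5, 0.6))
```

A linear blend is the simplest choice; a learned model could replace it without changing the inputs.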

According to claim 1,
The at least one characteristic comprises
at least one of a relative position of each UI element, a function of each UI element, a performance of each UI element, a type of each UI element, and a shape of each UI element.

According to claim 1,
The database is
generated in real time based on the current screen 140 of the electronic device 100.

An electronic device 100 comprising:
a memory 110; and
a processor 120 connected to the memory 110 to control the electronic device 100,
wherein the processor 120 is configured to:
identify at least one UI (User Interface) element included in the screen 140 of the electronic device 100,
identify at least one characteristic of the identified UI element,
acquire a database including natural language utterances obtained based on the at least one characteristic of the identified UI element,
when a voice input is received, identify whether the utterance of the received voice input matches a natural language utterance included in the acquired database, and
automatically access the at least one UI element when the utterance of the voice input is identified as matching the natural language utterance,
wherein the at least one characteristic of the identified UI element comprises at least one of:
the position of each UI element, the relative position of each UI element with respect to other UI elements, the function of each UI element, the capability of each UI element, the type of each UI element, or the appearance of each UI element.

19. The electronic device of claim 18,
wherein the processor is configured to:
first compare the utterance of the voice input with natural language utterances corresponding to text UI elements included in the database,
second compare the utterance of the voice input with a natural language utterance corresponding to a synonym of a text UI element included in the database according to the first comparison result,
third compare the utterance of the voice input with a natural language utterance corresponding to a shape of a non-text UI element included in the database according to the second comparison result,
obtain a learning result of a neural network model stored in the database, and
identify whether the match is made based on the learning result and at least one of the first comparison result, the second comparison result, or the third comparison result.

The electronic device of claim 18,
wherein the processor is configured to:
obtain a matching score for at least one UI element included in the screen,
provide a guide UI for selecting any one of a plurality of UI elements when the matching score of each of the plurality of UI elements is equal to or greater than a first threshold value and the difference between the matching scores of the plurality of UI elements is within a second threshold value, and
when the matching score of one UI element is equal to or greater than the first threshold value and exceeds the matching scores of the other UI elements by more than the second threshold value, execute the one UI element or provide a guide UI requesting the user's confirmation of execution of the one UI element.