TWI795823B - Electronic device and its operation method - Google Patents
Electronic device and its operation method
- Publication number
- TWI795823B
- Authority
- TW
- Taiwan
- Prior art keywords
- information
- processor
- electronic device
- image
- gaze
- Prior art date
Landscapes
- User Interface Of Digital Computer (AREA)
- Electrophonic Musical Instruments (AREA)
Abstract
Description
The present invention relates to an electronic device and an operating method thereof, and in particular to an electronic device capable of automatically executing tasks given by a user, and an operating method thereof.
E-commerce refers to transaction activities and related service activities conducted over the Internet or through electronic transactions. In e-commerce applications, the user must click function keys to perform the relevant operations, for example clicking a button on a webpage with the mouse to open a specified product page or to add a specified product to the shopping cart. Depending on the goal to be achieved, the user therefore has to move back and forth across the page. If the user clicks the wrong function key, it takes even more time to accomplish the original goal. An alternative is to operate with the assistance of a voice assistant, where the user issues commands by voice. However, the voice assistant may find the utterance ambiguous or misunderstand it, so the task cannot be completed correctly.
Therefore, a solution is needed that provides both convenient operation for the user and correct task execution.
The present invention provides an electronic device and an operating method thereof that offer both convenient operation for the user and correct task execution.
The present invention provides an operating method of an electronic device, where the electronic device includes a display panel, an eye tracking module, and a processor, and the processor is coupled to the display panel and the eye tracking module. The operating method includes: the eye tracking module performs eye tracking according to an eye image of the user to generate gaze image information; the processor generates gaze coordinate information according to the gaze image information and the resolution of the display panel; the processor performs semantic analysis on voice information from the user to generate object information and task information; the processor generates selected object information according to the object information and the gaze coordinate information; and the processor executes a corresponding task action according to the selected object information and the task information.
The present invention provides an electronic device including a display panel, an eye tracking module, a voice input module, and a processor. The eye tracking module generates gaze image information. The voice input module receives voice information. The processor is coupled to the display panel, the voice input module, and the eye tracking module, and is configured to: generate gaze coordinate information according to the gaze image information and the resolution of the display panel; perform semantic analysis on the voice information to generate object information and task information; generate selected object information according to the object information and the gaze coordinate information; and execute a corresponding task action according to the selected object information and the task information.
The present invention further provides an operating method of an electronic device, where the electronic device includes a display panel, an eye tracking module, a voice input module, and a processor. The operating method includes: the display panel plays a video; the processor captures the current frame of the video in response to first voice information received by the voice input module; the processor performs a first semantic analysis on the first voice information to generate first object information and first task information; the eye tracking module performs first eye tracking according to a first eye image to generate first gaze image information; the processor generates first gaze coordinate information according to the first gaze image information and the resolution of the display panel; the processor determines a partial frame of the current frame according to the first gaze coordinate information; the processor determines an object of interest according to the partial frame; the processor generates first selected object information according to the object of interest and the first object information; and the processor executes a corresponding first task action according to the first selected object information and the first task information.
Based on the above, the present invention replaces the mouse cursor with eye tracking and combines it with a voice assistant function to execute task actions according to the results of semantic analysis. This avoids both erroneous clicks by the user and misunderstandings by the voice assistant, providing convenient operation as well as correct task execution.
FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present invention. FIG. 2 is a flowchart of an operating method of an electronic device according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2 together, the electronic device 100 includes a display panel 110, an eye tracking module 120, a voice input module 130, and a processor 140. The display panel 110 displays a frame. The eye tracking module 120 measures the position of the gaze point of the user's eyes, or the movement of the eyeballs relative to the head, according to an eye image of the user, thereby implementing eye tracking. The eye tracking module 120 performs eye tracking to generate gaze image information (step S210). The voice input module 130 receives voice information from the user.
The user's eye image can be captured by a camera module (not shown) of the electronic device 100. The eye image may include a Purkinje image and a pupil image. The Purkinje image is formed by reflection on the corneal surface as a light beam enters the eyeball. The pupil image may be a bright pupil image or a dark pupil image. In one embodiment, the eye tracking module 120 may be an eye tracker that performs tracking using the feature that varies during eye movement (the pupil) and the feature that remains fixed (the Purkinje image). In detail, the eye tracking module 120 alternately emits near-infrared light toward the eye from light sources in different positions, so as to acquire a bright pupil image and a dark pupil image in two adjacent frames. The position of the pupil image can be obtained by differencing the superimposed bright pupil and dark pupil images. The gaze image information includes the position of the Purkinje image, the position of the pupil image, and the relative direction and distance between the two.
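The bright/dark pupil differencing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the module's actual implementation: the frames are plain grayscale grids, and the threshold value and function name are invented for the example.

```python
def locate_pupil(bright, dark, threshold=50.0):
    """Estimate the pupil center from adjacent bright/dark pupil frames.

    Under on-axis IR illumination the pupil appears bright; under off-axis
    illumination it appears dark, so the per-pixel difference peaks there.
    """
    xs, ys = [], []
    for y, (brow, drow) in enumerate(zip(bright, dark)):
        for x, (b, d) in enumerate(zip(brow, drow)):
            if b - d > threshold:        # pixel belongs to the pupil region
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("no pupil found; try lowering the threshold")
    return sum(xs) / len(xs), sum(ys) / len(ys)   # centroid of the region

# Synthetic 9x9 frames with a "pupil" lit at the center in the bright frame.
bright = [[200.0 if 3 <= x <= 5 and 3 <= y <= 5 else 0.0 for x in range(9)]
          for y in range(9)]
dark = [[0.0] * 9 for _ in range(9)]
print(locate_pupil(bright, dark))  # (4.0, 4.0)
```

A real tracker would additionally locate the Purkinje glint and use the pupil-to-glint vector, as the paragraph above notes; the centroid step is the same idea.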
The processor 140 is coupled to the display panel 110, the eye tracking module 120, and the voice input module 130. Based on the resolution of the display panel 110, the processor 140 generates gaze coordinate information from the gaze image information (step S220). The processor 140 also performs semantic analysis on the voice information to generate corresponding object information and task information (step S230). The processor 140 further generates selected object information according to the object information and the gaze coordinate information (step S240), and executes a corresponding task action according to the selected object information and the task information (step S250).
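Step S220 amounts to scaling a gaze estimate into the panel's pixel space. The sketch below assumes the tracker reports a normalized gaze point in [0, 1] on each axis; the patent's actual mapping is calibrated from the Purkinje/pupil geometry, so this is only an illustration of the resolution-dependent step.

```python
def gaze_to_screen(norm_x, norm_y, width, height):
    """Map a normalized gaze point (0..1 per axis) to pixel coordinates,
    clamped to the visible area of the display panel."""
    px = min(max(int(norm_x * width), 0), width - 1)
    py = min(max(int(norm_y * height), 0), height - 1)
    return px, py

print(gaze_to_screen(0.5, 0.25, 1920, 1080))   # (960, 270)
print(gaze_to_screen(1.2, -0.1, 1920, 1080))   # clamped: (1919, 0)
```

Clamping matters because a gaze estimate can momentarily fall outside the panel while the user still intends the nearest on-screen element.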
FIG. 3 is a flowchart of an operating method of an electronic device according to the first embodiment of the present invention. FIG. 4 is a schematic diagram of a usage scenario of the first embodiment of the present invention. Referring to FIG. 1, FIG. 3, and FIG. 4 together, in step S310 the electronic device 100 displays webpage information, for example a product page, on the display panel 110. In step S320, when the user's gaze falls on a target 401 of the page, the eye tracking module 120 generates gaze image information from the eye image of the user's eyes looking in direction D1. In step S330, the processor 140 generates gaze coordinate information according to the resolution of the display panel 110 and the gaze image information. The gaze coordinate information indicates the specific position on the display area where the user's gaze lands. In step S340, the processor 140 wakes the voice assistant according to the gaze coordinate information, and then branches on whether the voice input module 130 receives voice information from the user (step S350). If no voice information is received, the method returns to step S320. If voice information is received, the processor 140 converts it into text information and performs semantic analysis on the text information to generate object information and task information (step S360).
In detail, a speech recognition module (not shown) can recognize the user's voice information using a speech recognition algorithm, such as speech-to-text (STT) recognition, to generate text information. Semantic analysis can be performed by a semantic analysis module (not shown). For example, the semantic analysis module may use a natural language understanding (NLU) algorithm to produce an analysis result from the text information. The purpose of an NLU algorithm is to let the system read the text information, understand the text and language, and extract information, in support of downstream tasks such as text classification, syntactic analysis, and information retrieval. NLU first identifies the words in the text information and distinguishes their parts of speech. A probabilistic language model can then predict the words likely to follow. Common probabilistic language models include N-gram models, recurrent neural network (RNN) models, and sequence-to-sequence (Seq2Seq) models. In one embodiment, a sequence-to-sequence model with an attention mechanism may also be used. Word embedding is the most common way to train a probabilistic language model: the training words are one-hot encoded and then fed into the model.
Word2Vec (word to vector) is a word embedding model proposed by the Google team. It is trained on a large corpus by machine learning so that vectors represent the meanings of words. Because the semantic information of words is very high-dimensional, principal component analysis (PCA) can be used for dimensionality reduction, discarding the dimensions of the vector that carry little information. Through natural language understanding, intent recognition and entity extraction can be performed on the text information; the result of entity extraction is the object information, and the result of intent recognition is the task information. For example, the user's voice information may be "add this dress to my favorites", "add this dress to the shopping cart", or "search for more information about this dress". Speech recognition and semantic analysis then yield the object information (the dress) and the task information (add to favorites, add to the shopping cart, or search for more information).
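The intent/entity split in the examples above can be illustrated with a deliberately simple keyword matcher. A real system would use a trained NLU model as the text describes; the phrase tables and intent labels below are invented for the illustration.

```python
INTENTS = {  # task information: keyword -> intent label (hypothetical labels)
    "favorite": "add_to_favorites",
    "shopping cart": "add_to_cart",
    "search": "search_info",
}
ENTITIES = ["dress", "shirt", "shoes"]  # object vocabulary for the example

def parse_utterance(text):
    """Return (object information, task information) for a voice command."""
    text = text.lower()
    obj = next((e for e in ENTITIES if e in text), None)
    task = next((label for kw, label in INTENTS.items() if kw in text), None)
    return obj, task

print(parse_utterance("Add this dress to the shopping cart"))
# ('dress', 'add_to_cart')
```

The point of the sketch is the output shape: one entity slot (the object) and one intent slot (the task), which is exactly what steps S360 and S370 consume.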
In this embodiment, speech recognition and semantic analysis can be performed by a cloud server (not shown), for example through cloud services such as LUIS (Language Understanding Intelligent Service) or Google Dialogflow. That is, the processor 140 transmits the received user voice information to the cloud server and receives the final analysis result from it. The invention is not limited to this: in one embodiment, speech recognition is performed locally (on the electronic device 100) while semantic analysis is performed on the cloud server, and in another embodiment both speech recognition and semantic analysis are performed locally.
In step S370, the processor 140 generates selected object information according to the gaze coordinate information and the object information. Taking FIG. 4 as an example, since the user's gaze lands on the target 401, the processor 140 can generate the selected object information from the position of the target 401 on the display area of the display panel 110 (that is, the gaze coordinate information) and the object information (for example, a dress). For instance, the processor 140 can lock onto a piece of picture information in the webpage according to the gaze coordinate information, and generate the selected object information from that picture information and the object information. In short, taking adding a webpage product to the shopping cart as an example, detecting the movement of the user's gaze replaces moving the cursor by hand; when the gaze comes to rest at one spot, the voice assistant function is triggered, and issuing a voice command replaces clicking the left mouse button to add the product to the shopping cart.
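One way to picture step S370 is as a hit-test of the gaze coordinates against the bounding boxes of page elements, filtered by the object type from the utterance. The element list, its `id`/`type` fields, and the box layout here are assumptions for the sketch, not the patent's actual page model.

```python
def select_object(gaze_xy, elements, object_info):
    """Pick the page element whose box contains the gaze point and whose
    type matches the object information from semantic analysis."""
    gx, gy = gaze_xy
    for elem in elements:
        x, y, w, h = elem["box"]
        if x <= gx < x + w and y <= gy < y + h and elem["type"] == object_info:
            return elem["id"]
    return None  # gaze not on a matching element

page = [
    {"id": "img-dress-42", "type": "dress", "box": (100, 200, 300, 400)},
    {"id": "btn-checkout", "type": "button", "box": (500, 50, 120, 40)},
]
print(select_object((250, 380), page, "dress"))  # 'img-dress-42'
```

Filtering by the spoken object type is what lets the combination tolerate a slightly inaccurate gaze point: a gaze that drifts onto a button still selects the dress image when the user said "dress".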
Referring again to FIG. 3, in step S380 the processor 140 executes the corresponding task action according to the selected object information and the task information. Task actions include, but are not limited to: putting the product corresponding to the selected object information into an online shopping cart; adding the URL corresponding to the selected object information to the favorites; searching the Internet for information about the selected object (such as size and color); finding, among multiple shopping websites, the one that carries the corresponding product with the highest relevance; and comparing the prices of the corresponding product across multiple shopping websites to find the one with the lowest price. After the purchase action is executed, the processor 140 generates feedback information (step S390), such as the product price, product information, or the purchase page. In one embodiment, the processor 140 can drive a speaker (not shown) to emit a sound prompting the user that the task is complete. In one embodiment, the processor 140 can control the display panel 110 to display text and/or pictures prompting the user that the task is complete.
FIG. 5 is a flowchart of an operating method of an electronic device according to the second embodiment of the present invention. FIG. 6 is a schematic diagram of a usage scenario of the second embodiment of the present invention. Referring to FIG. 1, FIG. 5, and FIG. 6 together, in step S501 a video is played on the display panel 110. For the details of steps S502 to S505, refer to the description of steps S320 to S350 of the first embodiment shown in FIG. 3; they are not repeated here. In step S506, when the voice input module 130 receives voice information from the user, the processor 140 captures the current frame. Before this, the user needs to pause the video manually so that the processor 140 does not capture a non-target frame. In step S507, the processor 140 determines a partial frame 502 of the current frame according to the gaze coordinate information, where the area of the partial frame 502 is smaller than the current frame, and determines an object of interest from the partial frame 502. In step S508, the processor 140 converts the voice information into text information and performs semantic analysis on it to generate object information and task information. In step S509, the processor 140 generates selected object information according to the object of interest and the object information. For the details of steps S510 and S511, refer to the descriptions of steps S380 and S390 of FIG. 3 respectively; they are not repeated here.
Referring to FIG. 1 and FIG. 6 together, in one operating scenario, when the user pauses the video, gazes at the clothing of a character on screen, and says "where can I buy this piece of clothing", the processor 140 can parse the voice information into object information such as "clothing", "buy", and "where", and task information of "find a website". The processor 140 can determine the object of interest 503 and identify its color, and can judge the sleeve style, neckline style, and garment outline through edge detection and shape analysis. In this way, keywords such as "white", "long-sleeved", "crew neck", and "top" can be obtained. The processor 140 can use these keywords to look up the product sales page with the highest relevance and display it on the display panel 110. In addition, the processor 140 can prompt the user, through at least one of text, pictures, and voice, that "this piece of clothing can be purchased here".
Specifically, when the user's gaze lands on the target 501, the processor 140 determines a partial frame 502 of the current frame according to the gaze coordinate information, and determines the object of interest 503 within the partial frame 502. The processor 140 can expand outward from the gaze coordinate information to produce a partial frame 502 of a predetermined size; in one embodiment the expansion is centered on the user's gaze point. Besides a predetermined size, the extent of the partial frame 502 can also be determined dynamically: in one embodiment, when the processor 140 cannot determine an object of interest 503 from the partial frame 502 (possibly because the partial frame 502 is too small), the processor 140 enlarges the partial frame 502.
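The grow-on-failure loop just described can be sketched as follows. The initial size, growth factor, and detector callback are assumptions for the illustration; the patent specifies only that the partial frame is enlarged when no object of interest is found.

```python
def crop_around(frame_w, frame_h, cx, cy, size):
    """Axis-aligned square crop of `size` centered on the gaze point,
    clamped to the frame bounds. Returns (x0, y0, x1, y1)."""
    half = size // 2
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, frame_w), min(cy + half, frame_h)
    return x0, y0, x1, y1

def find_object_of_interest(frame_w, frame_h, gaze, detect,
                            size=64, grow=2, max_size=1024):
    """Run a detector on a crop around the gaze point, enlarging the crop
    whenever detection fails (the dynamic variant of step S507)."""
    cx, cy = gaze
    while size <= max_size:
        box = crop_around(frame_w, frame_h, cx, cy, size)
        obj = detect(box)          # caller-supplied detector on the crop
        if obj is not None:
            return obj
        size *= grow               # partial frame too small: enlarge it
    return None

# Toy detector: succeeds only once the crop is at least 256 px wide.
detector = lambda box: "shirt" if box[2] - box[0] >= 256 else None
print(find_object_of_interest(1920, 1080, (960, 540), detector))  # 'shirt'
```

Capping the growth (`max_size`) keeps the loop from degenerating into running the detector on the whole frame when nothing near the gaze point is recognizable.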
As for the details of determining the object of interest, in one embodiment the processor 140 determines the object of interest 503 from the pixel at the gaze point corresponding to the gaze coordinate information together with multiple neighboring pixels, where the color information of the pixel corresponding to the gaze coordinate information is similar to the color information of those neighboring pixels. Taking FIG. 6 as an example, the processor 140 can start from the color information of the pixel of the target 501 (for example, white) and find neighboring pixels with similar color information, thereby determining the object of interest 503.
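This neighbor-expansion rule is essentially region growing: a flood fill gated by color similarity to the seed pixel. The grayscale representation, tolerance value, and 4-connectivity below are assumptions for the sketch.

```python
from collections import deque

def grow_region(image, seed, tol=10):
    """Collect the 4-connected pixels whose value is within `tol` of the
    seed pixel's value; the result approximates the object of interest."""
    h, w = len(image), len(image[0])
    sx, sy = seed
    seed_val = image[sy][sx]
    region, queue = {(sx, sy)}, deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < w and 0 <= ny < h and (nx, ny) not in region
                    and abs(image[ny][nx] - seed_val) <= tol):
                region.add((nx, ny))
                queue.append((nx, ny))
    return region

# A 5x5 frame with a bright 2x2 "garment" patch; the gaze lands inside it.
img = [[0] * 5 for _ in range(5)]
for x, y in ((1, 1), (2, 1), (1, 2), (2, 2)):
    img[y][x] = 200
print(sorted(grow_region(img, (1, 1))))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

A production version would compare full RGB distances and possibly smooth the frame first, but the stopping criterion, color similarity to the gazed-at pixel, is the same as in the paragraph above.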
Alternatively, the processor 140 can use deep learning techniques to perform image recognition on the partial frame 502 to determine the object of interest. The recognition types include image classification, classification with localization, object detection, and instance segmentation. Image classification assigns the source image to a class, and each image is judged as exactly one class. Classification with localization additionally marks the position and size of a single object. Object detection further marks the positions and sizes of multiple objects. Instance segmentation labels individual instances, where objects of the same class are distinguished by their respective positions and sizes; it is especially suitable when objects overlap.
FIGs. 7A, 7B, 8A, and 8B are schematic diagrams of image recognition performed on a partial frame using deep learning techniques according to the present invention. Referring to FIG. 1 and FIG. 7A together, in one embodiment the processor 140 can determine the object of interest 702 from the partial frame 701 through image classification. Trained on a large set of labeled images, a deep learning model can classify quickly and accurately. Logistic regression can be adopted as the classification model, with an activation function (for example, a sigmoid activation function) used to estimate the probability of each class, and the class with the highest probability is taken. The processor 140 then determines the object of interest 702 from the classification result. Referring to FIG. 1 and FIG. 7B together, in one embodiment the processor 140 can classify the partial frame 701 and perform localization according to the gaze coordinate information, thereby determining the object of interest 702. Object localization requires predicting the four parameters of a bounding box 703, and the target vector includes a multi-dimensional matrix of bounding box information.
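The classification step above, per-class probabilities through a sigmoid followed by taking the most probable class, can be sketched as follows. The class list and logit values are invented for the example; a real model would produce the logits from convolutional features of the partial frame.

```python
import math

CLASSES = ["top", "trousers", "shoes"]  # hypothetical label set

def sigmoid(z):
    """Sigmoid activation, mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(logits):
    """Turn one logit per class into probabilities and pick the winner,
    as in the logistic-regression classifier described above."""
    probs = [sigmoid(z) for z in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

label, p = classify([2.0, -1.0, 0.3])
print(label)      # 'top'
print(p > 0.85)   # True, since sigmoid(2.0) is about 0.88
```

Because the sigmoid is monotonic, taking the argmax over probabilities is the same as over raw logits; the probabilities are still useful as a recognition score, as the object-detection paragraph below mentions.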
Referring to FIG. 1 and FIG. 8A together, in one embodiment the processor 140 can determine the object of interest from the partial frame 801 through object detection. Object detection involves bounding boxes and object labels: its target vector adds the four predicted bounding box parameters to the one-dimensional vector encoding the class label. The source image only needs pre-processing before being fed to the deep learning model, which produces recognition results including object position, object size, object class, and recognition score. Unlike image classification and classification with localization, object detection applies classification and localization to multiple objects (such as objects 802 to 805) rather than a single one. Referring to FIG. 1 and FIG. 8B together, in one embodiment the processor 140 can determine the object of interest from the partial frame 801 through instance segmentation. The goal of instance segmentation is to cut out the complete outline of an object, not merely a position and a rough extent. Mask R-CNN (mask region-based convolutional neural network), derived for this purpose, has become the basis of such deep learning models; mask_inception and mask_resnet, for example, are common models at present. When instance segmentation yields multiple objects (such as objects 802 to 805), the processor 140 selects one of them as the object of interest. In this embodiment, the processor 140 may consist of, for example, one or more processing or control circuits such as a central processing unit (CPU), a microcontroller unit (MCU), or a field programmable gate array (FPGA), but the invention is not limited thereto.
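When detection or segmentation returns several candidates, a natural tie-breaker, consistent with the gaze-driven design though not spelled out by the patent, is to keep the instance whose box center is closest to the gaze point. The detection records below are invented for the sketch.

```python
def pick_instance(detections, gaze_xy):
    """Choose one detected object: the instance whose bounding-box center
    lies closest to the user's gaze point."""
    gx, gy = gaze_xy

    def dist2(det):
        x, y, w, h = det["box"]          # (x, y, width, height)
        cx, cy = x + w / 2, y + h / 2
        return (cx - gx) ** 2 + (cy - gy) ** 2   # squared distance suffices

    return min(detections, key=dist2)

dets = [
    {"label": "person", "box": (10, 10, 50, 120)},
    {"label": "top",    "box": (200, 80, 60, 70)},
    {"label": "shoes",  "box": (210, 260, 40, 30)},
]
print(pick_instance(dets, (225, 110))["label"])  # 'top'
```

Other selection rules are equally plausible, for example filtering by the object information from the utterance first and only then falling back to gaze distance.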
The present invention can also execute the steps of the second embodiment first and then the steps of the first embodiment; this serves as a third embodiment. For the details of the first and second embodiments, refer to the descriptions of FIG. 3 and FIG. 5 respectively; they are not repeated here. For example, after the steps shown in FIG. 5 have finished, a product page is opened. The user can then follow the steps of the first embodiment shown in FIG. 3 to quickly and accurately perform a task action on the target product (for example, adding it to the shopping cart) through eye tracking and with the assistance of the voice assistant.
To sum up, the present invention replaces the mouse cursor with eye tracking and combines it with a voice assistant function to execute task actions according to the results of semantic analysis. This avoids erroneous clicks by the user, and also reduces or avoids the voice assistant misjudging ambiguous utterances or misunderstanding them. As a result, task actions can be executed quickly and accurately without interrupting the user's browsing, providing both convenient operation and correct task execution.
100: electronic device
110: display panel
120: eye tracking module
130: voice input module
140: processor
401: target
501: target
502: partial frame
503: object of interest
701: partial frame
702: object of interest
703: bounding box
801: partial frame
802~805: objects
D1: direction
S210~S250, S310~S390, S501~S511: steps

FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
FIG. 2 is a flowchart of an operating method of an electronic device according to an embodiment of the present invention.
FIG. 3 is a flowchart of an operating method of an electronic device according to the first embodiment of the present invention.
FIG. 4 is a schematic diagram of a usage scenario of the first embodiment of the present invention.
FIG. 5 is a flowchart of an operating method of an electronic device according to the second embodiment of the present invention.
FIG. 6 is a schematic diagram of a usage scenario of the second embodiment of the present invention.
FIGs. 7A, 7B, 8A, and 8B are schematic diagrams of image recognition performed on a partial frame using deep learning techniques according to the present invention.

S210~S250: steps
Claims (30)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202063045795P | 2020-06-29 | 2020-06-29 | |
US63/045,795 | 2020-06-29 |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202201179A (en) | 2022-01-01 |
TWI795823B (en) | 2023-03-11 |
Family
ID=80787861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW110123573A (TWI795823B) | Electronic device and its operation method | 2020-06-29 | 2021-06-28 |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI795823B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201411413A (en) * | 2012-05-09 | 2014-03-16 | Intel Corp | Eye tracking based selective accentuation of portions of a display |
CN106415445A (en) * | 2014-06-06 | 2017-02-15 | 英特尔公司 | Technologies for viewer attention area estimation |
CN109801137A (en) * | 2019-01-18 | 2019-05-24 | 泰康保险集团股份有限公司 | Commercial articles ordering method, product ordering apparatus, storage medium and electronic equipment |
CN110785735A (en) * | 2017-07-11 | 2020-02-11 | 三星电子株式会社 | Apparatus and method for voice command scenario |
- 2021-06-28: Application TW110123573A filed in TW; granted as TWI795823B (active)
Also Published As
Publication number | Publication date |
---|---|
TW202201179A (en) | 2022-01-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10977515B2 (en) | Image retrieving apparatus, image retrieving method, and setting screen used therefor | |
EP3267362B1 (en) | Machine learning image processing | |
Cheng et al. | An image-to-class dynamic time warping approach for both 3D static and trajectory hand gesture recognition | |
Lu et al. | A video-based automated recommender (VAR) system for garments | |
Kong et al. | Interactive phrases: Semantic descriptionsfor human interaction recognition | |
US20190332872A1 (en) | Information push method, information push device and information push system | |
CN109844767A (en) | Visual search based on image analysis and prediction | |
Jaiswal et al. | An intelligent recommendation system using gaze and emotion detection | |
Alam et al. | Unified learning approach for egocentric hand gesture recognition and fingertip detection | |
US10726358B2 (en) | Identification of individuals and/or times using image analysis | |
JP2019133620A (en) | Coordination retrieval method, computer device and computer program that are based on coordination of multiple objects in image | |
CN101751648A (en) | Online try-on method based on webpage application | |
Fragkiadakis et al. | Signing as input for a dictionary query: Matching signs based on joint positions of the dominant hand | |
Mosayyebi et al. | Gender recognition in masked facial images using EfficientNet and transfer learning approach | |
KR20200140588A (en) | System and method for providing image-based service to sell and buy product | |
TWI795823B (en) | Electronic device and its operation method | |
Dewan et al. | A deep learning pipeline for Indian dance style classification | |
Jain et al. | Gestarlite: An on-device pointing finger based gestural interface for smartphones and video see-through head-mounts | |
CN108629824B (en) | Image generation method and device, electronic equipment and computer readable medium | |
Goel | Shopbot: an image based search application for e-commerce domain | |
Gavrilescu | Recognizing human gestures in videos by modeling the mutual context of body position and hands movement | |
Mao et al. | SC-YOLO: Provide Application-Level Recognition and Perception Capabilities for Smart City Industrial Cyber-Physical System | |
Roxo et al. | Is Gender “In-the-Wild” Inference Really a Solved Problem? | |
WO2023041181A1 (en) | Electronic device and method for determining human height using neural networks | |
Ren et al. | Toward three-dimensional human action recognition using a convolutional neural network with correctness-vigilant regularizer |