TWI833678B - Generative chatbot system for real multiplayer conversational and method thereof - Google Patents

Generative chatbot system for real multiplayer conversational and method thereof

Info

Publication number
TWI833678B
TWI833678B (application TW112135679A)
Authority
TW
Taiwan
Prior art keywords
message
portable device
server host
response
timing
Prior art date
Application number
TW112135679A
Other languages
Chinese (zh)
Inventor
邱全成
卞卓佳
Original Assignee
英業達股份有限公司
Filing date
Publication date
Application filed by 英業達股份有限公司 filed Critical 英業達股份有限公司
Application granted granted Critical
Publication of TWI833678B publication Critical patent/TWI833678B/en

Links

Images

Abstract

A generative chatbot system for real multi-person conversational situations, and a method thereof, are disclosed. Multiple voice signals are detected and converted into corresponding feature vectors and text messages, and a chronological tag and a classification tag are embedded in each text message, which is stored as contextual information so that the server host can determine the sequential logic of the multi-person conversation. The contextual information and the sequential logic are then sent to an artificial intelligence device, which determines the current conversation stage and evolving topics, predicts how the conversation will develop, and proactively generates at least one response message that is stored on the server host. The server host filters out an appropriate response message and sends it to a portable device for output. This mechanism helps improve conversational proactivity and response efficiency.

Description

Generative chatbot system and method in a real multi-person response situation

The present invention relates to a chatbot system and method, and in particular to a generative chatbot system and method for real multi-person response situations.

In recent years, with the spread and rapid development of artificial intelligence, applications of artificial intelligence have sprung up everywhere. Among them, chatbots have attracted the most attention.

Generally speaking, a traditional chatbot holds a one-on-one conversation with a user; that is, the chatbot responds only after the user sends a question. At present, however, no chatbot can proactively offer suitable response suggestions or prompts in a real multi-person response situation. In a multi-person conversation, for example, a traditional chatbot cannot proactively and quickly provide the user with suitable conversation suggestions. Traditional chatbots therefore suffer from poor conversational proactivity and poor response efficiency.

In summary, the prior art has long suffered from poor conversational proactivity and poor response efficiency, so improved technical means are needed to solve this problem.

The present invention discloses a generative chatbot system and method for real multi-person response situations.

First, the present invention discloses a generative chatbot system for real multi-person response situations. The system includes an artificial intelligence device, a portable device, and a server host. The artificial intelligence device receives contextual information and its corresponding sequential logic through an application programming interface (API), feeds them into a large language model (LLM) to generate response messages, and returns the response messages through the same API. The portable device includes a sensor, a speaker, a storage device, and a speech processor. The sensor continuously senses multiple voice signals; the speaker outputs feedback speech; the storage device stores multiple feature vectors corresponding to the voice signals and their corresponding text messages, each text message containing a chronological tag and a classification tag. The speech processor is electrically connected to the sensor, the speaker, and the storage device, and is configured to: convert the sensed voice signals into the corresponding feature vectors using Mel-frequency cepstral coefficients (MFCC) in order to classify the voice signals; perform speech-to-text processing to convert the voice signals into the corresponding text messages; embed the chronological tag and the classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and store the tagged text messages in the storage device as the contextual information; and, when an on-demand dialogue message is received, perform text-to-speech processing to convert the on-demand dialogue message into feedback speech for output through the speaker. The server host is connected to the artificial intelligence device and the portable device and includes a non-transitory computer-readable storage medium and a hardware processor. The non-transitory computer-readable storage medium stores multiple computer-readable instructions, and the hardware processor, electrically connected to the storage medium, executes those instructions so that the server host: continuously loads the contextual information from the storage device of the portable device and determines the sequential logic of the multi-person conversation from the chronological tags and classification tags embedded in the text messages, the sequential logic including the number of participants, the order, and the topic of the conversation; transmits the contextual information and the sequential logic to the artificial intelligence device and receives the corresponding response messages from it, storing them in a response list; and automatically selects at least one of the response messages from the response list as an on-demand dialogue message and transmits it to the portable device.

In addition, the present invention discloses a generative chatbot method for real multi-person response situations, the steps of which include: connecting a server host to an artificial intelligence device and a portable device, where the artificial intelligence device receives contextual information and its corresponding sequential logic through an API and transmits response messages; the portable device continuously senses multiple voice signals through a sensor and converts the sensed voice signals into corresponding feature vectors using Mel-frequency cepstral coefficients in order to classify the voice signals; the portable device performs speech-to-text processing to convert the voice signals into corresponding text messages; the portable device embeds a chronological tag and a classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and stores the tagged text messages in the storage device of the portable device as the contextual information; the server host continuously loads the contextual information from the storage device of the portable device and determines the sequential logic of the multi-person conversation from the embedded chronological tags and classification tags, the sequential logic including the number of participants, the order, and the topic of the conversation; the server host transmits the contextual information and the sequential logic to the artificial intelligence device, where they are fed into a large language model to generate corresponding response messages, which are then returned to the server host through the API; the server host stores the response messages in a response list, automatically selects at least one of them as an on-demand dialogue message, and transmits the on-demand dialogue message to the portable device; and, upon receiving the on-demand dialogue message, the portable device performs text-to-speech processing to convert it into feedback speech for output through the speaker.

The system and method disclosed by the present invention differ from the prior art in that multiple voice signals are detected and converted into corresponding feature vectors and text messages, and chronological tags and classification tags are embedded in the text messages, which are stored as contextual information so that the server host can determine the sequential logic of the multi-person conversation. The contextual information and the sequential logic are then transmitted to the artificial intelligence device to help it determine the current conversation stage and the evolution of topics and to predict how the conversation will develop, so that response messages are proactively generated and stored on the server host, and the server host filters out suitable response messages and transmits them to the portable device for output.

Through the above technical means, the present invention achieves the technical effect of improving conversational proactivity and response efficiency.

Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the way the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly.

First, please refer to FIG. 1, which is a system block diagram of the generative chatbot system for real multi-person response situations of the present invention. The system includes an artificial intelligence device 110, a portable device 120, and a server host 130. The artificial intelligence device 110 receives contextual information and its corresponding sequential logic through an application programming interface, feeds them into a large language model to generate response messages, and returns the response messages through the same interface. In practice, the artificial intelligence device 110 is a chatbot based on a large language model, such as a Generative Pre-trained Transformer (GPT), PaLM, Galactica, LLaMA, LaMDA, or the like, which can determine the current conversation stage and topic evolution and predict the development of the conversation from the contextual information and its corresponding sequential logic, and then use the predicted dialogue as the response message.
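A minimal sketch of how the server host could forward the contextual information and sequential logic to such a device over an API is shown below; the endpoint URL, payload fields, and response format are assumptions for illustration only and are not fixed by the disclosure.

```python
# Sketch only: the endpoint, payload schema, and reply format are hypothetical;
# any LLM service exposing an HTTP API could stand in for the AI device.
import requests

AI_DEVICE_URL = "http://ai-device.local/api/generate"  # hypothetical endpoint

def request_responses(context_messages, sequential_logic, timeout=10):
    """Send contextual information and sequential logic to the AI device
    and return the list of candidate response messages it generates."""
    payload = {
        "context": context_messages,           # tagged text messages
        "sequential_logic": sequential_logic,  # participants, order, topic
    }
    reply = requests.post(AI_DEVICE_URL, json=payload, timeout=timeout)
    reply.raise_for_status()
    return reply.json().get("responses", [])
```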

As for the portable device 120, it includes a sensor 121, a speaker 122, a storage device 123, and a speech processor 124. The sensor 121 continuously senses multiple voice signals. In practice, the sensor 121 may also sense at least one of the user's physiological state, facial expressions, and body movements to generate a user behavior message and transmit it to the server host 130, which determines the user's personality and sets the personality parameter accordingly. For example, physiological characteristics such as blood pressure, heartbeat, pulse, and blood sugar can be sensed to determine the physiological state (e.g., happy, excited, or depressed), or the face, iris, and so on can be sensed to determine facial expression and mood, so that the user's personality (e.g., extroverted, introverted, enthusiastic, or cold) can be judged from the physiological state, facial expression, and mood.

The speaker 122 outputs the feedback speech. In practice, the speaker may be an earphone, a loudspeaker, or the like. In addition, the portable device 120 may further include a display element for displaying the on-demand dialogue message in synchronization with the feedback speech output by the speaker 122. In practice, the display element may be a display, a dot-matrix LED, or the like.

The storage device 123 stores multiple feature vectors corresponding to the voice signals and their corresponding text messages, each of which contains a chronological tag and a classification tag. In practice, the storage device 123 may be a hard disk, an optical disc, flash memory, or the like. In addition, the storage device 123 treats all of the text messages embedded with chronological tags and classification tags, taken together, as the contextual information.

The speech processor 124 is electrically connected to the sensor 121, the speaker 122, and the storage device 123 and is configured to: convert the sensed voice signals into corresponding feature vectors using Mel-frequency cepstral coefficients in order to classify the voice signals; perform speech-to-text processing to convert the voice signals into corresponding text messages; embed the chronological tag and the classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and store them in the storage device 123; and, upon receiving an on-demand dialogue message, perform text-to-speech processing to convert it into feedback speech for output through the speaker, for example through a wired or wireless (Bluetooth) earphone, a loudspeaker, or the like. In practice, the speech processor 124 may be implemented with a processor dedicated to speech signals, such as a digital signal processor (DSP). In addition, the portable device 120 may further convert the user's voice signal into a feature vector using Mel-frequency cepstral coefficients and transmit it to the server host 130, which compares it with multiple preset personality feature vectors to determine the user's personality and sets the personality parameter according to the result.
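The MFCC conversion and speaker classification could be sketched as follows; this is only an illustration that assumes librosa for feature extraction and a cosine-similarity threshold, neither of which is specified by the disclosure, and a real speaker-classification stage would typically be more elaborate.

```python
# Sketch of the MFCC step (assumptions: librosa for feature extraction and a
# simple cosine-similarity rule; the disclosure names neither).
import numpy as np
import librosa

def mfcc_vector(waveform, sample_rate, n_mfcc=13):
    """Convert one voice signal into a fixed-length MFCC feature vector
    by averaging the coefficients over time."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def classify_speaker(vector, known_vectors, threshold=0.9):
    """Assign the vector to an existing speaker tag if it is similar enough,
    otherwise register it as a new speaker (A, B, C, ...)."""
    for tag, ref in known_vectors.items():
        sim = np.dot(vector, ref) / (np.linalg.norm(vector) * np.linalg.norm(ref))
        if sim >= threshold:
            return tag
    new_tag = chr(ord("A") + len(known_vectors))
    known_vectors[new_tag] = vector
    return new_tag
```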

Next, the server host 130 is connected to the artificial intelligence device 110 and the portable device 120 and includes a non-transitory computer-readable storage medium 131 and a hardware processor 132. The non-transitory computer-readable storage medium stores multiple computer-readable instructions. In practice, the computer-readable instructions are executed by the server host 130; the computer-readable instructions that carry out the operations of the present invention may be assembly instructions, instruction-set-architecture instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, or source code or object code written in any combination of one or more programming languages, including object-oriented languages such as Common Lisp, Python, C++, Objective-C, Smalltalk, Delphi, Java, Swift, C#, Perl, Ruby, and PHP, as well as conventional procedural languages such as C or similar languages.

The hardware processor 132 is electrically connected to the non-transitory computer-readable storage medium 131 and executes the computer-readable instructions so that the server host 130: continuously loads the contextual information from the storage device 123 of the portable device 120 and determines the sequential logic of the multi-person conversation from the embedded chronological tags and classification tags, the sequential logic including the number of participants, the order, and the topic of the conversation; transmits the contextual information and the sequential logic to the artificial intelligence device 110 and receives the corresponding response messages from it, storing them in a response list; and automatically selects at least one of the response messages from the response list as the on-demand dialogue message and transmits it to the portable device 120. In practice, the hardware processor 132 may be a central processing unit, a microprocessor, or the like. Taking the sequential logic of a multi-person conversation as an example, the number of participants can be determined from the number of distinct classification tags, the order of the conversation from the chronological tags, and the topic from the content of the contextual information, for example by combining the time with keywords: if the time is noon and the keyword is "what to eat", the topic can be judged to be "lunch discussion". In addition, the on-demand dialogue message may be obtained by randomly filtering, from the response list, a response message that matches a personality parameter, and the personality parameter can be set by connecting the portable device 120 to the server host 130.
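One way the server host could derive the sequential logic from the tagged context is sketched below; the message fields and the keyword-to-topic table are illustrative assumptions, since the disclosure only requires that participants, order, and topic be inferred from the tags and content.

```python
# Sketch of deriving the sequential logic from tagged context messages
# (assumptions: each message is a dict with "time", "speaker", and "text";
# the keyword-to-topic table is illustrative only).
TOPIC_KEYWORDS = {
    "lunch discussion": ["eat", "lunch", "hungry"],   # hypothetical table
    "drinks discussion": ["coffee", "tea", "drink"],
}

def sequential_logic(context_messages):
    """Return the number of participants, conversation order, and topic."""
    ordered = sorted(context_messages, key=lambda m: m["time"])
    participants = sorted({m["speaker"] for m in ordered})
    topic = "unknown"
    for name, words in TOPIC_KEYWORDS.items():
        if any(w in m["text"].lower() for m in ordered for w in words):
            topic = name
            break
    return {
        "participants": participants,              # e.g. ["A", "B", "C"]
        "order": [m["speaker"] for m in ordered],  # conversation sequence
        "topic": topic,
    }
```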

It should be noted that, in practice, the present invention may be implemented partly or entirely in hardware; for example, one or more components of the system may be implemented by a hardware processor such as an integrated circuit chip, a system on chip (SoC), a complex programmable logic device (CPLD), or a field-programmable gate array (FPGA). The non-transitory computer-readable storage medium of the present invention carries computer-readable instructions (or computer program instructions) for causing a processor to implement various aspects of the invention, and may be a tangible device that can hold and store instructions for use by an instruction-execution device. The non-transitory computer-readable storage medium may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, or semiconductor storage device, or any suitable combination thereof. More specific (non-exhaustive) examples include a hard disk, random-access memory, read-only memory, flash memory, an optical disc, a floppy disk, and any suitable combination thereof. A non-transitory computer-readable storage medium as used here is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, an optical signal through a fiber-optic cable), or an electrical signal transmitted through a wire. In addition, the computer-readable instructions described here can be downloaded from the non-transitory computer-readable storage medium to the respective computing/processing devices, or downloaded to an external computer or external storage device via a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, hubs, and/or gateways. A network card or network interface in each computing/processing device receives the computer-readable instructions from the network and forwards them for storage in the non-transitory computer-readable storage medium of that computing/processing device.

Please refer to FIG. 2A and FIG. 2B, which are flowcharts of the generative chatbot method for real multi-person response situations of the present invention. The steps include: connecting the server host 130 to the artificial intelligence device 110 and the portable device 120, where the artificial intelligence device 110 receives contextual information and its corresponding sequential logic through an application programming interface and transmits response messages (step 210); the portable device 120 continuously senses multiple voice signals through the sensor and converts the sensed voice signals into corresponding feature vectors using Mel-frequency cepstral coefficients in order to classify the voice signals (step 220); the portable device 120 performs speech-to-text processing to convert the voice signals into corresponding text messages (step 230); the portable device 120 embeds the chronological tag and the classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and stores them in the storage device 123 of the portable device 120 as the contextual information (step 240); the server host 130 continuously loads the contextual information from the storage device 123 of the portable device 120 and determines the sequential logic of the multi-person conversation from the embedded chronological tags and classification tags, the sequential logic including the number of participants, the order, and the topic of the conversation (step 250); the server host 130 transmits the contextual information and the sequential logic to the artificial intelligence device 110, where they are fed into the large language model of the artificial intelligence device 110 to generate corresponding response messages, which are returned to the server host 130 through the application programming interface (step 260); the server host 130 stores the response messages in a response list, automatically selects at least one of them as the on-demand dialogue message, and transmits the on-demand dialogue message to the portable device 120 (step 270); and, upon receiving the on-demand dialogue message, the portable device 120 performs text-to-speech processing to convert it into feedback speech for output through the speaker 122 (step 280). Through these steps, multiple voice signals are detected and converted into corresponding feature vectors and text messages, chronological tags and classification tags are embedded in the text messages and stored as contextual information so that the server host 130 can determine the sequential logic of the multi-person conversation, and the contextual information and the sequential logic are transmitted to the artificial intelligence device 110 to help it determine the current conversation stage and topic evolution and predict the development of the conversation, so that response messages are proactively generated and stored on the server host 130, which filters out suitable response messages and transmits them to the portable device 120 for output.

The following description uses an embodiment with reference to FIG. 3 to FIG. 5. As shown in FIG. 3, FIG. 3 is a schematic diagram of a portable device to which the present invention is applied. In practice, the portable device 120 may be a portable device with a sound-recording function, such as a smartphone 300, a voice recorder, or a personal digital assistant (PDA), which continuously senses human-voice signals through a sound-capable sensor such as a microphone and converts the sensed voice signals into corresponding feature vectors using MFCC in order to classify the voice signals. Taking the smartphone 300 as an example, suppose multiple voice signals are captured through the microphone 310 and three kinds of feature vectors result from the conversion; this indicates that three people may be in the conversation. The smartphone 300 then performs STT processing to convert each voice signal into a corresponding text message and embeds the corresponding chronological tag and classification tag based on the timing and the classification, where the chronological tag may contain the time, the date, and so on, and the classification tag may contain at least one of letters, numbers, and symbols to identify different participants, for example "A" for the first person, "B" for the second person, and so on, or "U01" for the first person, "U02" for the second person, and so on. In particular, as long as the microphone 310 keeps capturing sound, the smartphone 300 keeps converting it into corresponding feature vectors and text messages, embeds a chronological tag and a classification tag in each text message, and treats all text messages stored in the storage device 123, or those within a specified period (for example, the last 30 minutes), as the contextual information. In this way, the server host 130 can continuously load the contextual information from the storage device 123 and determine from it the sequential logic of the multi-person conversation, including the number of participants, the order, and the topic. The number of participants can be determined from the number of distinct classification tags (three classification tags, for example, indicate three people); the order can be determined from the chronological tags; and the topic can be determined from the content of the contextual information, based on the frequency or timing of keywords or terms — for example, if several food- or dining-related terms are mentioned at noon, the topic can be judged to be a lunch discussion.
In practice, the contextual information may be displayed chronologically on the display element 301 of the smartphone 300, as shown in FIG. 3, and the smartphone 300 may output the feedback speech through the Bluetooth earphone 320.
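A compact sketch of how the portable device could keep such a tagged, time-windowed context store is given below; the entry fields and the 30-minute window are illustrative, matching the example above rather than any mandated format.

```python
# Sketch of maintaining the tagged context window on the portable device
# (assumptions: the entry fields and the 30-minute window are illustrative).
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)

class ContextStore:
    def __init__(self):
        self.entries = deque()

    def add(self, speaker_tag, text, when=None):
        """Store one transcribed utterance with its chronological and
        classification tags, dropping entries older than the window."""
        when = when or datetime.now()
        self.entries.append({"time": when, "speaker": speaker_tag, "text": text})
        while self.entries and when - self.entries[0]["time"] > WINDOW:
            self.entries.popleft()

    def context(self):
        """Return the current contextual information as a list of entries."""
        return list(self.entries)
```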

As shown in FIG. 4, FIG. 4 is a schematic diagram of the contextual information and the sequential logic of the present invention. In practice, the contextual information 410 may contain chronological tags, classification tags, and text messages. On the basis of this contextual information 410, the server host 130 can determine the order of the conversation (that is, the conversation sequence) from the chronological tags and may assign unique serial numbers as separators, for example recording "01 -> 02 -> 03" to represent the order of the text messages; it can determine the number of participants from the classification tags, for example three classifications "A", "B", and "C" indicate three participants; and it can determine from the keyword "lunch" and the time (for example, around noon) that the topic of the conversation is "discussing lunch". The server host 130 can then generate the corresponding sequential logic 420 from these determinations. It should be noted that, in practice, besides presenting the contextual information 410 and the sequential logic 420 separately as in the example above, the two may also be integrated, as in the contextual information and sequential logic 430 shown in FIG. 4. In addition, when they are transmitted to the artificial intelligence device 110 to obtain corresponding response messages, a classification of the classification tags can be specified so that response messages suitable for that participant are generated. For example, to obtain response messages suitable for "A", the request "please generate a message for A to respond with" can be added when the contextual information and the sequential logic are transmitted. The artificial intelligence device 110 then returns at least one corresponding response message to the server host 130, based on the contextual information and the sequential logic, for storage in the response list, and under such a requirement may return only the response messages that satisfy it. Even when the conversation topic changes, a specified topic can be requested by instruction to achieve cross-topic responses; for example, when the contextual information and the sequential logic are transmitted, the request "generate a message for A to respond with, under the premise that the conversation topic is M" can also be added, where M represents a different conversation topic such as discussing lunch or discussing drinks, so that the user can designate a conversation topic for prompts and responses. In practice, such requests can be entered or set through the portable device 120, for example by voice input or by typing text, numbers, symbols, and so on.
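The request sent to the AI device could be composed as sketched below; the prompt wording and field names are illustrative assumptions, since the disclosure only requires that a target participant and, optionally, a topic M be specified.

```python
# Sketch of composing the request sent alongside the context and sequential
# logic (assumption: prompt wording and field names are illustrative only).
def build_request(context_messages, logic, target_speaker, topic=None):
    """Compose the textual request; `topic` overrides the detected topic
    to ask for a cross-topic response."""
    lines = [f'[{m["time"]}] {m["speaker"]}: {m["text"]}' for m in context_messages]
    return (
        "Conversation so far:\n" + "\n".join(lines) + "\n"
        f"Participants: {', '.join(logic['participants'])}. "
        f"Topic: {topic or logic['topic']}.\n"
        f"Please generate a message for {target_speaker} to respond with."
    )
```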

As shown in FIG. 5, FIG. 5 is a schematic diagram of proactively filtering response messages from the response list according to the present invention. Suppose several response messages already exist in the response list 500; the server host 130 can filter out a response message that matches the personality parameter to serve as the on-demand dialogue message. For example, if the personality parameter is set to "cold", then when selecting a response message the server host 130 excludes response messages that extend or lead the conversation (for example, those containing a question mark). In this example, "I want to eat chicken chop rice" is selected as the on-demand dialogue message and transmitted to the portable device 120 to be converted into feedback speech, which is then output through the speaker 122 of the portable device 120, for example, as shown in FIG. 3, through the Bluetooth earphone 320 connected to the smartphone 300. In practice, the personality parameter may be set by the user, set by the server host 130 after judging the user's personality from the user behavior message sensed by the portable device 120, or set after the server host 130 compares multiple preset personality feature vectors with the user's feature vector to determine the user's personality. For example, the feature vector of a deep voice can be regarded as the personality feature vector representing "cold" and the feature vector of a high-pitched voice as the personality feature vector representing "enthusiastic"; when the feature vector of the voice of the user of the portable device 120 matches the personality feature vector representing "cold", the server host 130 can set the personality parameter to "cold".
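The personality-based selection could be sketched as follows; the "contains a question mark" rule for the "cold" setting is the only example given above, and any other personality rule shown here would be a hypothetical extension.

```python
# Sketch of the personality-based filter over the response list (assumption:
# the question-mark rule for "cold" follows the example above; other
# personality rules would be hypothetical).
import random

def select_on_demand(response_list, personality="cold"):
    """Randomly pick one on-demand dialogue message that matches the
    personality parameter; 'cold' excludes replies that extend the dialogue."""
    if personality == "cold":
        candidates = [r for r in response_list if "?" not in r and "？" not in r]
    else:
        candidates = list(response_list)
    return random.choice(candidates) if candidates else None

# Example: with ["What do you want to eat?", "I want to eat chicken chop rice"]
# and personality "cold", only the second message can be selected.
```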

In summary, the difference between the present invention and the prior art is that multiple voice signals are detected and converted into corresponding feature vectors and text messages, chronological tags and classification tags are embedded in the text messages and stored as contextual information so that the server host can determine the sequential logic of the multi-person conversation, and the contextual information and the sequential logic are then transmitted to the artificial intelligence device to help it determine the current conversation stage and topic evolution and predict the development of the conversation, so that response messages are proactively generated and stored on the server host, which filters out suitable response messages and transmits them to the portable device for output. This technical means solves the problems of the prior art and achieves the technical effect of improving conversational proactivity and response efficiency.

Although the present invention has been disclosed through the foregoing embodiments, they are not intended to limit the invention. Anyone skilled in the relevant art may make some changes and refinements without departing from the spirit and scope of the invention, so the scope of patent protection of the invention shall be defined by the claims appended to this specification.

110: Artificial intelligence device
120: Portable device
121: Sensor
122: Speaker
123: Storage device
124: Speech processor
130: Server host
131: Non-transitory computer-readable storage medium
132: Hardware processor
300: Smartphone
301: Display element
310: Microphone
320: Bluetooth earphone
410: Contextual information
420: Sequential logic
430: Contextual information and sequential logic
500: Response list
Step 210: Connect a server host to an artificial intelligence device and a portable device, where the artificial intelligence device receives contextual information and its corresponding sequential logic through an application programming interface (API) and transmits at least one response message
Step 220: The portable device continuously senses multiple voice signals through at least one sensor and converts the sensed voice signals into the corresponding feature vectors using Mel-Frequency Cepstral Coefficients (MFCC) in order to classify the voice signals
Step 230: The portable device performs speech-to-text processing to convert the voice signals into corresponding text messages
Step 240: The portable device embeds the chronological tag and the classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and stores them in a storage device of the portable device as the contextual information
Step 250: The server host continuously loads the contextual information from the storage device of the portable device and determines the sequential logic of the multi-person conversation from the embedded chronological tags and classification tags, the sequential logic including the number of participants, the order, and the topic of the conversation
Step 260: The server host transmits the contextual information and the sequential logic to the artificial intelligence device, where they are input into a large language model (LLM) to generate the corresponding response messages, which are returned to the server host through the application programming interface
Step 270: The server host stores the response messages in a response list, automatically selects at least one of them as an on-demand dialogue message, and transmits the on-demand dialogue message to the portable device
Step 280: Upon receiving the on-demand dialogue message, the portable device performs text-to-speech processing to convert it into feedback speech for output through the speaker

FIG. 1 is a system block diagram of the generative chatbot system for real multi-person response situations of the present invention.
FIG. 2A and FIG. 2B are flowcharts of the generative chatbot method for real multi-person response situations of the present invention.
FIG. 3 is a schematic diagram of a portable device to which the present invention is applied.
FIG. 4 is a schematic diagram of the contextual information and sequential logic of the present invention.
FIG. 5 is a schematic diagram of proactively filtering response messages from the response list according to the present invention.

110: Artificial intelligence device
120: Portable device
121: Sensor
122: Speaker
123: Storage device
124: Speech processor
130: Server host
131: Non-transitory computer-readable storage medium
132: Hardware processor

Claims (10)

1. A generative chatbot system for real multi-person response situations, the system comprising: an artificial intelligence device for receiving contextual information and its corresponding sequential logic through an application programming interface (API), inputting them into a large language model (LLM) to generate at least one response message, and transmitting the response message through the API; a portable device comprising: at least one sensor for continuously sensing multiple voice signals; a speaker for outputting feedback speech; a storage device for storing multiple feature vectors corresponding to the voice signals and their corresponding text messages, each text message containing a chronological tag and a classification tag; and a speech processor electrically connected to the sensor, the speaker, and the storage device, the speech processor being configured to: convert the sensed voice signals into the corresponding feature vectors using Mel-Frequency Cepstral Coefficients (MFCC) in order to classify the voice signals; perform speech-to-text processing to convert the voice signals into the corresponding text messages; embed the chronological tag and the classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and store them in the storage device as the contextual information; and, when an on-demand dialogue message is received, perform text-to-speech processing to convert the on-demand dialogue message into the feedback speech for output through the speaker; and a server host connected to the artificial intelligence device and the portable device, the server host comprising: a non-transitory computer-readable storage medium for storing multiple computer-readable instructions; and a hardware processor electrically connected to the non-transitory computer-readable storage medium for executing the computer-readable instructions so that the server host: continuously loads the contextual information from the storage device of the portable device and determines the sequential logic of the multi-person conversation from the chronological tags and classification tags embedded therein, the sequential logic including the number of participants, the order, and the topic of the conversation; transmits the contextual information and the sequential logic to the artificial intelligence device and receives the corresponding response messages from it for storage in a response list; and automatically selects at least one of the response messages from the response list as the on-demand dialogue message and transmits the on-demand dialogue message to the portable device.

2. The generative chatbot system for real multi-person response situations of claim 1, wherein the on-demand dialogue message is a response message randomly filtered from the response list that matches a personality parameter, the personality parameter being settable by connecting the portable device to the server host.

3. The generative chatbot system for real multi-person response situations of claim 1, wherein the portable device further comprises a display element for displaying the on-demand dialogue message in synchronization with the feedback speech output by the speaker.

4. The generative chatbot system for real multi-person response situations of claim 2, wherein the sensor further senses at least one of the user's physiological state, facial expressions, and body movements to generate a user behavior message, and the portable device transmits the user behavior message to the server host, which determines the user's personality to set the personality parameter.

5. The generative chatbot system for real multi-person response situations of claim 2, wherein the portable device further converts the user's voice signal into the feature vector using Mel frequency cepstral coefficients and transmits it to the server host, which compares it with multiple preset personality feature vectors to determine the user's personality and sets the personality parameter according to the determination result.

6. A generative chatbot method for real multi-person response situations, the steps comprising: connecting a server host to an artificial intelligence device and a portable device, wherein the artificial intelligence device receives contextual information and its corresponding sequential logic through an application programming interface (API) and transmits at least one response message; the portable device continuously senses multiple voice signals through at least one sensor and converts the sensed voice signals into corresponding feature vectors using Mel-Frequency Cepstral Coefficients (MFCC) in order to classify the voice signals; the portable device performs speech-to-text processing to convert the voice signals into corresponding text messages; the portable device embeds a chronological tag and a classification tag in the text message corresponding to each voice signal based on the chronological relationship and the classification result, and stores them in a storage device of the portable device as the contextual information; the server host continuously loads the contextual information from the storage device of the portable device and determines the sequential logic of the multi-person conversation from the embedded chronological tags and classification tags, the sequential logic including the number of participants, the order, and the topic of the conversation; the server host transmits the contextual information and the sequential logic to the artificial intelligence device, where they are input into a large language model (LLM) to generate the corresponding response messages, which are returned to the server host through the API; the server host stores the response messages in a response list, automatically selects at least one of them as an on-demand dialogue message, and transmits the on-demand dialogue message to the portable device; and, upon receiving the on-demand dialogue message, the portable device performs text-to-speech processing to convert the on-demand dialogue message into feedback speech for output through a speaker.

7. The generative chatbot method for real multi-person response situations of claim 6, wherein the on-demand dialogue message is a response message randomly filtered from the response list that matches a personality parameter, the personality parameter being settable by connecting the portable device to the server host.

8. The generative chatbot method for real multi-person response situations of claim 6, wherein the portable device further comprises a display element for displaying the on-demand dialogue message in synchronization with the feedback speech output by the speaker.

9. The generative chatbot method for real multi-person response situations of claim 7, wherein the sensor further senses at least one of the user's physiological state, facial expressions, and body movements to generate a user behavior message, and the portable device transmits the user behavior message to the server host, which determines the user's personality to set the personality parameter.

10. The generative chatbot method for real multi-person response situations of claim 7, wherein the portable device further converts the user's voice signal into the feature vector using Mel frequency cepstral coefficients and transmits it to the server host, which compares it with multiple preset personality feature vectors to determine the user's personality and sets the personality parameter according to the determination result.
TW112135679A 2023-09-19 Generative chatbot system for real multiplayer conversational and method thereof TWI833678B (en)

Publications (1)

Publication Number Publication Date
TWI833678B (en) 2024-02-21


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230252241A1 (en) 2019-07-22 2023-08-10 Capital One Services, Llc Multi-turn dialogue response generation with persona modeling

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230252241A1 (en) 2019-07-22 2023-08-10 Capital One Services, Llc Multi-turn dialogue response generation with persona modeling

Similar Documents

Publication Publication Date Title
US10586539B2 (en) In-call virtual assistant
KR102509464B1 (en) Utterance classifier
US10970492B2 (en) IoT-based call assistant device
US11508378B2 (en) Electronic device and method for controlling the same
WO2017200072A1 (en) Dialog method, dialog system, dialog device, and program
JP2013509652A (en) System and method for tactile enhancement of speech-to-text conversion
JPWO2017200080A1 (en) Dialogue method, dialogue apparatus, and program
US20230046658A1 (en) Synthesized speech audio data generated on behalf of human participant in conversation
CN111542814A (en) Method, computer device and computer readable storage medium for changing responses to provide rich-representation natural language dialog
US20210125610A1 (en) Ai-driven personal assistant with adaptive response generation
US11830502B2 (en) Electronic device and method for controlling the same
CN111556999B (en) Method, computer device and computer readable storage medium for providing natural language dialogue by providing substantive answer in real time
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
JP7063230B2 (en) Communication device and control program for communication device
CN111557001B (en) Method for providing natural language dialogue, computer device and computer readable storage medium
CN109074809A (en) Information processing equipment, information processing method and program
TWI833678B (en) Generative chatbot system for real multiplayer conversational and method thereof
JP2021117371A (en) Information processor, information processing method and information processing program
JP2021113835A (en) Voice processing device and voice processing method
Mruthyunjaya et al. Human-Augmented robotic intelligence (HARI) for human-robot interaction
Fujii et al. Open source system integration towards natural interaction with robots
CN117171323A (en) System and method for generating chat robot under real multi-person response situation
Lin et al. Nonverbal acoustic communication in human-computer interaction
US11657814B2 (en) Techniques for dynamic auditory phrase completion
TWI838316B (en) Generative chatbot system for virtual community and method thereof