TWI839285B - Image-to-speech assistive device for the visually impaired - Google Patents
Image-to-speech assistive device for the visually impaired
- Publication number: TWI839285B
- Application number: TW112129404A
- Authority
- TW
- Taiwan
Abstract
The present invention relates to an image-to-speech assistive device for the visually impaired. Its main structure comprises a head-mounted device with a camera element, on which a first wireless communication element, an image recognition module, an article generation module, a speech conversion module, and a playback device are provided, together with a guide cane having a first control unit for triggering the camera element and a second wireless communication element wirelessly linked to the first wireless communication element. A visually impaired user simply presses the first control unit on the guide cane to trigger the camera element and photograph the scene ahead; object recognition technology and ChatGPT's AI natural language processing then convert the image information into text, which is output as speech through the playback device. The device is thus easy to operate, recognizes objects quickly, and automatically narrates the scene in front of the user.
Description
The present invention provides an image-to-speech assistive device for the visually impaired whose operation is simple and unobtrusive, and which quickly identifies objects and automatically narrates the scene ahead.
By way of background, visually impaired persons are those with a partial or complete impairment in the structure or function of the visual organs. They cannot, or can only with difficulty, identify external objects, and often need visual assistive devices to carry out everyday activities.
With recent advances in technology, many functions realized through digital techniques can be applied to assistive devices to give users, disabled or otherwise, far more help. To date, however, few assistive devices have been digitized, and only a handful incorporate image recognition. Complex artificial intelligence computation requires relatively large processors and huge databases; such systems are not portable, and because AI response times lag behind user needs when processing large volumes of image data, visually impaired persons still lack a friendly, convenient assistive device for going out.
Taking Republic of China patent No. M584676, "Digital Assistive Device", as an example, the following problems and shortcomings remain to be addressed in use:
First, its camera is integrated into a wearable accessory and continuously recognizes the scene ahead with voice feedback. The user, however, does not always need voice assistance and cannot control when the camera operates; the constant stream of audio clutters the user's hearing and can even mask surrounding sounds, which is actually more dangerous.
Second, its camera is a high-resolution lens shooting video continuously, which places a heavy load on the computing device and slows voice feedback.
Third, its speech output is based on Tesseract and TTS (text-to-speech), which analyze the semantics arising from text layout and convert it into speech. Neither technology uses AI natural language processing; they can only stitch together pre-recorded words, so the generated speech sounds stilted.
Accordingly, how to solve the above conventional problems and shortcomings is precisely the direction the applicant of this invention and manufacturers in this field are eager to research and improve.
In view of these shortcomings, the applicant collected relevant information, evaluated it from many angles, and, drawing on years of experience in this industry and on repeated prototyping and modification, designed the present image-to-speech assistive device for the visually impaired, whose operation is simple and unobtrusive and which quickly identifies objects and automatically narrates the scene ahead.
The primary objective of the present invention is as follows: the article generation module is ChatGPT, a language model built from deep neural networks, an upgraded chatbot driven by AI natural language processing that can learn and improve on its own and converse in a relatively natural, fluent manner. Combined with the image recognition module, it can produce fluent sentences describing the content of the image information, making it easy for the user to understand.
Another main objective of the present invention: driving the camera element on the head-mounted device from the first control unit on the guide cane means a recognition request is issued only when needed, making operation simpler and more discreet.
To achieve the above objectives, the main structure of the present invention includes a head-mounted device with a camera element, a first wireless communication element, an image recognition module, an article generation module, a speech conversion module, and a playback device, together with a guide cane having a first control unit and a second wireless communication element. The camera element is mounted on the head-mounted device with its shooting direction aligned with the wearer's line of sight, so as to capture image information. The first wireless communication element is mounted on the head-mounted device and electrically connected to the camera element; the first control unit drives the camera element, and the second wireless communication element is electrically connected to the first control unit and wirelessly linked to the first wireless communication element. The image recognition module is housed in the head-mounted device and electrically connected to the camera element so as to recognize a plurality of object items in the image information. The article generation module, also housed in the head-mounted device and electrically connected to the image recognition module, has an image description database and generates fluent text from the object information; it is ChatGPT, a language model built from deep neural networks. The speech conversion module, housed in the head-mounted device and electrically connected to the article generation module, converts the text into a voice message, and the playback device is mounted on the head-mounted device and electrically connected to the speech conversion module.
When the user employs the present invention as a visually impaired assistive device, he or she simply wears the head-mounted device, holds the guide cane, and pairs the two wirelessly through the first and second wireless communication elements; recognition and conversion are then available at any time. When the user stops at a spot and wants to know what lies ahead, pressing the first control unit on the guide cane triggers the camera element to capture a single still image. The image recognition module automatically identifies the objects in the image; the article generation module, drawing on the image description database, integrates the image and object information into fluent text; and the speech conversion module finally converts the text into a voice message that the playback device plays to the user. A simple, discreet operation thus compensates for the user's limited visual perception of the surroundings.
Through the above techniques, the invention overcomes the drawbacks of conventional digital assistive devices: recognition timing that cannot be controlled, cluttered voice messages that interfere with normal hearing, dynamic image recognition that is burdensome and inefficient, and stilted speech output produced without artificial intelligence, thereby achieving the practical advantages stated above.
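The capture, recognize, describe, and speak flow summarized above can be sketched as a simple pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: `capture`, `detect_objects`, `generate_description`, and `synthesize_speech` are hypothetical stand-ins for the camera element, the YOLO-style recognizer, the ChatGPT-based article generation module, and the TTS module.

```python
from typing import Callable, List

def run_pipeline(
    capture: Callable[[], bytes],                      # camera element: one still image
    detect_objects: Callable[[bytes], List[str]],      # image recognition module
    generate_description: Callable[[List[str]], str],  # article generation module
    synthesize_speech: Callable[[str], bytes],         # speech conversion module
) -> bytes:
    """One button press on the cane triggers this whole chain."""
    image = capture()
    items = detect_objects(image)
    text = generate_description(items)
    return synthesize_speech(text)

# Toy stand-ins so the flow can be exercised without real hardware or models:
audio = run_pipeline(
    capture=lambda: b"<jpeg bytes>",
    detect_objects=lambda img: ["beach", "pavilion", "mountain"],
    generate_description=lambda items: "The scene shows a " + ", a ".join(items) + ".",
    synthesize_speech=lambda text: text.encode("utf-8"),  # placeholder for real TTS
)
```

Keeping each stage behind a plain function boundary mirrors the patent's module structure: each module is electrically connected to the next and only the text flows between them.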
1: Head-mounted device
11: Camera element
111: First wireless communication element
12: Image recognition module
13: Article generation module
131: Image description database
132: Real-scene map database
133: Known-person database
14: Speech conversion module
15: Playback device
16: Sound pickup element
17: Positioning element
18: Facial recognition module
2: Guide cane
21: First control unit
22: Second wireless communication element
23: Second control unit
Figure 1 is a perspective view of a preferred embodiment of the present invention.
Figure 2 is a structural block diagram of a preferred embodiment of the present invention.
Figure 3 is a schematic diagram of photographing in a preferred embodiment of the present invention.
Figure 4 is a schematic diagram of image recognition in a preferred embodiment of the present invention.
Figure 5 is a schematic diagram of voice playback in a preferred embodiment of the present invention.
Figure 6 is an implementation schematic of a further preferred embodiment of the present invention.
Figure 7 is a structural block diagram of yet another preferred embodiment of the present invention.
Figure 8 is a schematic diagram of fine positioning in yet another preferred embodiment of the present invention.
Figure 9 is a structural block diagram of another preferred embodiment of the present invention.
Figure 10 is a schematic diagram of person recognition in another preferred embodiment of the present invention.
To achieve the above objectives and effects, the technical means and construction adopted by the present invention are illustrated and described in detail below with reference to the preferred embodiments, so that their features and functions may be fully understood.
Please refer to Figures 1 and 2, a perspective view and a structural block diagram of a preferred embodiment of the present invention, from which it can be clearly seen that the invention includes:
a head-mounted device 1, which is one of a pair of glasses, sunglasses, or goggles;
a camera element 11, mounted on the head-mounted device 1 with its shooting direction aligned with the wearer's line of sight so as to capture image information; in this embodiment the camera element 11 is exemplified by a lens;
a first wireless communication element 111, mounted on the head-mounted device 1 and electrically connected to the camera element 11;
a guide cane 2, having a first control unit 21 for driving the camera element 11 and a second wireless communication element 22 electrically connected to the first control unit 21 and wirelessly linked to the first wireless communication element 111; in this embodiment the first control unit 21 is exemplified by a push switch, and the first and second wireless communication elements 111, 22 by a Bluetooth link;
an image recognition module 12, housed in the head-mounted device 1 and electrically connected to the camera element 11 so as to recognize a plurality of object items in the image information; the image recognition module 12 performs object detection, exemplified here by software such as YOLO or AI DIY Platform;
an article generation module 13, housed in the head-mounted device 1, electrically connected to the image recognition module 12, and provided with an image description database 131, so as to generate fluent text from the image information and the object information; the article generation module 13 is ChatGPT, a language model built from deep neural networks;
a speech conversion module 14, housed in the head-mounted device 1 and electrically connected to the article generation module 13 to convert the text into a voice message, exemplified here by text-to-speech (TTS) conversion; and
a playback device 15, mounted on the head-mounted device 1 and electrically connected to the speech conversion module 14, exemplified here by a speaker.
The structure of the present technology can be understood from the above description. Through the cooperation of these components, the advantages of simple, discreet operation and rapid object identification with automatic spoken narration of the scene can be achieved, as explained in detail below.
Referring now to Figures 1 through 5, the perspective view through the voice playback schematic of the preferred embodiment: with the components assembled as above, the physical equipment of the invention consists only of the head-mounted device 1 and the guide cane 2, the same equipment a visually impaired person ordinarily carries. The user therefore simply wears the head-mounted device 1 and holds the guide cane 2, with no outward difference in appearance; after Bluetooth pairing between the head-mounted device 1 and the guide cane 2 through the first wireless communication element 111 and the second wireless communication element 22, recognition and conversion are available at any time.
When the user stops at a spot and wants to know what lies ahead (for example, a painting), he or she simply presses the first control unit 21 on the guide cane 2 to drive the camera element 11 and capture a single still image. For the user this is just a flick of a finger on the hand holding the cane, simple and discreet, invisible to bystanders. The image recognition module 12 then automatically identifies the objects in the image. Its recognition principle comprises several parts: image scanning, object feature recognition, and an output interface. Specifically, the camera element 11 first scans the input into one or more images; the image recognition module 12 then detects the physical appearance features of the objects in each image and matches the result against a cloud-based object feature database to select the corresponding object names, which are passed to the article generation module 13 as object information, for example: a beach, a pavilion, a mountain, three birds, two boats.
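The object-information step described above (collapsing raw detections into counted items such as "three birds, two boats") can be sketched as follows. The label list is a hypothetical output of a YOLO-style detector, not the device's actual interface, and the pluralization is deliberately naive.

```python
from collections import Counter

NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

def detections_to_items(labels):
    """Group repeated detector labels into counted item strings,
    e.g. ['bird', 'bird', 'bird', 'boat', 'boat'] -> ['three birds', 'two boats']."""
    counts = Counter(labels)  # preserves first-encounter order (Python 3.7+)
    items = []
    for label, n in counts.items():
        word = NUMBER_WORDS.get(n, str(n))
        items.append(f"{word} {label}" + ("s" if n > 1 else ""))
    return items

items = detections_to_items(["bird", "bird", "bird", "boat", "boat", "pavilion"])
# items -> ['three birds', 'two boats', 'one pavilion']
```

The counted strings are exactly the kind of object information the text passes to the article generation module 13.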
The article generation module 13 then combines the data in the image description database 131 and integrates all the object information into fluent text. The module is ChatGPT, a language model built from deep neural networks; GPT (generative pre-trained transformer) is a large language model and an important framework of generative AI. It can accept text and image input, interact in natural human dialogue, and handle quite complex language tasks, including automatic text generation, question answering, and summarization. Given some text or images, the article generation module 13 can therefore automatically compose a fluent passage. The GPT-4 model released in 2023 can further be adapted to specific tasks and subject areas to form a more targeted system; for example, incorporating the image description database 131 of the present invention into the training model makes the article generation module 13 better at generating text that describes the content of image information. In this embodiment the image is a painting, and the module integrates the object information provided by the image recognition module 12 to generate text such as: "The picture is a landscape painting whose scene is a beach; at the end of the beach and the sea stands a mountain, a pavilion sits on the beach, two sailboats are on the sea, and three birds are in the sky."
Finally, the speech conversion module 14 converts the text into a voice message, which the playback device 15 plays to the user. The conversion principle involves two steps: text processing and speech synthesis. First, text processing performs linguistic analysis, determines word boundaries, and splits sentences, which improves the quality and fluency of the synthesized speech. Second, speech synthesis generates the speech signal from the processed text using specific algorithms and rules, or by concatenating many pre-recorded speech segments stored in a database. Because the input is a single photograph, the input conditions are simple, so the image recognition module 12 and article generation module 13 run faster with a lower load and complete the recognition and conversion more quickly; for the user, a simple, discreet operation compensates for limited visual perception of the surroundings.
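The first step of the conversion principle above, text processing with sentence segmentation, can be sketched like this. The punctuation set is an assumption; a real system would hand each resulting segment to a TTS engine for synthesis.

```python
import re

def segment_for_tts(text):
    """Split generated text at clause/sentence punctuation so the
    synthesizer can insert natural pauses between segments."""
    parts = re.split(r"[,.;，、。；]\s*", text)
    return [p for p in parts if p]  # drop empty trailing fields

segments = segment_for_tts(
    "The beach is below, a pavilion sits on it. Two boats are at sea."
)
```

Segmenting before synthesis is what gives concatenative or rule-based TTS its pause structure; without it the output runs together and sounds stilted, the very flaw the background section criticizes.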
Referring also to Figure 6, an implementation schematic of a further preferred embodiment: this embodiment differs from the one above only in that the head-mounted device 1 carries a sound pickup element 16 electrically connected to the article generation module 13, which receives the user's voice commands so that the module can adjust the content of its text, and the guide cane 2 carries a second control unit 23 for driving the sound pickup element 16; the sound pickup element 16 is a miniature microphone and the second control unit 23 a push switch beside the first control unit 21. If the user finds the generated text unclear, he or she can press the second control unit 23 to activate the sound pickup element 16 and dictate a voice command giving the article generation module 13 more specific conditions, such as "What is the spatial relationship between those things?" The module then regenerates the text from the positional relationships of the objects in the image: "The beach is at the lower right of the scene and the sea at the lower left; the mountain extends from the right to the centre at the end of the sea; the pavilion is on the beach at the foot of the mountain; the two sailboats are close to shore; and the three birds fly low between the sailboats and the mountain." In this way the user can interact with the article generation module 13 through the sound pickup element 16 until a satisfactory reply is obtained.
Referring also to Figures 7 and 8, a structural block diagram and fine-positioning schematic of yet another preferred embodiment: this embodiment differs from the above only in that the head-mounted device 1 contains a positioning element 17 electrically connected to the article generation module 13 to produce position information, and the article generation module 13 has a real-scene map database 132, allowing it to integrate the image information, object information, and position information into fine positioning information; the positioning element 17 is a GPS receiver and the real-scene map database 132 stores street-view map data. Because the positioning element 17 relies on satellites, it can give only the user's approximate location, typically within about 100 metres of a road section, and when the user is stationary at a fixed point it cannot determine which way the user is facing. Therefore, when the user presses the second control unit 23 to activate the sound pickup element 16 and it receives a position query, the article generation module 13 integrates the image, object, and position information and searches the street-view data in the real-scene map database 132 to determine the precise position of the street scene ahead and output text describing it.
For example, the positioning element 17 may locate the user within 100 metres of No. 1, XX Road; the image recognition module 12 recognizes the XX coffee shop on the corner ahead; and cross-referencing the real-scene map database 132 yields the exact address of that corner coffee shop within the 100-metre radius. The user thus obtains fine positioning information and the direction currently faced, which not only helps the user understand where they are but also lets them give family and friends their correct location if they get lost.
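The fine-positioning idea above, narrowing a roughly 100-metre GPS fix using a landmark the recognizer saw, can be sketched with a toy street-view database. The database entries, field names, and threshold are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def refine_position(gps, landmark, streetview_db, radius_m=100):
    """Pick the database entry within the GPS radius whose landmark matches
    what the image recognizer reported (e.g. the corner coffee shop)."""
    for entry in streetview_db:
        d = haversine_m(gps[0], gps[1], entry["lat"], entry["lon"])
        if d <= radius_m and entry["landmark"] == landmark:
            return entry["address"]
    return None  # no landmark match inside the GPS uncertainty circle

streetview_db = [
    {"lat": 25.0330, "lon": 121.5654, "landmark": "XX coffee shop", "address": "No. 1, XX Road"},
    {"lat": 25.0500, "lon": 121.5654, "landmark": "XX coffee shop", "address": "No. 99, YY Road"},
]
address = refine_position((25.0331, 121.5655), "XX coffee shop", streetview_db)
```

Even with two branches of the same shop in the database, only the one inside the GPS uncertainty circle is returned, which is exactly how the landmark disambiguates the coarse satellite fix.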
Referring also to Figures 9 and 10, a structural block diagram and person-recognition schematic of another preferred embodiment: this embodiment differs from the above only in that the head-mounted device 1 contains a facial recognition module 18 and a known-person database 133 electrically connected to the facial recognition module 18 and the article generation module 13, so that the article generation module 13 can adjust its text for the persons in the image. In this embodiment the playback device 15 is exemplified by earphones and the facial recognition module 18 by face recognition software such as FaceMe or FaceMaze. Face recognition is a form of biometrics: it extracts facial feature values as vectors and compares them with the feature values of pre-enrolled faces, using deep neural networks, algorithms, and mathematical formulas to measure facial variables, convert them into feature values, and match them against a database to establish the correct identity. Concretely, face recognition involves three steps. First, face detection: even if only part of a face appears in the frame, the technology can accurately scan, detect, and box the position of faces in an image or video. Second, facial feature extraction: the face recognition engine divides the boxed face into n dimensions; for example, a high-precision engine with n = 1024 splits the face into a 1024-dimensional matrix and extracts vector-based feature values for variables such as nose length and width, forehead width, and eye shape. Third, face recognition proper: the feature values are compared with the pre-enrolled faces in the database to identify the person; in a 1:N comparison, the feature values of the face in the frame are compared against N pre-enrolled faces in the database to establish identity.
When the user takes a photograph with the first control unit 21 and the image recognition module 12 detects a person in the image, it automatically hands off to the facial recognition module 18, which compares the person's face against the faces in the known-person database 133. If a match is found, the rather generic person description in the object information (such as "a woman") is replaced with the person's name or title, so that the article generation module 13 can adjust the text sent to the playback device 15, for example to "your sister is approaching". If no match is found in the known-person database 133, the person is judged to be a stranger and the generic description is kept. This not only helps the user identify the people in front of them but also avoids superfluous text.
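The 1:N comparison with a stranger fallback described above can be sketched with cosine similarity over feature vectors. The toy embeddings, the threshold, and the fallback string are illustrative assumptions rather than FaceMe or FaceMaze internals.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def identify_person(face_vec, known_db, threshold=0.9, fallback="a person"):
    """1:N match: compare the captured embedding against each enrolled one;
    return the best-matching name above the threshold, otherwise keep the
    generic description (the person is treated as a stranger)."""
    best_name, best_score = None, threshold
    for name, enrolled in known_db.items():
        score = cosine_similarity(face_vec, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_name else fallback

known = {"your sister": [0.9, 0.1, 0.4]}
print(identify_person([0.88, 0.12, 0.41], known))  # close vector -> "your sister"
```

The threshold is what separates "replace the generic description with a name" from "keep calling them a person", matching the stranger behaviour the embodiment specifies.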
However, the above is merely a preferred embodiment of the present invention and does not limit its patent scope; all simple modifications and equivalent structural changes made using the contents of this specification and drawings shall likewise fall within the patent scope of the present invention.
Accordingly, the key points by which the image-to-speech assistive device for the visually impaired of the present invention improves on the prior art are:
First, the article generation module 13 is ChatGPT, a language model built from deep neural networks: an upgraded chatbot driven by AI natural language processing that can learn and improve on its own and converse relatively naturally and fluently. Combined with the image recognition module 12, it produces fluent sentences describing the content of the image information, making it easy for the user to understand.
Second, driving the camera element 11 on the head-mounted device 1 from the first control unit 21 on the guide cane 2 means a recognition request is issued only when needed, making operation simpler and more discreet.
Third, the input is a single still photograph taken semi-automatically by the user, so the input conditions are simple, the image recognition module 12 and article generation module 13 run faster with a lower load, and recognition and conversion finish more quickly.
Fourth, thanks to the sound pickup element 16, when the user finds the text generated by the article generation module 13 unclear, they can dictate a voice command with more specific conditions and interact with the module until a satisfactory reply is obtained.
Fifth, the positioning element 17 and real-scene map database 132 provide fine positioning information and the direction currently faced, which not only helps the user know where they are but also lets them give family and friends their correct location if they get lost.
Sixth, the facial recognition module 18 and known-person database 133 not only help the user identify the people in front of them but also avoid generating superfluous text about strangers.
In summary, the image-to-speech assistive device for the visually impaired of the present invention truly achieves its efficacy and purpose in use; the invention is therefore of excellent practicality. To meet the requirements for an invention patent, this application is filed in accordance with the law, and the applicant respectfully asks the examiners to grant it at an early date to protect the applicant's hard work; should the examiners have any questions, please do not hesitate to write, and the applicant will cooperate to the fullest.
1: Head-mounted device
11: Camera element
111: First wireless communication element
12: Image recognition module
13: Article generation module
131: Image description database
14: Speech conversion module
15: Playback device
2: Guide cane
21: First control unit
22: Second wireless communication element
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW112129404A TWI839285B (en) | 2023-08-04 | 2023-08-04 | Image-to-speech assistive device for the visually impaired |
Publications (1)
Publication Number | Publication Date |
---|---|
TWI839285B true TWI839285B (en) | 2024-04-11 |
Family
ID=91619071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW112129404A TWI839285B (en) | 2023-08-04 | 2023-08-04 | Image-to-speech assistive device for the visually impaired |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI839285B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101505710A (en) * | 2006-08-15 | 2009-08-12 | 皇家飞利浦电子股份有限公司 | Assistance system for visually handicapped persons |
CN106537290A (en) * | 2014-05-09 | 2017-03-22 | 谷歌公司 | Systems and methods for biomechanically-based eye signals for interacting with real and virtual objects |
TWM590450U (en) * | 2019-10-07 | 2020-02-11 | 陳詠涵 | Wearable navigation and risk escape device |
TW202211895A (en) * | 2020-09-23 | 2022-04-01 | 大葉大學 | Blind guidance assistance method and blind guidance assistance system for achieving the efficacy of prompting and preventing the user from colliding with the corresponding physical object |
US20230050825A1 (en) * | 2021-08-13 | 2023-02-16 | Vilnius Gediminas Technical University | Hands-Free Crowd Sourced Indoor Navigation System and Method for Guiding Blind and Visually Impaired Persons |
TWM648987U (en) * | 2023-08-04 | 2023-12-01 | 上弘醫療設備股份有限公司 | Image-to-speech assistive device for visually impaired |
- 2023-08-04: Application TW112129404A filed; patent TWI839285B (en) active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109065055B (en) | Method, storage medium, and apparatus for generating AR content based on sound | |
JP7483798B2 (en) | Wordflow annotation | |
US11783524B2 (en) | Producing realistic talking face with expression using images text and voice | |
US9949056B2 (en) | Method and apparatus for presenting to a user of a wearable apparatus additional information related to an audio scene | |
JP2019535059A (en) | Sensory eyewear | |
US11482134B2 (en) | Method, apparatus, and terminal for providing sign language video reflecting appearance of conversation partner | |
Prado et al. | Visuo-auditory multimodal emotional structure to improve human-robot-interaction | |
US9525841B2 (en) | Imaging device for associating image data with shooting condition information | |
CN106570473A (en) | Deaf-mute sign language recognition interactive system based on robot | |
Oliveira et al. | Automatic sign language translation to improve communication | |
US20190240588A1 (en) | Communication apparatus and control program thereof | |
TWM648987U (en) | Image-to-speech assistive device for visually impaired | |
Vogler et al. | A framework for motion recognition with applications to American sign language and gait recognition | |
Kanvinde et al. | Bidirectional sign language translation | |
Prabha et al. | Vivoice-Reading Assistant for the Blind using OCR and TTS | |
CN110139021B (en) | Auxiliary shooting method and terminal equipment | |
TWI839285B (en) | Image-to-speech assistive device for the visually impaired | |
JP7130290B2 (en) | information extractor | |
JP7096626B2 (en) | Information extraction device | |
Khan et al. | Sign language translation in urdu/hindi through microsoft kinect | |
Mustafa et al. | Intelligent glasses for visually impaired people | |
Saitoh et al. | Lip25w: Word-level lip reading web application for smart device | |
JP6754154B1 (en) | Translation programs, translation equipment, translation methods, and wearable devices | |
CN111933131A (en) | Voice recognition method and device | |
Mishra et al. | Environment descriptor for the visually impaired |