TWM648987U - Image-to-speech assistive device for visually impaired - Google Patents

Image-to-speech assistive device for the visually impaired

Info

Publication number
TWM648987U
Authority
TW
Taiwan
Prior art keywords
image
head-mounted device
information
generation module
Prior art date
Application number
TW112208257U
Other languages
Chinese (zh)
Inventor
蘇家榮
游靖晧
Original Assignee
上弘醫療設備股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 上弘醫療設備股份有限公司
Priority to TW112208257U
Publication of TWM648987U


Abstract

This utility model relates to an image-to-speech assistive device for the visually impaired. Its main structure comprises a head-mounted device with a camera element, on which are provided a first wireless communication element, an image recognition module, a text generation module, a speech conversion module, and a playback device, together with a guide cane having a first control part for triggering the camera element and a second wireless communication element wirelessly linked to the first wireless communication element. The visually impaired user simply presses the first control part on the cane to trigger the camera element to photograph the scene ahead; object recognition and ChatGPT's AI natural-language processing then convert the image information into text, which is output as speech through the playback device. The device is thus easy to operate, identifies objects quickly, and automatically describes the scene in spoken language.

Description

Image-to-speech assistive device for the visually impaired

The present utility model provides an image-to-speech assistive device for the visually impaired whose operation is simple and unobtrusive and which can quickly identify objects and automatically describe the scene ahead in spoken language.

Visually impaired persons are those with partial or complete impairment in the structure or function of the visual organs. Because they cannot, or can only with difficulty, identify objects in their surroundings, they often rely on visual assistive devices to carry out everyday activities.

With recent technological progress, many functions realized through digital technology have been applied to assistive devices, offering users greater help; such functions can benefit users without disabilities as well. At present, however, few assistive devices have been digitized, and only a handful incorporate image recognition. Complex artificial-intelligence computation requires relatively large processors and huge databases, and such large systems are unsuitable for carrying around; moreover, when AI processes massive amounts of image information, its response time cannot keep up with the user's needs. As a result, visually impaired persons lack a friendly assistive device for going out, or find the existing ones inconvenient.

Taking Republic of China Patent No. M584676, "Digital Assistive Device", as an example, the following problems and deficiencies remain to be improved:

First, its camera is built into a wearable accessory and performs image recognition and voice feedback continuously. The user, however, does not need voice assistance at all times and cannot control when the camera operates, so the user's ears are flooded with messages that may even mask other surrounding sounds, making the device more dangerous rather than safer.

Second, its camera is a high-definition lens that shoots video continuously, placing a heavy load on the computing device and slowing voice feedback.

Third, its semantic output is based on Tesseract and text-to-speech (TTS) technology, which analyzes the meaning of the text arrangement and converts it into speech. Neither technology uses AI natural-language processing; sentences can only be pieced together from pre-recorded words, so the generated output sounds stilted.

How to solve the above problems and deficiencies of the prior art is therefore the direction that the applicant of the present utility model, and the relevant manufacturers in this industry, are eager to research and improve.

In view of the above deficiencies, the applicant collected relevant information, evaluated and weighed it from many angles, and, drawing on many years of experience in this industry, designed through continuous trial and modification the present utility model: an image-to-speech assistive device for the visually impaired whose operation is simple and unobtrusive and which can quickly identify objects and automatically describe the scene ahead in spoken language.

The main purpose of the present utility model is as follows: the text generation module is ChatGPT, a language model built from deep neural networks — an upgraded, AI-driven natural-language-processing chatbot that can learn and improve on its own and converse with humans in a relatively natural, fluent way. Combined with the image recognition module, it can describe the content of the image information in coherent sentences, helping the user understand.

Another main purpose of the present utility model is that the first control part on the guide cane triggers the camera element on the head-mounted device, so a recognition request is issued only when needed, making operation simpler and more discreet.

To achieve the above purposes, the main structure of the present utility model comprises: a head-mounted device with a camera element, a first wireless communication element, an image recognition module, a text generation module, a speech conversion module, and a playback device, together with a guide cane having a first control part and a second wireless communication element. The camera element is mounted on the head-mounted device with its shooting direction aligned with the wearer's line of sight, so as to capture image information. The first wireless communication element is mounted on the head-mounted device and electrically connected to the camera element. The first control part triggers the camera element; the second wireless communication element is electrically connected to the first control part and wirelessly linked to the first wireless communication element. The image recognition module is housed in the head-mounted device and electrically connected to the camera element, so as to identify a plurality of pieces of item information from the image information. The text generation module is housed in the head-mounted device, electrically connected to the image recognition module, and has an image description database, so as to generate coherent text information from the item information; the text generation module is ChatGPT, a language model built from deep neural networks. The speech conversion module is housed in the head-mounted device and electrically connected to the text generation module, so as to convert the text information into a voice message; the playback device is mounted on the head-mounted device and electrically connected to the speech conversion module.

When the user employs the present utility model as a visual assistive device, he or she need only wear the head-mounted device, hold the guide cane, and link the two wirelessly through the first and second wireless communication elements; recognition and conversion can then be performed at any time. When the user arrives at a spot and wants to know what is ahead, pressing the first control part on the guide cane triggers the camera element to capture a still image. The image recognition module then automatically identifies the item information in the image, the text generation module combines it with the data in the image description database to integrate the image information and all item information into coherent text, and finally the speech conversion module converts the text into a voice message that the playback device plays to the user. In this simple, discreet way the device compensates for the user's limited visual recognition of the surroundings.
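The capture-recognize-describe-speak flow just described can be sketched as a small pipeline. This is a minimal illustration, not the patent's actual firmware: every function name and the stubbed detection results are assumptions, and each stub stands in for one numbered module of the device.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Detection:
    label: str  # object name matched from the feature database
    count: int  # number of instances found in the still image

def recognize(image: bytes) -> list[Detection]:
    """Stand-in for the image recognition module (12); a real module would
    run object detection here, so we return the beach-painting example."""
    return [Detection("beach", 1), Detection("pavilion", 1),
            Detection("mountain", 1), Detection("bird", 3), Detection("boat", 2)]

def generate_text(detections: list[Detection]) -> str:
    """Stand-in for the text generation module (13)."""
    items = ", ".join(f"{d.count} {d.label}" for d in detections)
    return f"The scene contains: {items}."

def to_speech(text: str) -> bytes:
    """Stand-in for the speech conversion module (14): TTS would go here."""
    return text.encode("utf-8")

def on_cane_button_pressed(capture) -> bytes:
    """Triggered over Bluetooth by the first control part (21) on the cane."""
    image = capture()                       # camera element (11) takes one still
    text = generate_text(recognize(image))  # modules 12 and 13
    return to_speech(text)                  # module 14, then playback device (15)

audio = on_cane_button_pressed(lambda: b"\x00")  # fake single-frame capture
print(audio.decode("utf-8"))
```

Because the cane button triggers exactly one still frame, the pipeline runs once per request, which mirrors the low-load, on-demand behavior the specification claims over continuous video recognition.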

Through the above techniques, the drawbacks of conventional digital assistive devices — uncontrollable recognition timing, voice messages so frequent that they interfere with normal hearing, the heavy load and low efficiency of continuous video recognition, and stilted semantic output produced without artificial intelligence — can all be overcome, achieving the practical advances listed above.

1: Head-mounted device

11: Camera element

111: First wireless communication element

12: Image recognition module

13: Text generation module

131: Image description database

132: Map street-view database

133: Known persons database

14: Speech conversion module

15: Playback device

16: Sound pickup element

17: Positioning element

18: Facial recognition module

2: Guide cane

21: First control part

22: Second wireless communication element

23: Second control part

Figure 1 is a perspective view of a preferred embodiment of the present utility model.

Figure 2 is a structural block diagram of a preferred embodiment of the present utility model.

Figure 3 is a schematic diagram of photographing in a preferred embodiment of the present utility model.

Figure 4 is a schematic diagram of image recognition in a preferred embodiment of the present utility model.

Figure 5 is a schematic diagram of voice playback in a preferred embodiment of the present utility model.

Figure 6 is a schematic diagram of a further preferred embodiment of the present utility model.

Figure 7 is a structural block diagram of yet another preferred embodiment of the present utility model.

Figure 8 is a schematic diagram of fine positioning in yet another preferred embodiment of the present utility model.

Figure 9 is a structural block diagram of another preferred embodiment of the present utility model.

Figure 10 is a schematic diagram of person recognition in another preferred embodiment of the present utility model.

To achieve the above purposes and effects, the technical means and structure adopted by the present utility model are described in detail below, with drawings, for the preferred embodiments, so that their features and functions may be fully understood.

Please refer to Figures 1 and 2, a perspective view and a structural block diagram of a preferred embodiment. As the figures clearly show, the present utility model comprises:

A head-mounted device 1, which is one of eyeglasses, sunglasses, or goggles;

A camera element 11, mounted on the head-mounted device 1 with its shooting direction aligned with the wearer's line of sight, so as to capture image information; in this embodiment the camera element 11 is exemplified by a lens;

A first wireless communication element 111, mounted on the head-mounted device 1 and electrically connected to the camera element 11;

A guide cane 2, having a first control part 21 for triggering the camera element 11, and a second wireless communication element 22 electrically connected to the first control part 21 and wirelessly linked to the first wireless communication element 111; in this embodiment the first control part 21 is exemplified by a push switch, and the first and second wireless communication elements 111, 22 by a Bluetooth link;

An image recognition module 12, housed in the head-mounted device 1 and electrically connected to the camera element 11, so as to identify a plurality of pieces of item information from the image information; the image recognition module 12 performs object detection, exemplified in this embodiment by software such as YOLO or AI DIY Playform;

A text generation module 13, housed in the head-mounted device 1, electrically connected to the image recognition module 12, and having an image description database 131, so as to generate coherent text information from the image information and the item information; the text generation module 13 is ChatGPT, a language model built from deep neural networks;

A speech conversion module 14, housed in the head-mounted device 1 and electrically connected to the text generation module 13, so as to convert the text information into a voice message; this embodiment performs the conversion with text-to-speech (TTS) technology; and

A playback device 15, mounted on the head-mounted device 1 and electrically connected to the speech conversion module 14; in this embodiment the playback device 15 is exemplified by a speaker.

The above description makes the structure of the present technique clear. Through the coordination of these components, the advantages of simple, discreet operation and rapid object identification with automatic spoken description of the scene are achieved, as detailed below.

Please refer to Figures 1 through 5, the perspective view through the voice-playback schematic of the preferred embodiment. As the figures show, the only physical equipment of the present utility model is the head-mounted device 1 and the guide cane 2 — the same equipment a visually impaired person ordinarily carries. The user therefore simply wears the head-mounted device 1 and holds the guide cane 2, with no visible difference in appearance, and pairs the two over Bluetooth through the first wireless communication element 111 and the second wireless communication element 22, after which recognition and conversion can be performed at any time.

When the user arrives at a spot and wants to confirm what is ahead (for example, a painting), he or she simply presses the first control part 21 on the guide cane 2, which drives the camera element 11 to capture a still image. For the visually impaired user this requires only moving a finger on the cane 2 — a simple, discreet motion that bystanders cannot notice. The image recognition module 12 then automatically identifies the item information in the image. Its recognition principle comprises several parts: image scanning, object feature recognition, and an output interface. Specifically, the camera element 11 first scans the input into one or more images, which are sent to the image recognition module 12 to detect the physical appearance features of the objects in them; the recognition results are matched against a cloud-hosted object feature database, from which the corresponding object names are selected and passed to the text generation module 13 as item information — for example: a beach, a pavilion, a mountain, three birds, two boats.
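A detector returns one match per object instance, so a small aggregation pass is needed to turn raw matches into the counted item list ("three birds, two boats") handed to the text generation module. The sketch below illustrates one way to do this; the labels and the naive English pluralization are illustrative assumptions, not part of the specification.

```python
from __future__ import annotations
from collections import Counter

def to_item_info(raw_labels: list[str]) -> list[str]:
    """Collapse duplicate detections ('bird', 'bird', 'bird') into counted
    item information ('three birds'), as in the beach-painting example."""
    words = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}
    items = []
    for label, n in Counter(raw_labels).items():
        noun = label if n == 1 else label + "s"  # naive English pluralization
        items.append(f"{words.get(n, str(n))} {noun}")
    return items

# Raw matcher output for the painting in the embodiment:
detections = ["beach", "pavilion", "mountain", "bird", "bird", "bird", "boat", "boat"]
print(to_item_info(detections))
# ['one beach', 'one pavilion', 'one mountain', 'three birds', 'two boats']
```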

The text generation module 13 then combines the data in the image description database 131 and integrates all the item information to produce coherent text. The text generation module 13 is ChatGPT, a language model built from deep neural networks; GPT (Generative Pre-trained Transformer) is a large language model and an important framework for generative AI. It accepts text and image input, interacts through natural human dialogue, and can handle quite complex language tasks, including automatic text generation, question answering, and summarization — so given some text or images, the text generation module 13 can automatically compose a semantically coherent passage. The GPT-4 model released in 2023 can further be adapted to specific tasks and/or subject areas to form a more targeted system, for example by incorporating the present image description database 131 into the training so that the text generation module 13 becomes better at generating text that describes the content of image information. In this embodiment the image information is a painting; integrating the item information supplied by the image recognition module 12, the module generates text such as: "The picture is a landscape painting. The scene is a beach, with a mountain at the far end of the beach and the sea; there is a pavilion on the beach, two sailboats on the sea, and three birds in the sky."
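One way to hand the item information to a chat-style language model is to build a request whose system prompt constrains the reply to a short spoken scene description. The payload below follows the common chat-completion message convention as an illustration only: the model name, prompt wording, and the absence of an actual network call are all assumptions, since the specification does not define the interface.

```python
from __future__ import annotations
import json

def build_caption_request(item_info: list[str], model: str = "gpt-4") -> dict:
    """Assemble a chat-completion style request asking a language model to
    describe a scene from the detected item information."""
    scene = ", ".join(item_info)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You describe scenes for a visually impaired listener "
                        "in one short, fluent spoken paragraph."},
            {"role": "user",
             "content": f"The image contains: {scene}. Describe the scene."},
        ],
    }

req = build_caption_request(
    ["a beach", "a pavilion", "a mountain", "three birds", "two boats"])
print(json.dumps(req, indent=2))
```

Keeping the system prompt fixed and varying only the user message is what lets the module later accept follow-up voice commands (such as a positional-relationship query) without rebuilding the whole exchange.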

Finally, the speech conversion module 14 converts the text information into a voice message, which the playback device 15 plays to the user. The conversion principle comprises two steps: text processing and speech synthesis. First, text processing performs linguistic analysis, determines word boundaries, and segments sentences, which improves the quality and fluency of the synthesized speech. Second, speech synthesis generates the speech signal from the analyzed text using specific algorithms and rules, or by concatenating pre-recorded speech stored in a database. Because the input is a single photograph, the input conditions are simple, so the image recognition module 12 and text generation module 13 process it quickly under a low load and complete the recognition and conversion rapidly. For the user, this simple, discreet mode of operation compensates for limited visual recognition of the surroundings.
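The first TTS step described above — segmenting the analyzed text into sentences so each can be synthesized smoothly — can be sketched with a simple rule-based splitter. This is only a minimal stand-in: real engines also resolve abbreviations, numbers, and word boundaries.

```python
from __future__ import annotations
import re

def segment_sentences(text: str) -> list[str]:
    """Minimal text-processing pass for TTS: split after sentence-final
    punctuation (keeping the punctuation) and drop empty fragments."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("The picture is a landscape painting. The scene is a beach. "
        "There are two sailboats on the sea!")
for i, sentence in enumerate(segment_sentences(text), 1):
    print(i, sentence)
```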

Please also refer to Figure 6, a schematic diagram of a further preferred embodiment. As the figure shows, this embodiment differs from the one above only in that the head-mounted device 1 carries a sound pickup element 16 electrically connected to the text generation module 13, which receives the user's voice commands so that the text generation module 13 can adjust the content of the text information, and the guide cane 2 carries a second control part 23 for driving the sound pickup element 16. The sound pickup element 16 is a miniature microphone, and the second control part 23 is a push switch beside the first control part 21. When the user finds the text generated by the text generation module 13 insufficiently clear, he or she can activate the sound pickup element 16 with the second control part 23 and dictate a voice command giving the text generation module 13 more specific conditions, for example: "What is the positional relationship of those things?" The text generation module 13 then regenerates the text from the positions of the items in the image information: "The lower right of the scene is the beach and the lower left the sea; the mountain extends from the right to the center at the end of the sea; the pavilion is on the beach at the foot of the mountain; the two sailboats are near the shore; the three birds are at low altitude between the sailboats and the mountain." In this way the user can interact with the text generation module 13 through the sound pickup element 16 until a satisfactory reply is obtained.
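Answering the positional-relationship query above requires mapping each detection's bounding-box position to a spoken phrase. A minimal sketch follows, with coordinates normalized to 0-1 and thirds-based thresholds chosen arbitrarily for illustration; the specification does not prescribe this mapping.

```python
from __future__ import annotations

def position_phrase(cx: float, cy: float) -> str:
    """Map the normalized center (cx, cy) of a bounding box to a coarse
    spoken position; (0, 0) is the upper-left corner of the image."""
    horiz = "left" if cx < 1/3 else "right" if cx > 2/3 else "center"
    vert = "upper" if cy < 1/3 else "lower" if cy > 2/3 else "middle"
    return f"{vert} {horiz}"

def describe_positions(items: dict[str, tuple[float, float]]) -> str:
    """Join item positions into one sentence, as in the follow-up reply."""
    parts = [f"the {name} is in the {position_phrase(cx, cy)}"
             for name, (cx, cy) in items.items()]
    return "; ".join(parts) + "."

print(describe_positions({
    "beach": (0.8, 0.9),     # lower right
    "sea": (0.2, 0.85),      # lower left
    "pavilion": (0.5, 0.6),  # middle center
}))
```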

Please also refer to Figures 7 and 8, a structural block diagram and a fine-positioning schematic of yet another preferred embodiment. As the figures show, this embodiment differs from the one above only in that the head-mounted device 1 houses a positioning element 17 electrically connected to the text generation module 13, which produces position information, and the text generation module 13 has a map street-view database 132, so that the text generation module 13 can integrate the image information, the item information, and the position information to produce fine positioning information. The positioning element 17 is a GPS receiver, and the map street-view database 132 stores map street-view data. Because the positioning element 17 relies on satellite positioning, it can establish only the user's approximate location — generally to within about 100 meters along a road section — and when the user is stationary at a fixed point it cannot determine the user's orientation. Therefore, when the user presses the second control part 23 to activate the sound pickup element 16 and issues a position query, the text generation module 13 integrates the image information, item information, and position information and searches the street-view data in the map street-view database 132 to determine the precise location of the street scene in front of the user, outputting text information about that precise location.

For example, the positioning element 17 can locate the user to within 100 meters of No. 1, XX Road; the image recognition module 12 determines that the scene ahead is the XX coffee shop occupying the corner unit; integrating the map street-view database 132 then yields the specific address of that corner-unit XX coffee shop within 100 meters of No. 1, XX Road. Obtaining fine positioning information and the direction currently faced not only helps the user know where he or she is, but also allows correct location information to be given to family and friends when lost.
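The fine-positioning step — narrowing a rough GPS fix by matching the recognized storefront against street-view records within the ~100 m radius — can be sketched with a haversine distance filter over a mock database. All record names, coordinates, and addresses below are invented for illustration; the patent's database 132 is simply "map street-view data".

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS-84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Mock street-view records: (name, lat, lon, address) — invented values.
STREET_VIEW_DB = [
    ("XX coffee shop", 25.0340, 121.5645, "No. 3, XX Road (corner unit)"),
    ("bookstore",      25.0349, 121.5651, "No. 7, XX Road"),
    ("XX coffee shop", 25.0450, 121.5800, "No. 88, YY Road"),  # outside radius
]

def fine_position(label, gps_lat, gps_lon, radius_m=100.0):
    """Return the address of the recognized landmark within the GPS radius,
    or None if no record of that name lies inside it."""
    for name, lat, lon, address in STREET_VIEW_DB:
        if name == label and haversine_m(gps_lat, gps_lon, lat, lon) <= radius_m:
            return address
    return None

print(fine_position("XX coffee shop", 25.0342, 121.5648))
```

The combination of a coarse radius filter and an exact landmark-name match is what disambiguates two identically named shops on different streets, which GPS alone cannot do.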

Please also refer to Figures 9 and 10, a structural block diagram and a person-recognition schematic of another preferred embodiment. As the figures show, this embodiment differs from the one above only in that the head-mounted device 1 houses a facial recognition module 18, and a known persons database 133 electrically connected to the facial recognition module 18 and the text generation module 13, so that the text generation module 13 can adjust the content of the text information for the persons appearing in the image information. In this embodiment the playback device 15 is exemplified by earphones, and the facial recognition module 18 by face recognition software such as FaceMe or FaceMaze. Face recognition is a form of biometric identification: it extracts facial feature values as vectors and compares them with the feature values of previously registered faces, using deep neural networks, algorithms, and mathematical formulas to measure the variables of a face, convert them into feature values, and match them against a database to establish the face's correct identity. Specifically, face recognition involves three steps: face detection, facial feature extraction, and face identification. First, face detection: even when only part of a face appears in the frame, face detection technology can accurately scan, detect, and box the position of the face in an image or video. Second, facial feature extraction: the recognition engine divides the boxed face into n dimensions — for a high-precision engine with n = 1024, the face is split into a 1024-dimension matrix — and extracts vector-based facial feature values from variables such as the length and width of the nose, the width of the forehead, and the shape of the eyes. Third, face identification: the facial feature values are compared with the faces pre-registered in the database to identify the correct person; in 1:N comparison, for example, the feature values of the face in the frame are compared against N pre-registered faces in the database to establish identity.

When the user takes a photo with the first control part 21 and the image recognition module 12 detects a person in the image information, the system automatically invokes the facial recognition module 18 to compare that person's face information against the face information in the known-person database 133. If matching face information is found, the generic person description in the item information (such as "a woman") is changed to that person's name or title, so that the article generation module 13 can adjust the text information sent to the playback device 15, for example to "Your sister is approaching." If no matching face information exists in the known-person database 133, the person is judged to be a stranger, and the output item information retains the original generic description. This not only helps the user identify the person in front of them but also avoids superfluous text information.
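The substitution step described above can be sketched in a few lines. The function name and the generic label are illustrative assumptions; the patent does not specify how the item information is represented internally.

```python
def personalize_description(generic_label: str, identity) -> str:
    """Replace a generic person description with a known identity;
    keep the generic wording when the person is a stranger (identity None)."""
    if identity is None:
        return generic_label      # e.g. "a woman" stays unchanged
    return identity               # e.g. "your sister"

def personalize_items(items, identity):
    """Apply the substitution to each detected-person entry in the
    item information, leaving non-person items untouched."""
    return [personalize_description(label, identity) if is_person else label
            for label, is_person in items]
```

For a recognized face, `personalize_items([("a woman", True), ("a bench", False)], "your sister")` yields a list in which only the person entry is renamed.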

However, the foregoing describes only preferred embodiments of the present invention and does not thereby limit its patent scope. Accordingly, all simple modifications and equivalent structural changes made using the contents of the specification and drawings of the present invention shall likewise be included within the patent scope of the present invention, as hereby stated.

Accordingly, the key technical improvements of the image-to-speech assistive device for the visually impaired of the present invention over the prior art are as follows:

First, the article generation module 13 is the language model chatGPT, built on a deep neural network. It is an upgraded chatbot driven by AI natural language processing that can learn and improve on its own and simulate natural, fluent human dialogue. Combined with the image recognition module 12, it can therefore produce coherent sentences describing the content of the image information, making it easier for the user to understand.
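One way the recognized item information might be handed to the language model is as a single prompt. This is a sketch under assumptions: `send_to_language_model` is a placeholder callable, since the patent does not specify the chatGPT interface used.

```python
def build_scene_prompt(objects) -> str:
    """Turn the object labels from the image recognition module into an
    instruction asking for one fluent, listener-friendly sentence."""
    listing = ", ".join(objects)
    return ("Describe the following scene to a visually impaired listener "
            f"in one short, natural sentence: {listing}.")

def describe_scene(objects, send_to_language_model) -> str:
    """Forward the assembled prompt to the language model (placeholder
    callable) and return its text reply."""
    return send_to_language_model(build_scene_prompt(objects))
```

The point of the design is that the recognition module only needs to emit labels; the language model is responsible for turning them into a grammatical sentence.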

Second, because the first control part 21 on the guide cane 2 drives the camera element 11 on the head-mounted device 1, a recognition request is issued only when needed, making operation simpler and more discreet.

Third, the input is a single static photo taken semi-automatically by the user. These simple input conditions let the image recognition module 12 and the article generation module 13 run faster with a lower processing load, completing the recognition-and-conversion process more quickly.

Fourth, with the sound-receiving element 16, when the user finds the text information produced by the article generation module 13 insufficiently clear, the user can speak a voice command to supply more specific conditions, interacting with the article generation module 13 through the sound-receiving element 16 until a satisfactory reply is obtained.
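The follow-up interaction described above can be sketched as a single refinement step. The `transcribe` and `send_to_language_model` callables are placeholders, since the patent does not specify the speech-recognition or model interfaces.

```python
def refine_description(first_reply: str, audio_frames,
                       transcribe, send_to_language_model) -> str:
    """Combine the earlier spoken description with the user's voice
    follow-up and ask the language model for a more specific answer."""
    follow_up = transcribe(audio_frames)   # placeholder speech-to-text step
    prompt = (f"Earlier description: {first_reply}\n"
              f"The listener asks: {follow_up}\n"
              "Answer the question about the same scene in one short sentence.")
    return send_to_language_model(prompt)
```

In practice this step could be repeated until the user is satisfied, with each reply becoming the `first_reply` of the next round.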

Fifth, with the positioning element 17 and the map street-view database 132, fine positioning information and the direction currently faced can be obtained. This not only helps the user understand where they are, but also lets them give family and friends accurate information about their own location if they become lost.
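Combining a position fix and a compass heading into spoken-friendly fine positioning information might look like the following. The `lookup_landmark` callable is a stand-in for a query against the map street-view database 132, which the patent does not detail.

```python
def fine_position(lat: float, lon: float, heading_deg: float,
                  lookup_landmark) -> str:
    """Compose a position summary from coordinates, compass heading,
    and the nearest known landmark (placeholder database lookup)."""
    compass = ["north", "northeast", "east", "southeast",
               "south", "southwest", "west", "northwest"]
    # Round the heading to the nearest of eight compass sectors.
    direction = compass[int((heading_deg % 360) + 22.5) // 45 % 8]
    landmark = lookup_landmark(lat, lon)
    return f"You are near {landmark}, facing {direction}."
```

The resulting sentence could be passed through the voice conversion module 14 like any other text information, or relayed to family and friends when the user is lost.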

Sixth, the design of the facial recognition module 18 and the known-person database 133 not only helps the user identify the person in front of them, but also avoids generating superfluous text information about strangers.

In summary, the image-to-speech assistive device for the visually impaired of the present invention truly achieves its intended effects and purposes in use. The present invention is therefore a utility model of excellent practicality that meets the requirements for a utility model patent application, and this application is filed in accordance with the law. We respectfully ask the examiners to grant this utility model promptly to protect the applicant's diligent creation; should the examiners of the Office have any doubts, please do not hesitate to write with instructions, and the applicant will cooperate fully, with gratitude.

1: Head-mounted device

11: Camera element

111: First wireless communication element

12: Image recognition module

13: Article generation module

131: Image description database

14: Voice conversion module

15: Playback device

2: Guide cane

21: First control part

22: Second wireless communication element

Claims (7)

1. An image-to-speech assistive device for the visually impaired, mainly comprising: a head-mounted device; a camera element mounted on the head-mounted device, its photographing direction matching the line-of-sight direction of the head-mounted device, for acquiring image information; a first wireless communication element provided on the head-mounted device and electrically connected to the camera element; a guide cane having a first control part for driving the camera element, and a second wireless communication element electrically connected to the first control part and wirelessly connected to the first wireless communication element; an image recognition module provided in the head-mounted device and electrically connected to the camera element, for recognizing a plurality of item information from the image information; an article generation module provided in the head-mounted device, electrically connected to the image recognition module, and having an image description database, for generating coherent text information from the image information and the item information, the article generation module being the language model chatGPT built on a deep neural network; a voice conversion module provided in the head-mounted device and electrically connected to the article generation module, for converting the text information into a voice message; and a playback device provided on the head-mounted device and electrically connected to the voice conversion module.
2. The image-to-speech assistive device for the visually impaired of claim 1, wherein the head-mounted device is one of glasses, sunglasses, or goggles.
3. The image-to-speech assistive device for the visually impaired of claim 1, wherein the head-mounted device is provided with a sound-receiving element electrically connected to the article generation module, for receiving the user's voice commands so that the article generation module can adjust the content of the text information.
4. The image-to-speech assistive device for the visually impaired of claim 3, wherein the guide cane is provided with a second control part for driving the sound-receiving element.
5. The image-to-speech assistive device for the visually impaired of claim 3, wherein the head-mounted device contains a positioning element electrically connected to the article generation module for generating position information, and the article generation module has a map street-view database, enabling the article generation module to integrate the image information, the item information, and the position information to produce fine positioning information.
6. The image-to-speech assistive device for the visually impaired of claim 1, wherein the head-mounted device contains a facial recognition module, and a known-person database electrically connected to the facial recognition module and the article generation module, enabling the article generation module to adjust the content of the text information according to the persons in the image information.
7. The image-to-speech assistive device for the visually impaired of claim 1, wherein the playback device is either a speaker or an earphone.
TW112208257U 2023-08-04 2023-08-04 Image-to-speech assistive device for visually impaired TWM648987U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112208257U TWM648987U (en) 2023-08-04 2023-08-04 Image-to-speech assistive device for visually impaired

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112208257U TWM648987U (en) 2023-08-04 2023-08-04 Image-to-speech assistive device for visually impaired

Publications (1)

Publication Number Publication Date
TWM648987U true TWM648987U (en) 2023-12-01

Family

ID=90040220

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112208257U TWM648987U (en) 2023-08-04 2023-08-04 Image-to-speech assistive device for visually impaired

Country Status (1)

Country Link
TW (1) TWM648987U (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI839285B (en) * 2023-08-04 2024-04-11 上弘醫療設備股份有限公司 Image-to-speech assistive device for the visually impaired


Similar Documents

Publication Publication Date Title
CN109065055B (en) Method, storage medium, and apparatus for generating AR content based on sound
US7676372B1 (en) Prosthetic hearing device that transforms a detected speech into a speech of a speech form assistive in understanding the semantic meaning in the detected speech
Rajmohan et al. Efficient Indian Sign Language Interpreter For Hearing Impaired
Kishore et al. Optical flow hand tracking and active contour hand shape features for continuous sign language recognition with artificial neural networks
JP2019531538A (en) Wordflow annotation
KR20170034409A (en) Method and apparatus to synthesize voice based on facial structures
Madhuri et al. Vision-based sign language translation device
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
Prado et al. Visuo-auditory multimodal emotional structure to improve human-robot-interaction
Oliveira et al. Automatic sign language translation to improve communication
TWM648987U (en) Image-to-speech assistive device for visually impaired
CN110524559A (en) Intelligent human-machine interaction system and method based on human behavior data
Patil et al. Guidance system for visually impaired people
Kanvinde et al. Bidirectional sign language translation
KR100348823B1 (en) Apparatus for Translating of Finger Language
KR100730573B1 (en) Sign Language Phone System using Sign Recconition and Sign Generation
TWI839285B (en) Image-to-speech assistive device for the visually impaired
Khan et al. Sign language translation in urdu/hindi through microsoft kinect
Saitoh et al. Lip25w: Word-level lip reading web application for smart device
JP2022075662A (en) Information extraction apparatus
Bhuiyan et al. An assistance system for visually challenged people based on computer vision and iot
JP2022075661A (en) Information extraction apparatus
JP2023117068A (en) Speech recognition device, speech recognition method, speech recognition program, speech recognition system
Özkul et al. Multimodal analysis of upper-body gestures, facial expressions and speech
Manglani et al. Lip Reading Into Text Using Deep Learning