TW202011219A

TW202011219A - System for document searching using results of text analysis and natural language input

Info

Publication number: TW202011219A
Application number: TW107130696A
Authority: TW
Inventors: 劉秉錦; 林鼎超; 林庭箴
Original assignee: 愛酷智能科技股份有限公司
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2020-03-16
Also published as: TWI682286B

Abstract

A system for document searching using results of text analysis and natural language input is disclosed. The system includes a keyword capture unit, a database module, a network platform communication unit, a statement parsing unit, a file list providing unit, and a sentence learning unit. The invention can perform keyword analysis on stored files to enhance searching speed. It can also perform file search by receiving user input of natural language containing keywords on the cloud platform, further facilitating processes to search files and saving time for obtaining related files.

Description

Document search system using text analysis results and natural language input

The invention relates to a document search system, and in particular to a document search system that uses text analysis results and natural language input.

In traditional file management, files are usually classified by attribute (such as text, image, or audiovisual file), by characteristic (such as content topic or creation time), by file name, and so on, and are stored in the corresponding folders. All folders are presented in a tree structure so that users can access files easily. Under this structure, finding a specific file is convenient as long as the file name or the folder attributes are known.

When data is openly accessible over a network, searching for a specific file, or for files related to a certain topic, can take a long time unless a specific URL is given. First, users are unfamiliar with the file storage structure. Second, the search keywords must be compared against all files (names or metadata), which takes time. Finally, many users may be using the file access service at the same time, and the resulting data processing also causes delay. A good document search system is therefore needed to reduce the delays caused by these factors.

Take Google keyword search as an example. It is a technology that uses keywords to find relevant pages among all web page data on the Internet and to provide links to them. Because a web page is itself stored as a folder or file on some storage device, this approach can serve as a reference for building the document search system described above. The Google search engine mainly does two things, crawling websites and building an index of them, and it uses page relevance to rank the result pages. A search engine works by searching indexed pages rather than browsing the entire database content of every website. It uses data-mining crawlers (a kind of software) to crawl for the keywords entered by users, and provides relevant web page links based on the keywords of the page, the keywords of the content, and the relevance of the website. The whole process can take less than half a second. In theory, therefore, once corresponding keyword data has been built for the files stored in a cloud database, the desired data can be found quickly from the keywords received.

On the other hand, when a user searches for files by keyword through a specific or unspecified cloud platform, the keyword alone sometimes cannot guarantee that all related files are found. For example, a user who enters 「facebook」 or 「我要facebook」 ("I want facebook") is probably looking for files related to facebook; however, the input does not say whether matching files should be judged by file name, content, or metadata. In addition, if a file contains the keyword 「臉書」 (the Chinese name for Facebook), should it also be provided to the user?

In view of these needs, what is required is a document search system that performs keyword analysis on stored files to speed up searching, and that can also search for documents from natural-language input containing keywords that users enter on a cloud platform, so that users can search for files more easily and obtain related files in less time.

This section extracts and summarizes certain features of the invention. Other features are disclosed in the subsequent paragraphs. It is intended to cover various modifications and similar arrangements within the spirit and scope of the appended claims.

The object of the present invention is to provide a document search system using text analysis results and natural language input, which performs keyword analysis on stored files to speed up searching and can also search for documents from natural-language input containing keywords that users enter on a cloud platform, so that users can search for files more easily and obtain related files in less time.

The system includes: a keyword capture unit for obtaining, from a file containing text information, the character codes corresponding to that text information, and extracting at least one file noun keyword from the character codes; a database module including a file database storing a plurality of files, a noun keyword database storing a plurality of noun keywords, and a file keyword link database storing a comparison table that maps each file in the file database to the at least one file noun keyword extracted by the keyword capture unit, wherein the comparison table further records the keyword source of every file noun keyword mapped to the file; a network platform communication unit operating to connect to a network platform server through a network, to receive a query phrase entered in natural language by a user of a network platform running on the network platform server, and to send a file list or a guide sentence for the query phrase to the network platform server for delivery to the user; a statement parsing unit for extracting at least one query noun keyword and the corresponding keyword source from the query phrase and, when the keyword source is missing, issuing the guide sentence through the network platform communication unit to obtain the keyword source; and a file list providing unit for querying the comparison table with the query noun keyword and the keyword source, compiling the file names of all matching files into the file list, and issuing the file list through the network platform communication unit. The at least one file noun keyword extracted by the keyword capture unit and the at least one query noun keyword extracted by the statement parsing unit both come from the noun keyword database.

The document search system using text analysis results and natural language input may further include a sentence learning unit. Through a machine learning algorithm, this unit learns a plurality of equivalent text combinations for the text that remains after the query noun keyword and the keyword source are removed from the query phrase, and it uses these equivalent text combinations to help determine the query noun keyword, the keyword source, and/or the meaning of the query phrase.

Preferably, the machine learning algorithm is the TF-IDF algorithm.

In one embodiment, the noun keyword database may further store the synonymous noun keywords corresponding to each noun keyword. When the statement parsing unit extracts a query noun keyword, the file list providing unit queries the comparison table with the query noun keyword, its synonymous noun keywords, and the keyword source, compiles the file names of all matching files into the file list, and issues the file list through the network platform communication unit.

According to the present invention, the text information may take the form of character codes, a text image, or a file with an embedded text image object. Text images and files with embedded text image objects are converted into the corresponding character codes by optical character recognition (OCR). The file format may be Portable Document Format (PDF), the PowerPoint file format, a PowerPoint-compatible file format, the Word file format, a Word-compatible file format, the Excel file format, or an Excel-compatible file format. The keyword source may be the file name, the file content, or the file metadata.

The method for extracting the query noun keyword and the keyword source may include the steps of: (a) segmenting the query phrase into words; (b) assigning a part of speech to each segmented word; (c) classifying words that are nouns and also appear in the noun keyword database as query noun keywords; and (d) classifying words that are nouns and also match a synonym of one of a plurality of preset keyword sources as the keyword source.

The noun keywords may be taken from the Chinese sentence-structure treebank database of Academia Sinica (中央研究院中文句結構樹資料庫), Republic of China.

The invention uses the keyword capture unit to find the file noun keywords in each file and so establish the association between files and file noun keywords. The statement parsing unit and the file list providing unit analyze the user's query phrase to find, quickly and accurately, the files that correspond to the query noun keywords in that phrase, and present all of those files to the user as a list. This meets the need for faster searching and for document search driven by natural-language input containing keywords entered on a cloud platform, making it easier for users to search for files and saving the time needed to obtain related files.

The present invention will be described more specifically with reference to the following embodiments.

Refer to FIG. 1, which illustrates the components and operation of a document search system using text analysis results and natural language input (hereinafter, the system) according to an embodiment of the present invention. The system may be deployed on a server 10 and communicates with at least one network platform server 30 through a network 20. The system includes a keyword capture unit 110, a database module 120, a network platform communication unit 130, a statement parsing unit 140, a file list providing unit 150, and a sentence learning unit 160. The function of each component and the way the components interact are described below.

The keyword capture unit 110 obtains, from a file containing text information, the character codes corresponding to that text information, and extracts at least one file noun keyword from the character codes. One aim of the system is to keep track of the keywords in every file stored in the database module 120, so that files related to what the user wants (text files, image files, or other files with embedded images) can be found quickly. The keyword capture unit 110 is therefore what "looks inside" a file to learn which keywords it contains. The text information in a file may be the character codes themselves, which may be, but are not limited to, ASCII, ISO/IEC 646, the DOS character sets, the Windows character sets, Big5, CNS 11643, ISO/IEC 2022, GB 2312, EUC, Unicode, UTF-8, and so on. These character encodings represent region-specific scripts, all known scripts, and some pictorial characters as specific codes. Any electronic document editor may use at least one of them to represent the text the user edits. The text information is thus a sequence of words that can be represented in some character encoding.

The text information may also be a text image. The simplest example is a photograph of a page of a book, which necessarily contains the text of that page. For a text image, the keyword capture unit 110 may use optical character recognition (OCR) to convert it into the corresponding character codes. OCR has been in use for many years, and the relevant software or hardware may be integrated into the keyword capture unit 110 or may operate outside it as a supporting module. The text information may further be a file with an embedded text image object, such as a Word or PDF file in which the aforementioned photograph is embedded. The text in such an embedded text image object can likewise be obtained by OCR.
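As a rough illustration of obtaining character codes from a stored file, the sketch below tries a short list of candidate encodings and returns the first clean decode. The function name `decode_text`, the candidate list, and its order are assumptions made for this example; the patent does not say how the encoding is determined, and a byte sequence can be valid in more than one encoding, so this is only a heuristic.

```python
from typing import Iterable, Optional

# Hypothetical candidate list; the choice and order are assumptions.
DEFAULT_ENCODINGS = ("utf-8", "big5", "gb2312")


def decode_text(data: bytes,
                encodings: Iterable[str] = DEFAULT_ENCODINGS) -> Optional[str]:
    """Return the text from the first candidate encoding that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
    # No candidate fit; a fuller system might hand the file to OCR instead.
    return None


if __name__ == "__main__":
    print(decode_text("臉書".encode("utf-8")))
    print(decode_text("臉書".encode("big5"), encodings=("big5",)))
```

Trying UTF-8 first is a common choice because its strict decoder rejects most byte sequences produced by legacy double-byte encodings, but the order remains a design decision rather than something the source prescribes.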

Files may be in many formats, such as Portable Document Format (PDF), the PowerPoint file format, a PowerPoint-compatible file format, the Word file format, a Word-compatible file format, the Excel file format, and an Excel-compatible file format. A "compatible format" here means the following: when a file format can be opened by two or more editors, and one of those editors can convert that format into another format of its own, that other format is a compatible format. For example, Writer in OpenOffice can open and edit the Word file format just as Microsoft Office Word can, but Writer can also save the file in its own ".odt" format, so the odt format is a Word-compatible file format.

Because the character codes carry a great deal of text, the keyword capture unit 110 can extract at least one file noun keyword from them. The file noun keywords to be extracted are preset in the database module 120, so they are easily found by comparing them against the text content. A "noun keyword" is a keyword whose part of speech is noun. For a better understanding of file noun keyword extraction, see FIG. 3, which shows the content of a Word file. The text content appears in the upper half of the figure. The text enclosed by dashed boxes in the lower half is the file name and the file metadata, neither of which is visible when the Word file is opened; both can be found only through the file manager. The words enclosed by solid boxes in the text content are the file noun keywords of the Word file, such as Facebook, 文章 (article), 新聞 (news), 臉書 (the Chinese name for Facebook), 原始碼 (source code), 朋友 (friend), 說明 (description), 操作方式 (operation method), and 版本 (version). Other words, such as 一篇 (a piece of), 解碼 (decode), 顯示 (display), 關心 (care about), and 簡單 (simple), are not nouns and are not extracted. The word 「緯工邦」, the name of a blog owner, is a noun, but because it is rarely used or not well known, the database module 120 does not list it as a noun keyword, so it is not extracted either. Conversely, if the blog owner becomes famous and the name comes to be used as often as facebook, the database module 120 may list it as a noun keyword, and the keyword capture unit 110 will then extract it as a file noun keyword.
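The matching against preset noun keywords described above can be sketched as a simple dictionary lookup. This is a minimal illustration, not the patented implementation: the function name `extract_file_keywords`, the case-insensitive substring match, and the sample sentence are all assumptions.

```python
from typing import Iterable, List


def extract_file_keywords(text: str, noun_keywords: Iterable[str]) -> List[str]:
    """Return the preset noun keywords found in text, ordered by first occurrence."""
    lowered = text.lower()
    hits = []
    for keyword in noun_keywords:
        position = lowered.find(keyword.lower())
        if position >= 0:  # keyword occurs somewhere in the text
            hits.append((position, keyword))
    return [keyword for _, keyword in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical sample text and keyword list, loosely modeled on FIG. 3.
    preset = ["文章", "Facebook", "原始碼", "版本"]
    sample = "這是一篇關於Facebook原始碼的文章，作者是緯工邦"
    print(extract_file_keywords(sample, preset))
```

Because 「緯工邦」 is absent from the preset list, it is never returned, which mirrors the behavior described for a name that the database module has not yet listed.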

The database module 120 contains several databases: a file database 121, a noun keyword database 122, and a file keyword link database 123. The file database 121 stores many of the files described above, and the noun keyword database 122 stores all of the noun keywords (not only the "file" noun keywords that come from a single file). Noun keywords can come from many sources. In this embodiment, they come from the Chinese sentence-structure treebank database of Academia Sinica (中央研究院中文句結構樹資料庫), Republic of China. The file keyword link database 123 stores a comparison table that maps each file in the file database 121 to the at least one file noun keyword extracted by the keyword capture unit 110. Continuing the previous example, suppose the Word file is named "10244852.doc" and is stored in the file database 121. One entry of the comparison table is then 「10244852.doc」 → 「Facebook；文章；新聞；臉書；原始碼；朋友；說明；操作方式；版本」, where "→" denotes the link. That is, the file noun keywords associated with "10244852.doc" are Facebook, 文章 (article), 新聞 (news), 臉書, 原始碼 (source code), 朋友 (friend), 說明 (description), 操作方式 (operation method), and 版本 (version).

According to the present invention, the comparison table may additionally record the keyword source of every file noun keyword mapped to a file. The keyword source indicates where the file noun keyword was found. In this embodiment, the keyword source may be the file name, the file content, or the file metadata. The relationship among the three has been illustrated with FIG. 3 and is not repeated here.
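One way to picture a comparison table that also records the keyword source is a nested mapping from file name to source to keyword set. The class below is a sketch under that assumption; the name `ComparisonTable`, its methods, and the English source labels are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# The three keyword sources named in the embodiment (labels are assumptions).
SOURCES = ("file name", "file content", "metadata")


class ComparisonTable:
    """Maps each file to its file noun keywords, grouped by keyword source."""

    def __init__(self) -> None:
        # file name -> keyword source -> set of file noun keywords
        self._rows: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set))

    def add(self, file_name: str, source: str, keywords: Iterable[str]) -> None:
        if source not in SOURCES:
            raise ValueError(f"unknown keyword source: {source}")
        self._rows[file_name][source].update(keywords)

    def files_for(self, keyword: str, source: str) -> List[str]:
        """Return the sorted names of files holding keyword under source."""
        return sorted(name for name, by_source in self._rows.items()
                      if keyword in by_source.get(source, ()))


if __name__ == "__main__":
    table = ComparisonTable()
    table.add("10244852.doc", "file content", ["Facebook", "文章", "版本"])
    print(table.files_for("Facebook", "file content"))
```

Keeping the source as a separate key lets the same keyword be recorded once per place it was found, so a query restricted to file names does not return files that only mention the keyword in their content.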

The network platform communication unit 130 operates to connect to the network platform server 30 through the network 20. It can also receive a query phrase entered in natural language by a user of the network platform 300 running on the network platform server 30. For ease of explanation, the user may use a desktop computer 410 or a smartphone 420 as the hardware that connects to the network platform server 30, log in to or connect to the network platform 300, and thereby connect to the network platform communication unit 130 of the system. Here, the network platform 300 refers to any service platform that distributes or interconnects information over a network. In practice, the network platform 300 may be a blog web service platform, in which case the network platform communication unit 130 can work through web formats and drive the client, such as the browser on the desktop computer 410, with specific JavaScript code. The network platform 300 may also be a social or messaging platform such as LINE or facebook, in which case the network platform communication unit 130 can connect to it through the platform's public or licensed API. In either case, the network platform communication unit 130 lets users enter query phrases in natural language.

Natural language is the language ordinary people use. It has no fixed format, and the way an individual uses language may make it hard to interpret. A query phrase is all the text a user sends in order to find files, and it may be a single noun or a longer sentence. Several examples of query phrases are shown in FIG. 5 and are explained in detail below. After the query phrase has been processed, the network platform communication unit 130 sends a file list for that query phrase to the network platform server 30, which delivers it through the network platform 300 to the user who sent the query phrase. If the system cannot determine the meaning of the query phrase, it sends a guide sentence instead. The file list and the guide sentence are described below.

The statement parsing unit 140 extracts at least one query noun keyword and the corresponding keyword source from the query phrase. The flow of the extraction method is shown in FIG. 2. First, the query phrase is segmented into words (S01). For example, the query phrase 「我想找關於蝴蝶相關的檔案」 ("I want to find files related to butterflies") is segmented into 「我／想找／關於／蝴蝶／相關的／檔案」. Next, each segmented word is assigned a part of speech (S02), giving 「我（pronoun）／想找（verb）／關於（preposition）／蝴蝶（noun）／相關的（adjective）／檔案（noun）」. In the third step, words that are nouns and also appear in the noun keyword database 122 are classified as query noun keywords (S03). In this example, 「蝴蝶」 (butterfly) and 「檔案」 (file) both qualify, but not every qualifying word necessarily ends up as a query noun keyword. Finally, words that are nouns and also match a synonym of one of a plurality of preset keyword sources are classified as the keyword source (S04). As described above, the keyword source may be the file name, the file content, or the file metadata, and each of these has synonyms. For example, the preset synonyms of file content may include 檔案, file, and content. The word 「檔案」 found in step S03 is therefore treated as the place to search, that is, the keyword source. The extracted query noun keyword is thus 「蝴蝶」 and the keyword source is 「檔案」; in other words, the system looks in the file content of all files for the word 「蝴蝶」 and marks the files that contain it.
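Steps S01 to S04 can be sketched as follows. Everything named here is an assumption for illustration: the toy `LEXICON` with part-of-speech tags, the longest-match segmenter standing in for a real Chinese word segmenter, the `SOURCE_SYNONYMS` table, and the tie-breaking rule (a noun that matches a source synonym is treated as the keyword source only when another query noun keyword is present, otherwise it stays a keyword). The source describes the outcome for its examples but does not specify these mechanics.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy lexicon with part-of-speech tags (hypothetical, covers one example only).
LEXICON: Dict[str, str] = {
    "我": "pronoun", "想找": "verb", "關於": "preposition",
    "蝴蝶": "noun", "相關的": "adjective", "檔案": "noun",
}
# Synonyms of the preset keyword sources (hypothetical selection).
SOURCE_SYNONYMS: Dict[str, str] = {
    "檔案": "file content", "file": "file content", "content": "file content",
    "檔名": "file name", "filename": "file name",
    "元資料": "metadata", "metadata": "metadata",
}


def segment(sentence: str, lexicon: Dict[str, str]) -> List[str]:
    """S01: greedy longest-match segmentation against the lexicon."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if word in lexicon or length == 1:  # single characters always pass
                tokens.append(word)
                i += length
                break
    return tokens


def parse_query(sentence: str, lexicon: Dict[str, str], noun_keywords: Set[str],
                source_synonyms: Dict[str, str]) -> Tuple[List[str], Optional[str]]:
    tokens = segment(sentence, lexicon)                               # S01
    tagged = [(t, lexicon.get(t, "unknown")) for t in tokens]         # S02
    nouns = [t for t, pos in tagged if pos == "noun"]
    keywords = [t for t in nouns if t in noun_keywords]               # S03
    source_words = [t for t in nouns if t in source_synonyms]         # S04
    remaining = [t for t in keywords if t not in source_words]
    if remaining and source_words:
        return remaining, source_synonyms[source_words[0]]
    # No separate keyword (or no source word): keywords take priority.
    return keywords, None


if __name__ == "__main__":
    print(parse_query("我想找關於蝴蝶相關的檔案", LEXICON,
                      {"蝴蝶", "檔案"}, SOURCE_SYNONYMS))
```

A returned source of `None` is the case in which the system would issue a guide sentence to ask the user which keyword source is meant.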

Refer again to FIG. 5. Each column of the table in FIG. 5 represents one type of query phrase. For ease of explanation, query noun keywords are underlined and keyword sources are shown in bold italics. The first column contains only a single noun, which is clearly the query noun keyword. Because the keyword source is missing, the number of matching files would be too large, so the statement parsing unit 140 issues the guide sentence through the network platform communication unit 130 to obtain the keyword source. The second column, 「有臉書的相關資料嗎?」 ("Is there any information related to 臉書?"), makes clear that the query noun keyword is 「臉書」, but 「相關資料」 ("related information") does not indicate which keyword source is meant, so a guide sentence should still be issued to confirm it. The query phrase in the third column is stated clearly: 「facebook」 is the query noun keyword and 「檔案」 ("file", that is, file content) is the keyword source. The fourth column uses Boolean syntax to form a union, searching for the query noun keyword 「臉書」 or 「facebook」, but the keyword source is unknown. The fifth column joins two nouns with a space; the relationship may be an intersection or a union, as defined by the statement parsing unit 140, and the keyword source is again unknown. The sixth column specifies searching the keyword source "file content" with the query noun keyword 「臉書」 and the keyword source "file name" with the query noun keyword 「facebook」. Because content and filename are both synonyms of keyword sources, the statement parsing unit 140 can handle this as well. The seventh column uses the Boolean nor syntax to request files whose content contains 「facebook」 but that are not Excel files. Because the query noun keyword and the keyword source can both be extracted, the statement parsing unit 140 can also handle such a query phrase. In the last column, 「檔案」 ("file") could be either a query noun keyword or a keyword source. The statement parsing unit 140 gives priority to query noun keywords, so 「檔案」 is treated as a query noun keyword, and a corresponding guide sentence is sent to the user who submitted the query phrase. Note that the at least one file noun keyword extracted by the keyword capture unit 110, like the at least one query noun keyword extracted by the statement parsing unit 140, comes from the noun keyword database 122.

The file list providing unit 150 queries the comparison table with the query noun keyword and the keyword source, compiles the file names of all matching files into the file list, and issues the file list through the network platform communication unit 130. For a clearer understanding of the file list providing unit 150, see FIG. 4, which shows the interface as operated in LINE. The user of the smartphone 420 connects to the system through LINE. With the LINE API integrated, the user can find the data (files) he or she wants by "chatting" with the system. First, the user sends the query phrase 「有 facebook 的資料嗎?」 ("Do you have data on facebook?"). The keyword source is clearly missing from this query phrase, so the statement parsing unit 140 issues the guide sentence 「請問是與 facebook 相關的檔案內容、檔名還是元資料?」 ("Do you mean file content, file names, or metadata related to facebook?") and asks the user to decide. Once the user confirms that it is 內容 ("content", a synonym of file content), the file list providing unit 150 looks up the table and sends all matching files to the user as a file list. The file list may be long, in which case it is sent in batches. The user can tap the name of the desired file, and the corresponding file is then sent to the user through LINE.
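The exchange above, a guide sentence when the keyword source is missing and a batched file list otherwise, can be sketched as a single decision function. The function name `build_reply`, the reply dictionary shape, the English guide text, the `batch_size` default, and the plain nested-dict table are assumptions for this example; no real LINE API call is shown.

```python
from typing import Dict, List, Optional, Set

# file name -> keyword source -> keywords (hypothetical in-memory table shape)
Table = Dict[str, Dict[str, Set[str]]]


def build_reply(keyword: str, source: Optional[str], table: Table,
                batch_size: int = 5) -> dict:
    """Return a guide sentence if the source is unknown, else a batched file list."""
    if source is None:
        return {
            "type": "guide",
            "text": (f"Do you mean file content, file names, or metadata "
                     f"related to {keyword}?"),
        }
    matches = sorted(name for name, by_source in table.items()
                     if keyword in by_source.get(source, set()))
    # Split a long list into batches so it can be sent in several messages.
    batches: List[List[str]] = [matches[i:i + batch_size]
                                for i in range(0, len(matches), batch_size)]
    return {"type": "file_list", "batches": batches}


if __name__ == "__main__":
    demo: Table = {
        "a.doc": {"file content": {"facebook"}},
        "b.pdf": {"file content": {"facebook"}, "file name": {"facebook"}},
        "c.xls": {"file content": {"butterfly"}},
    }
    print(build_reply("facebook", None, demo)["text"])
    print(build_reply("facebook", "file content", demo, batch_size=1))
```

Returning a tagged dictionary keeps the decision separate from delivery, so whichever platform connector is in use can render the guide sentence or each batch in its own message format.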

The sentence learning unit 160 uses a machine learning algorithm to learn a plurality of equivalent text combinations for the text that remains after the query noun keyword and the keyword source are removed from the query phrase, and uses these equivalent text combinations to help determine the query noun keyword, the keyword source, and/or the meaning of the query phrase. Refer again to FIG. 5. The first through sixth columns are all natural-language ways of searching for the query noun keyword facebook. After the query noun keyword and the keyword source (if any) are removed, what remains is 「」 (empty), 「有的相關資料嗎?」 (roughly "is there any related information on ...?"), 「我要的」 (roughly "the ... I want"), 「 or 」, 「 」 (a space), and 「 in as 」. All of these are equivalent text combinations that express an intent to search. Once these equivalent text combinations are stripped from a query phrase, what is left is more likely to be the intended query noun keyword and keyword source. The seventh column of FIG. 5 carries the meaning of excluding certain content from the query, so the text that remains after its query noun keyword and keyword source are removed can be placed in a separate group of equivalent text combinations. Subdividing equivalent text combinations by user intent in this way makes it easier to understand what the user really wants. In practice, the machine learning algorithm may be the TF-IDF algorithm.
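The source names TF-IDF but does not describe how it is applied to the leftover text, so the sketch below only shows a plain TF-IDF computation over tokenized leftover phrases. The variant used here (term frequency as count divided by document length, inverse document frequency as log of N over document frequency), the function name `tf_idf`, and the English sample tokens are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Score every term of every tokenized document with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    # Hypothetical leftover phrases after keywords and sources are removed.
    leftovers = [["i", "want"], ["is", "there", "any"], ["i", "want", "to", "find"]]
    for row in tf_idf(leftovers):
        print(row)
```

Under this variant, a term that appears in every leftover phrase scores zero, while terms specific to one phrasing score higher, which is one plausible signal for grouping phrasings into equivalent text combinations.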

In another embodiment, the noun keyword database 122 may further store the corresponding synonymous noun keywords of each noun keyword. This lets a user who enters a single query noun keyword search on more related keywords. For example, when the query noun keyword is "臉書" (the Chinese name of Facebook), the corresponding synonymous noun keyword "facebook" is also used in the search. In this case, when the sentence parsing unit 140 extracts a query noun keyword, the file list providing unit 150 queries the comparison table with the query noun keyword, its synonymous noun keywords, and the keyword source, compiles the file names of all matching files into the file list, and sends the file list through the network platform communication unit 130. The system may, of course, first ask the user whether to accept this approach before proceeding.
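A minimal sketch of this synonym-expanded lookup is given below. The synonym map, the sample table, and the names `expand_keywords` and `search_with_synonyms` are hypothetical; the `use_synonyms` flag stands in for the optional step of asking the user whether to accept the expansion.

```python
# Hypothetical synonym map, as it might be stored alongside the noun keywords.
SYNONYMS = {"臉書": {"facebook"}, "facebook": {"臉書"}}

# Hypothetical comparison table: (file name, file noun keyword, keyword source).
TABLE = [
    ("fb_report.pdf", "facebook", "content"),
    ("臉書行銷.docx", "臉書", "content"),
    ("budget.xlsx", "line", "content"),
]


def expand_keywords(keyword: str, synonyms) -> set:
    """Return the keyword together with its stored synonymous noun keywords."""
    return {keyword} | synonyms.get(keyword, set())


def search_with_synonyms(table, keyword, source, synonyms, use_synonyms=True):
    """Look up files matching the keyword (and, optionally, its synonyms)."""
    terms = expand_keywords(keyword, synonyms) if use_synonyms else {keyword}
    return sorted({name for name, kw, src in table
                   if kw in terms and src == source})
```

With `use_synonyms=True`, a query for "臉書" also returns files indexed under "facebook"; with `use_synonyms=False`, only files indexed under "臉書" are returned.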

Although the present invention has been disclosed above by way of embodiments, these embodiments are not intended to limit the present invention. Anyone with ordinary knowledge in the relevant technical field may make minor changes and refinements without departing from the spirit and scope of the present invention. The scope of protection of the present invention is therefore defined by the appended claims.


FIG. 1 illustrates the components and operation of a document search system using text analysis results and natural language input according to an embodiment of the present invention; FIG. 2 is a flowchart of the steps of the method for extracting noun keywords and keyword sources; FIG. 3 illustrates the content of a file; FIG. 4 illustrates operation through the interface of a social app; and FIG. 5 is a list used to explain query sentences.

10: Server

110: Keyword extraction unit

120: Database module

121: File database

122: Noun keyword database

123: File keyword link database

130: Network platform communication unit

140: Sentence parsing unit

150: File list providing unit

160: Sentence learning unit

20: Network

30: Network platform server

300: Network platform

410: Desktop computer

420: Smartphone

Claims

A document search system using text analysis results and natural language input, comprising: a keyword extraction unit for obtaining, from a file containing text information, the character encoding corresponding to the text information, and extracting at least one file noun keyword from the character encoding; a database module, including: a file database storing a plurality of files; a noun keyword database storing a plurality of noun keywords; and a file keyword link database storing a comparison table of each file in the file database and the at least one file noun keyword extracted from it by the keyword extraction unit, wherein the comparison table additionally records the keyword source of every file noun keyword associated with the file; a network platform communication unit that connects through a network to a network platform server, receives a natural-language query sentence entered by a user of a network platform running on the network platform server, and sends a file list or a guiding sentence in response to the query sentence to the network platform server for delivery to the user; a sentence parsing unit for extracting at least one query noun keyword and the corresponding keyword source from the query sentence and, when the keyword source is missing, issuing the guiding sentence through the network platform communication unit to obtain the keyword source; and a file list providing unit for querying the comparison table with the query noun keyword and the keyword source, compiling the file names of all matching files into the file list, and sending the file list through the network platform communication unit; wherein the at least one file noun keyword extracted by the keyword extraction unit and the at least one query noun keyword extracted by the sentence parsing unit come from the noun keyword database.

The document search system using text analysis results and natural language input as described in claim 1, further comprising a sentence learning unit for learning, through a machine learning algorithm, a plurality of equivalent text combinations of the text remaining after the query noun keyword and the keyword source are removed from the query sentence, and for using the equivalent text combinations to assist in determining the query noun keyword, the keyword source, and/or the semantics of the query sentence.

The document search system using text analysis results and natural language input as described in claim 2, wherein the machine learning algorithm is a TF-IDF algorithm.

The document search system using text analysis results and natural language input as described in claim 1, wherein the noun keyword database further stores the corresponding synonymous noun keywords of each noun keyword.

The document search system using text analysis results and natural language input as described in claim 4, wherein, when the sentence parsing unit extracts a query noun keyword, the file list providing unit queries the comparison table with the query noun keyword, the synonymous noun keywords of the query noun keyword, and the keyword source, compiles the file names of all matching files into the file list, and sends the file list through the network platform communication unit.

The document search system using text analysis results and natural language input as described in claim 1, wherein the text information takes the form of character encoding, a text image, or a file with an embedded text image object.

The document search system using text analysis results and natural language input as described in claim 1, wherein text images and files with embedded text image objects are converted into the corresponding character encoding by optical character recognition (OCR).

The document search system using text analysis results and natural language input as described in claim 1, wherein the format of the file is Portable Document Format (PDF), a PowerPoint file format, a PowerPoint-compatible file format, a Word file format, a Word-compatible file format, an Excel file format, or an Excel-compatible file format.

The document search system using text analysis results and natural language input as described in claim 1, wherein the keyword source is the file name, the file content, or the file metadata.

The document search system using text analysis results and natural language input as described in claim 1, wherein the method for extracting the query noun keyword and the keyword source includes the steps of: (a) segmenting the query sentence into words; (b) assigning a part of speech to each segmented word; (c) classifying words that belong to the noun category and appear in the noun keyword database as query noun keywords; and (d) classifying words that belong to the noun category and are identical to or synonymous with one of a plurality of preset keyword sources as keyword sources.

The document search system using text analysis results and natural language input as described in claim 1, wherein the noun keywords come from the Chinese sentence structure tree database (Sinica Treebank) of Academia Sinica, Republic of China.