200844774 九、發明說明: 【發明所屬之技術領域】 本發明涉及一種資料庫清洗系統及方法。 【先前技術】 在現今企業資訊化進程的飛速發展中,企業資訊系統 資料庫的資料量呈爆炸式的增長。如果在資料庫早期的設 計中沒有進行科學合理的規劃設計,加上後期人工大量輸 . 入資料,造成資料庫中的許多核心資料經常出覌殘缺,編 碼格式混亂、亂碼及資料冗餘等現象。在對資料庫進行升 級時,造成大量資料錯誤出現、丟失,影響企業資訊化進 程。現代企業資訊系統升級頻率越來越高,對資料庫中資 料的正確性及完整性要求也越來越嚴格。 資料庫清洗是指為了保證資料庫中的資料符合標準 格式,對資料進行標準化的過程,目的在於減少資料庫中 所存儲的資料的錯誤和不一致性。以往設計者需要手動來 進行貧料庫清洗,對貧料庫格式進行整體轉換。這樣的人 工作業步驟繁瑣、容易出錯,並且效率低。 【發明内容】 鑒於以上内容,有必要提供一種資料庫清洗系統及方 法,其可自動對資料庫中的資料進行標準化處理。 一種貧料庫清洗糸統,包括主機及貧料庫,所述貢料 庫中存儲有資料及該資料的屬性,所述資料屬性包括資料 的編碼格式、字元格式及字元長短,所述主機設置有資料 轉換字典,用於定義髒資料及所述髒資料的修改規則,該 5 200844774 . 主機還包括:資料讀取模組,用於訪問所述資料庫.,獲取 所述資料庫的屬性,從資料庫中讀取一行資料,並獲取該 行貢料的屬性,編碼格式變更核組’用於在所述貢料讀取 模組讀取的資料的編碼格式不符合設定標準時,將該資料 的編碼格式變更為標準的編碼格式;髒資料修改模組,用 於在所述資料讀取模組讀取的資料含有髒資料時,根據所 述資料轉換字典的設置,按照所述髒資料的修改規則,將 所述資料中的髒資料修改為對應的新資料;資料標準化模 組,用於對所述資料讀取模組讀取的資料進行標準化處 理,將經過標準化處理後的新資料導入所述資料庫中。 一種資料庫清洗方法,包括如下步驟:設置資料轉換 字典,定義資料庫中的髒資料及所述髒資料的修改規則; 訪問所述資料庫,獲取所述資料庫的屬性,從資料庫中讀 取一行資料,並獲取該行資料的屬性,所述資料屬性包括 資料的編碼格式、字元格式及字元長短;若所述讀取的資 料的編碼格式不符合設定標準,則將該資料的編碼格式變 % 更為標準的編碼格式;若所述讀取的資料含有髒資料,則 根據所述資料轉換字典的設置,按照所述髒資料的修改規 則,將所述資料中的髒資料修改為對應的新資料;對所述 讀取的資料進行標準化處理,將經過標準化處理後的新資 料導入所述資料庫中;重複所述讀取資料、變更資料編碼 格式、修改髒資料及資料標準化的步驟,直到所述資料庫 中的所有資料處理完畢。 相較於習知技術,所述的資料庫清洗系統及方法,可 6 200844774 以自動將貝料庫中的貢料編碼格式轉換成標準的編碼格 式’將鱗貢料修改成新資料,並對資料庫中的資料進行標 準化處理,避免了手動清洗資料庫的繁瑣步驟,提高了資 料庫清洗的工作效率。 【實施方式】 麥閱圖1所示,係本發明資料庫清洗系統較佳實施例 的系統架構圖。該系統主要包括顯示器i、主機2、鍵盤3 及滑鼠4,所述主機2連接有資料庫5,所述資料庫5為需 要進行資料清洗的資料庫,存财㈣及所述f料的屬 性’所述資料屬性包括:資料編碼格式、字元格式及字元 長短卓。所述主機2中設置有資料轉換字典21,所述資料 =字典21用於定義騎料及所述癖資料的修改規則。所 it:資料庫5中存儲的不完整、錯誤的資料,該癖 1 =型可以由用戶自己來定義。所述修改規則可以是 透過該哈希函數,可以將所述«料轉換 血21的-新立貝料。芬閱圖3所示’係圖1所示資料轉換字 ❹料圖。所述資料轉換字典21中第—列的資料為 笞、了 列的貧料為新資料。透過所述哈希函數的運 :所述=卿5中的髒資料轉換成對應的新資料。假 ΐ種二“函數為Η(Χ) = Υ,其中Η代表轉換規則(即 料轉換函數),义代表舊資料,¥代表新資料。如 f在貝料轉換字典21中定義了 ( 為髒資料’並且定義該辦資 又工0 _中的KM P ㈣切所述髒a 格修改纟_,,,透過該哈希函數的轉換,所述 7 200844774 . 資料庫5中所有的髒資料a@b將修改為a-b。 . 所述主機2用於將所述資料庫5中的資料編碼格式轉 換成標準的編碼格式,將髒資料修改成新資料,並對資料 庫5中的資料進行標準化處理。所述主機2連接有顯示器 1、鍵盤3及滑鼠4,用作對資料庫5清洗過程中的輸入、 輸出設備。 參閱圖2所示,係圖1所示主機2的功能模組圖。所 述主機2包括資料讀取模組210,編碼格式變更模組211, 髒資料修改模組212及資料標準化模組213。本發明所稱 的模組是完成一特定功能的電腦程式段,比程式更適合於 描述軟體在電腦中的執行過程,因此在本發明以下對軟體 描述中都以模組描述。 其中,資料讀取模組210用於訪問所述資料庫5,獲 取所述資料庫5的屬性(資料庫版本、類型等),從資料庫 5中讀取一行資料(從第一行開始依次往下讀取),並獲取 該行資料的屬性。其中,所述資料屬性包括:資料編碼格 式、字元格式及字元長短等。 編碼格式變更模組211用於判斷所述資料讀取模組 210讀取的資料的編碼格式是否符合標準。如果所述資料 的編碼格式不符合標準’則將該貧料的編碼格式變更為標 準的編碼格式。所述資料編碼格式的標準可以由用戶自己 定義,例如用戶定義資料編碼格式的標準為UTF-8,如果 貢料讀取模組210 Ί買取的貧料編碼格式不是UTF-8 ’則編 碼格式變更模組211會將該資料的編碼格式變更為 200844774 . UTF-8。 髒資料修改模組212用於根據所述資料轉換字典21 的設置,判斷所述資料讀取模組210讀取的資料是否含有 髒資料。如果所述資料中含有髒資料,則根據元素化分離 識別演算法,將所述資料進行分解,按照所述髒資料的修 改規則,將其中的髒資料修改為對應的新資料。 資料標準化模組213用於對所述資料讀取模組210讀 取的資料進行標準化處理,將經過標準化處理後的新資料 導入所述資料庫5中。具體而言,如果所述資料讀取模組 210讀取資料為字元,則資料標準化模組213判斷所述字 元的格式是否符合標準。如果所述字元的格式不符合標 準,則將該字元格式變更為標準的字元格式。所述字元格 式的標準可以由用戶自己定義,例如用戶定義字元格式的 標準為8個位元組長,如果資料讀取模組210讀取的資料 字元格式不是8個位元組長,則資料標準化模組213會將 該資料的字元格式變更為8個位元組長。如果所述資料讀 取模組210讀取資料為整數,則資料標準化模組213判斷 所述整數的範圍是否符合標準。如果所述整數的範圍不符 合標準,則將該整數變更為符合標準範圍的整數。所述整 數範圍的標準可以由用戶自己定義,例如用戶定義整數範 圍的標準為0-255,如果資料讀取模組210讀取的整數為 -1,則資料標準化模組213會將該整數-1變更為+1或其他 用戶定義的數值。然後,資料標準化模組213對修改後的 新資料進行一致性檢測與修復,並對所述修改後的新資料 9 200844774 • 進行壓縮,將壓縮後的新資料導入所述資料庫5中.。 資料標準化模組213還用於判斷所述資料庫5中的資 料是否讀取完畢,如果沒有讀取完畢,則資料讀取模組210 讀取資料庫5中的下一行的資料。 參閱圖4所示,是本發明資料庫清洗方法較佳實施例 的流程圖。首先,步驟S401,資料讀取模組210訪問所述 資料庫5,獲取所述資料庫5的屬性(資料庫版本、類型 等)。 步驟S402,資料讀取模組210從所述資料庫5中讀取 一行資料(從第一行開始依次往下讀取),並獲取該行資料 的屬性。其中5所述資料屬性包括·貢料編碼格式、字元 格式及字元長短等。 步驟S403,編碼格式變更模組211判斷所述資料讀取 模組210讀取的資料的編碼格式是否符合標準。如果所述 資料的編碼格式符合標準,則流程直接轉到步驟S405。 、 如果所述資料的編碼格式不符合標準,步驟S404,編 碼格式變更模組211將該資料的編碼格式變更為標準的編 碼格式。所述資料編碼格式的標準可以由用戶自己定義, 例如用戶定義資料編碼格式的標準為UTF-8,如果資料讀 取模組210讀取的資料編碼格式不是UTF-8,則編碼格式 變更模組211會將該資料的編碼格式變更為UTF-8。 步驟S405,髒資料修改模組212根據所述資料轉換字 典21的設置,判斷所述資料讀取模組210讀取的資料是否 含有髒資料。如果所述資料中不含髒資料,則流程直接轉 10 200844774 到步驟S407。所述資料轉換字典21用於定義髒資料及所 述髒資料的修改規則。所述髒資料即資料庫5中存儲的不 完整、錯誤的資料,該髒資料的類型可以由用戶自己來定 義。所述修改規則可以是一種哈希函數,透過該哈希函數, 可以將所述髒資料轉換成正確的新資料。如圖3所示,所 述資料轉換字典21中第一列的資料為髒資料,第二列的資 料為新資料。透過所述哈希函數的運算,可以將資料庫5 中的髒資料轉換成對應的新資料。假設所述哈希函數為Η (Χ) = Υ,其中Η代表轉換規則(即某一種資料轉換函數), X代表舊資料,Υ代表新資料。如果用戶在資料轉換字典 21中定義了 cde為髒資料,並且定義該髒資料的轉換規則 Η為將所述髒資料cde修改為dec,透過該哈希函數的轉 換,所述資料庫5中所有的髒資料cde將修改為dec。 如果所述資料中含有髒資料,步驟S406,髒資料修改 模組212根據元素化分離識別演鼻法’將所述貧料進行分 解,按照所述髒資料的修改規則,將其中的髒資料修改為 對應的新數。 步驟S407,資料標準化模組213對所述資料讀取模組 210讀取的資料進行標準化處理,將經過標準化處理後的 新資料導入所述資料庫5中。 具體而言,如果所述資料讀取模組210讀取資料為字 元,則資料標準化模組213判斷所述字元的格式是否符合 標準。如果所述字元的格式不符合標準,則將該字元格式 變更為標準的字元格式。所述字元格式的標準可以由用戶 11 200844774 . 自己定義,例如用戶定義字元格式的標準為8個位元組 長,如果資料讀取模組210讀取的資料字元格式不是8個 位元組長,則資料標準化模組213會將該資料的字元格式 變更為8個位元組長。如果所述資料讀取模組210讀取資 料為整數,則資料標準化模組213判斷所述整數的範圍是 否符合標準。如果所述整數的範圍不符合標準,則將該整 數變更為符合標準範圍的整數。所述整數範圍的標準可以 由用戶自己定義,例如用戶定義整數範圍的標準為0-255, 如果資料讀取模組210讀取的整數為-1,則資料標準化模 組213會將該整數-1變更為+1或其他用戶定義的數值。然 後,資料標準化模組213對修改後的新資料進行一致性檢 測與修復,並對所述修改後的新資料進行壓縮,將壓縮後 的新資料導入所述資料庫5中。 步驟S408,資料標準化模組213判斷所述資料庫5中 的資料是否讀取完畢,如果沒有讀取完畢,則流程回到步 驟S402。如果所述資料庫5中的資料讀取完畢,則流程結 束。 本發明資料庫清洗系統及方法,雖以較佳實施例揭露 如上,然其並非用以限定本發明。任何熟悉此項技藝之人 士,在不脫離本發明之精神及範圍内,當可做更動與潤飾, 因此本發明之保護範圍當視後附之申請專利範圍所界定者 為準。 【圖式簡單說明】 圖1係本發明資料庫清洗系統較佳實施例的硬體架構 12 200844774 . 圖。 圖2係圖1所示主機的功能模組圖。 圖3係圖1所示資料轉換字典的示意圖。 圖4係本發明資料庫清洗方法較佳實施例的流程圖。 【主要元件符號說明】 顯示器 1 主機 2 鍵盤 3 滑鼠 4 資料庫 5 資料轉換字典 21 資料讀取模組 210 編碼格式變更模組 211 髒資料修改模組 212 資料標準化模組 213 13200844774 IX. Description of the Invention: [Technical Field of the Invention] The present invention relates to a database cleaning system and method. [Prior Art] In the rapid development of today's enterprise informationization process, the amount of information in the enterprise information system database has exploded. If there is no scientific and reasonable planning and design in the early design of the database, and the large amount of artificial input and data in the later stage, many core data in the database are often defective, the coding format is confusing, garbled and data redundancy. . When the database is upgraded, a large amount of data errors occur and are lost, which affects the enterprise informationization process. Modern enterprise information systems are being upgraded more frequently, and the requirements for the correctness and integrity of the data in the database are becoming stricter. Database cleaning refers to the process of standardizing data in order to ensure that the data in the database conforms to the standard format, with the aim of reducing errors and inconsistencies in the data stored in the database. In the past, designers had to manually clean the poor storage and perform an overall conversion of the poor storage format. Such people are cumbersome, error-prone, and inefficient. SUMMARY OF THE INVENTION In view of the above, it is necessary to provide a database cleaning system and method that can automatically standardize data in a database. A poor storage cleaning system includes a host and a poor storage library, wherein the tributary library stores data and attributes of the data, the data attributes including a data encoding format, a character format, and a character length, The host device is provided with a data conversion dictionary for defining the dirty data and the modification rule of the dirty data, and the host further includes: a data reading module, configured to access the database, and obtain the database. Attribute, read a line of data from the database, and obtain the attributes of the line tribute, the encoding format change core group 'for the encoding format of the data read by the tributary reading module does not meet the set standard, will The encoding format of the data is changed to a standard encoding format; the dirty data modifying module is configured to: when the data read by the data reading module contains dirty data, according to the setting of the data conversion dictionary, according to the dirty The data modification rule modifies the dirty data in the data to the corresponding new data; the data standardization module is configured to standardize the data read by the data reading module, The newly processed data is imported into the database. A database cleaning method includes the following steps: setting a data conversion dictionary, defining dirty data in the database and modifying rules of the dirty data; accessing the database, obtaining attributes of the database, and reading from the database Taking a row of data and obtaining attributes of the data of the row, the data attribute includes a coding format of the data, a character format, and a character length; if the encoded format of the read data does not meet the set standard, the data is The encoding format is changed to a more standard encoding format; if the read data contains dirty data, according to the setting of the data conversion dictionary, the dirty data in the data is modified according to the modification rule of the dirty data. Corresponding new data; normalizing the read data, importing the standardized new data into the database; repeating the reading data, changing the data encoding format, modifying the dirty data, and standardizing the data The steps until all the data in the database has been processed. Compared with the prior art, the database cleaning system and method can automatically convert the metric coding format in the billet library into a standard encoding format to modify the metrics into new materials, and The data in the database is standardized, which avoids the cumbersome steps of manually cleaning the database and improves the efficiency of database cleaning. [Embodiment] Fig. 1 is a system architecture diagram of a preferred embodiment of the database cleaning system of the present invention. The system mainly includes a display i, a host 2, a keyboard 3 and a mouse 4. The host 2 is connected with a database 5, and the database 5 is a database for data cleaning, and the deposit (4) and the f material are The attribute attributes of the attribute include: data encoding format, character format and character length. A data conversion dictionary 21 is provided in the host 2, and the data = dictionary 21 is used to define a modification rule of the riding material and the data. It: Incomplete and erroneous data stored in the database 5, the 癖 1 = type can be defined by the user himself. The modification rule may be that the material can be converted into a new material by the hash function. See Figure 3 for the data conversion word shown in Figure 1. The data in the first column of the data conversion dictionary 21 is 笞, and the poor materials in the column are new data. Through the operation of the hash function: the dirty data in the = Qing 5 is converted into corresponding new data. False 二 2 "The function is Η(Χ) = Υ, where Η represents the conversion rule (ie, the material conversion function), the meaning represents the old data, and the ¥ represents the new data. For example, f is defined in the bedding conversion dictionary 21 (for dirty) The data 'and define the KM P in the _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ @b will be modified to ab. The host 2 is used to convert the data encoding format in the database 5 into a standard encoding format, modify the dirty data into new data, and standardize the data in the database 5. The host 2 is connected with a display 1, a keyboard 3 and a mouse 4, and is used as an input and output device in the cleaning process of the database 5. Referring to FIG. 2, it is a functional module diagram of the host 2 shown in FIG. The host 2 includes a data reading module 210, an encoding format changing module 211, a dirty data modifying module 212, and a data standardizing module 213. The module referred to in the present invention is a computer program segment that performs a specific function. More suitable than the program to describe the execution process of the software in the computer Therefore, in the following description of the software in the present invention, the module description is used. The data reading module 210 is used to access the database 5, and obtain the attributes (database version, type, etc.) of the database 5, Reading a row of data from the database 5 (reading down from the first row), and obtaining attributes of the data of the row, wherein the data attributes include: data encoding format, character format, character length, etc. The encoding format change module 211 is configured to determine whether the encoding format of the data read by the data reading module 210 conforms to the standard. If the encoding format of the data does not conform to the standard, the encoding format of the poor material is changed to The standard encoding format. The standard of the data encoding format can be defined by the user. For example, the standard of the user-defined data encoding format is UTF-8. If the metric reading module 210 Ί buys the poor encoding format is not UTF-8. The code format change module 211 changes the code format of the data to 200844774. UTF-8. The dirty data modification module 212 is configured to determine according to the setting of the data conversion dictionary 21. Destroying whether the data read by the data reading module 210 contains dirty data. If the data contains dirty data, the data is decomposed according to the elemental separation and recognition algorithm, according to the modification of the dirty data. The rule is to modify the dirty data into the corresponding new data. The data standardization module 213 is configured to perform normalization processing on the data read by the data reading module 210, and import the newly processed data into the data. In the library 5. Specifically, if the data reading module 210 reads the data as a character, the data normalization module 213 determines whether the format of the character conforms to the standard. If the format of the character does not match Standard, the character format is changed to the standard character format. The standard of the character format can be defined by the user. For example, the standard of the user-defined character format is 8 bytes long. If the data character format read by the data reading module 210 is not 8 bytes long, then The data standardization module 213 changes the character format of the data to 8 bytes. If the data reading module 210 reads the data as an integer, the data normalization module 213 determines whether the range of the integer conforms to the standard. If the range of the integer does not conform to the standard, the integer is changed to an integer that conforms to the standard range. The standard of the integer range may be defined by the user. For example, the standard of the user-defined integer range is 0-255. If the integer read by the data reading module 210 is -1, the data normalization module 213 will use the integer- 1 Change to +1 or other user-defined values. Then, the data standardization module 213 performs consistency detection and repair on the modified new data, and compresses the modified new data 9 200844774 • and imports the compressed new data into the database 5. The data standardization module 213 is further configured to determine whether the data in the database 5 has been read. If the reading is not completed, the data reading module 210 reads the data of the next row in the database 5. Referring to Figure 4, there is shown a flow chart of a preferred embodiment of the database cleaning method of the present invention. First, in step S401, the data reading module 210 accesses the database 5 to acquire the attributes (database version, type, etc.) of the database 5. In step S402, the data reading module 210 reads a line of data from the database 5 (reads down from the first line) and acquires attributes of the line data. Among them, the data attributes include the tribute coding format, the character format and the length of the characters. In step S403, the encoding format changing module 211 determines whether the encoding format of the data read by the data reading module 210 conforms to the standard. If the encoding format of the material conforms to the standard, the flow proceeds directly to step S405. If the encoding format of the data does not conform to the standard, in step S404, the encoding format changing module 211 changes the encoding format of the data to a standard encoding format. The standard of the data encoding format can be defined by the user. For example, the standard of the user-defined data encoding format is UTF-8. If the data encoding format read by the data reading module 210 is not UTF-8, the encoding format changing module 211 will change the encoding format of the material to UTF-8. In step S405, the dirty data modification module 212 determines whether the data read by the data reading module 210 contains dirty data according to the setting of the data conversion dictionary 21. If the data does not contain dirty data, the flow proceeds directly to 10 200844774 to step S407. The data conversion dictionary 21 is used to define dirty data and modification rules of the dirty data. The dirty data is incomplete and erroneous data stored in the database 5, and the type of the dirty data can be defined by the user himself. The modification rule may be a hash function through which the dirty data can be converted into correct new data. As shown in Fig. 3, the data in the first column of the data conversion dictionary 21 is dirty data, and the data in the second column is new data. Through the operation of the hash function, the dirty data in the database 5 can be converted into corresponding new data. Suppose the hash function is Η (Χ) = Υ, where Η represents the conversion rule (ie, a data conversion function), X represents the old data, and Υ represents the new data. If the user defines cde as dirty data in the data conversion dictionary 21, and the conversion rule for defining the dirty data is to modify the dirty data cde to dec, through the conversion of the hash function, all the data in the database 5 The dirty data cde will be modified to dec. If the data contains dirty data, in step S406, the dirty data modification module 212 decomposes the poor material according to the elemental separation and recognition method, and modifies the dirty data according to the modification rule of the dirty data. For the corresponding new number. In step S407, the data standardization module 213 normalizes the data read by the data reading module 210, and introduces the standardized new data into the database 5. Specifically, if the data reading module 210 reads the data as a character, the data normalization module 213 determines whether the format of the character conforms to the standard. If the format of the character does not conform to the standard, the character format is changed to the standard character format. The standard of the character format can be defined by the user 11 200844774. The standard of the user-defined character format is 8 bytes long. If the data character format read by the data reading module 210 is not 8 bits. The leader of the data standardization module 213 will change the character format of the data to 8 bytes. If the data reading module 210 reads the data as an integer, the data normalization module 213 determines whether the range of the integers conforms to the standard. If the range of integers does not meet the criteria, the integer is changed to an integer that conforms to the standard range. The standard of the integer range can be defined by the user. For example, the standard of the user-defined integer range is 0-255. If the integer read by the data reading module 210 is -1, the data normalization module 213 will use the integer- 1 Change to +1 or other user-defined values. Then, the data standardization module 213 performs consistency detection and repair on the modified new data, compresses the modified new data, and imports the compressed new data into the database 5. In step S408, the data standardization module 213 determines whether the data in the database 5 has been read. If the reading is not completed, the flow returns to step S402. If the data in the database 5 is read, the flow ends. The database cleaning system and method of the present invention are disclosed above in the preferred embodiments, and are not intended to limit the present invention. Any person skilled in the art will be able to make modifications and refinements without departing from the spirit and scope of the invention, and the scope of the invention is defined by the scope of the appended claims. BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a hardware architecture of a preferred embodiment of the database cleaning system of the present invention. FIG. 2 is a functional module diagram of the host shown in FIG. 1. FIG. 3 is a schematic diagram of the data conversion dictionary shown in FIG. 1. 4 is a flow chart of a preferred embodiment of the database cleaning method of the present invention. [Main component symbol description] Display 1 Host 2 Keyboard 3 Mouse 4 Database 5 Data conversion dictionary 21 Data reading module 210 Encoding format change module 211 Dirty data modification module 212 Data standardization module 213 13