TW200844774A - System and method for updating a database - Google Patents

System and method for updating a database Download PDF

Info

Publication number
TW200844774A
TW200844774A TW96116802A TW96116802A TW200844774A TW 200844774 A TW200844774 A TW 200844774A TW 96116802 A TW96116802 A TW 96116802A TW 96116802 A TW96116802 A TW 96116802A TW 200844774 A TW200844774 A TW 200844774A
Authority
TW
Taiwan
Prior art keywords
data
database
dirty
format
module
Prior art date
Application number
TW96116802A
Other languages
Chinese (zh)
Other versions
TWI402697B (en
Inventor
Hao Yan
Original Assignee
Hon Hai Prec Ind Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hon Hai Prec Ind Co Ltd filed Critical Hon Hai Prec Ind Co Ltd
Priority to TW96116802A priority Critical patent/TWI402697B/en
Publication of TW200844774A publication Critical patent/TW200844774A/en
Application granted granted Critical
Publication of TWI402697B publication Critical patent/TWI402697B/en

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a system for updating a database. The system includes a computer and a database, the stores data and attribute of the data, the computer includes: a data reading module for reading data in line from the database, and obtaining the attribute of the data; a coded format modifying module for changing coded format of the data to standard coded format; a dirty data modifying module for changing dirty data in the database to new data; and a data standardization module for doing data standardization of the data read from the database. A method for updating a database is also provided. The invention can process data standardization of a database automatically.

Description

200844774 九、發明說明: 【發明所屬之技術領域】 本發明涉及一種資料庫清洗系統及方法。 【先前技術】 在現今企業資訊化進程的飛速發展中,企業資訊系統 資料庫的資料量呈爆炸式的增長。如果在資料庫早期的設 計中沒有進行科學合理的規劃設計,加上後期人工大量輸 . 入資料,造成資料庫中的許多核心資料經常出覌殘缺,編 碼格式混亂、亂碼及資料冗餘等現象。在對資料庫進行升 級時,造成大量資料錯誤出現、丟失,影響企業資訊化進 程。現代企業資訊系統升級頻率越來越高,對資料庫中資 料的正確性及完整性要求也越來越嚴格。 資料庫清洗是指為了保證資料庫中的資料符合標準 格式,對資料進行標準化的過程,目的在於減少資料庫中 所存儲的資料的錯誤和不一致性。以往設計者需要手動來 進行貧料庫清洗,對貧料庫格式進行整體轉換。這樣的人 工作業步驟繁瑣、容易出錯,並且效率低。 【發明内容】 鑒於以上内容,有必要提供一種資料庫清洗系統及方 法,其可自動對資料庫中的資料進行標準化處理。 一種貧料庫清洗糸統,包括主機及貧料庫,所述貢料 庫中存儲有資料及該資料的屬性,所述資料屬性包括資料 的編碼格式、字元格式及字元長短,所述主機設置有資料 轉換字典,用於定義髒資料及所述髒資料的修改規則,該 5 200844774 . 主機還包括:資料讀取模組,用於訪問所述資料庫.,獲取 所述資料庫的屬性,從資料庫中讀取一行資料,並獲取該 行貢料的屬性,編碼格式變更核組’用於在所述貢料讀取 模組讀取的資料的編碼格式不符合設定標準時,將該資料 的編碼格式變更為標準的編碼格式;髒資料修改模組,用 於在所述資料讀取模組讀取的資料含有髒資料時,根據所 述資料轉換字典的設置,按照所述髒資料的修改規則,將 所述資料中的髒資料修改為對應的新資料;資料標準化模 組,用於對所述資料讀取模組讀取的資料進行標準化處 理,將經過標準化處理後的新資料導入所述資料庫中。 一種資料庫清洗方法,包括如下步驟:設置資料轉換 字典,定義資料庫中的髒資料及所述髒資料的修改規則; 訪問所述資料庫,獲取所述資料庫的屬性,從資料庫中讀 取一行資料,並獲取該行資料的屬性,所述資料屬性包括 資料的編碼格式、字元格式及字元長短;若所述讀取的資 料的編碼格式不符合設定標準,則將該資料的編碼格式變 % 更為標準的編碼格式;若所述讀取的資料含有髒資料,則 根據所述資料轉換字典的設置,按照所述髒資料的修改規 則,將所述資料中的髒資料修改為對應的新資料;對所述 讀取的資料進行標準化處理,將經過標準化處理後的新資 料導入所述資料庫中;重複所述讀取資料、變更資料編碼 格式、修改髒資料及資料標準化的步驟,直到所述資料庫 中的所有資料處理完畢。 相較於習知技術,所述的資料庫清洗系統及方法,可 6 200844774 以自動將貝料庫中的貢料編碼格式轉換成標準的編碼格 式’將鱗貢料修改成新資料,並對資料庫中的資料進行標 準化處理,避免了手動清洗資料庫的繁瑣步驟,提高了資 料庫清洗的工作效率。 【實施方式】 麥閱圖1所示,係本發明資料庫清洗系統較佳實施例 的系統架構圖。該系統主要包括顯示器i、主機2、鍵盤3 及滑鼠4,所述主機2連接有資料庫5,所述資料庫5為需 要進行資料清洗的資料庫,存财㈣及所述f料的屬 性’所述資料屬性包括:資料編碼格式、字元格式及字元 長短卓。所述主機2中設置有資料轉換字典21,所述資料 =字典21用於定義騎料及所述癖資料的修改規則。所 it:資料庫5中存儲的不完整、錯誤的資料,該癖 1 =型可以由用戶自己來定義。所述修改規則可以是 透過該哈希函數,可以將所述«料轉換 血21的-新立貝料。芬閱圖3所示’係圖1所示資料轉換字 ❹料圖。所述資料轉換字典21中第—列的資料為 笞、了 列的貧料為新資料。透過所述哈希函數的運 :所述=卿5中的髒資料轉換成對應的新資料。假 ΐ種二“函數為Η(Χ) = Υ,其中Η代表轉換規則(即 料轉換函數),义代表舊資料,¥代表新資料。如 f在貝料轉換字典21中定義了 ( 為髒資料’並且定義該辦資 又工0 _中的KM P ㈣切所述髒a 格修改纟_,,,透過該哈希函數的轉換,所述 7 200844774 . 資料庫5中所有的髒資料a@b將修改為a-b。 . 所述主機2用於將所述資料庫5中的資料編碼格式轉 換成標準的編碼格式,將髒資料修改成新資料,並對資料 庫5中的資料進行標準化處理。所述主機2連接有顯示器 1、鍵盤3及滑鼠4,用作對資料庫5清洗過程中的輸入、 輸出設備。 參閱圖2所示,係圖1所示主機2的功能模組圖。所 述主機2包括資料讀取模組210,編碼格式變更模組211, 髒資料修改模組212及資料標準化模組213。本發明所稱 的模組是完成一特定功能的電腦程式段,比程式更適合於 描述軟體在電腦中的執行過程,因此在本發明以下對軟體 描述中都以模組描述。 其中,資料讀取模組210用於訪問所述資料庫5,獲 取所述資料庫5的屬性(資料庫版本、類型等),從資料庫 5中讀取一行資料(從第一行開始依次往下讀取),並獲取 該行資料的屬性。其中,所述資料屬性包括:資料編碼格 式、字元格式及字元長短等。 編碼格式變更模組211用於判斷所述資料讀取模組 210讀取的資料的編碼格式是否符合標準。如果所述資料 的編碼格式不符合標準’則將該貧料的編碼格式變更為標 準的編碼格式。所述資料編碼格式的標準可以由用戶自己 定義,例如用戶定義資料編碼格式的標準為UTF-8,如果 貢料讀取模組210 Ί買取的貧料編碼格式不是UTF-8 ’則編 碼格式變更模組211會將該資料的編碼格式變更為 200844774 . UTF-8。 髒資料修改模組212用於根據所述資料轉換字典21 的設置,判斷所述資料讀取模組210讀取的資料是否含有 髒資料。如果所述資料中含有髒資料,則根據元素化分離 識別演算法,將所述資料進行分解,按照所述髒資料的修 改規則,將其中的髒資料修改為對應的新資料。 資料標準化模組213用於對所述資料讀取模組210讀 取的資料進行標準化處理,將經過標準化處理後的新資料 導入所述資料庫5中。具體而言,如果所述資料讀取模組 210讀取資料為字元,則資料標準化模組213判斷所述字 元的格式是否符合標準。如果所述字元的格式不符合標 準,則將該字元格式變更為標準的字元格式。所述字元格 式的標準可以由用戶自己定義,例如用戶定義字元格式的 標準為8個位元組長,如果資料讀取模組210讀取的資料 字元格式不是8個位元組長,則資料標準化模組213會將 該資料的字元格式變更為8個位元組長。如果所述資料讀 取模組210讀取資料為整數,則資料標準化模組213判斷 所述整數的範圍是否符合標準。如果所述整數的範圍不符 合標準,則將該整數變更為符合標準範圍的整數。所述整 數範圍的標準可以由用戶自己定義,例如用戶定義整數範 圍的標準為0-255,如果資料讀取模組210讀取的整數為 -1,則資料標準化模組213會將該整數-1變更為+1或其他 用戶定義的數值。然後,資料標準化模組213對修改後的 新資料進行一致性檢測與修復,並對所述修改後的新資料 9 200844774 • 進行壓縮,將壓縮後的新資料導入所述資料庫5中.。 資料標準化模組213還用於判斷所述資料庫5中的資 料是否讀取完畢,如果沒有讀取完畢,則資料讀取模組210 讀取資料庫5中的下一行的資料。 參閱圖4所示,是本發明資料庫清洗方法較佳實施例 的流程圖。首先,步驟S401,資料讀取模組210訪問所述 資料庫5,獲取所述資料庫5的屬性(資料庫版本、類型 等)。 步驟S402,資料讀取模組210從所述資料庫5中讀取 一行資料(從第一行開始依次往下讀取),並獲取該行資料 的屬性。其中5所述資料屬性包括·貢料編碼格式、字元 格式及字元長短等。 步驟S403,編碼格式變更模組211判斷所述資料讀取 模組210讀取的資料的編碼格式是否符合標準。如果所述 資料的編碼格式符合標準,則流程直接轉到步驟S405。 、 如果所述資料的編碼格式不符合標準,步驟S404,編 碼格式變更模組211將該資料的編碼格式變更為標準的編 碼格式。所述資料編碼格式的標準可以由用戶自己定義, 例如用戶定義資料編碼格式的標準為UTF-8,如果資料讀 取模組210讀取的資料編碼格式不是UTF-8,則編碼格式 變更模組211會將該資料的編碼格式變更為UTF-8。 步驟S405,髒資料修改模組212根據所述資料轉換字 典21的設置,判斷所述資料讀取模組210讀取的資料是否 含有髒資料。如果所述資料中不含髒資料,則流程直接轉 10 200844774 到步驟S407。所述資料轉換字典21用於定義髒資料及所 述髒資料的修改規則。所述髒資料即資料庫5中存儲的不 完整、錯誤的資料,該髒資料的類型可以由用戶自己來定 義。所述修改規則可以是一種哈希函數,透過該哈希函數, 可以將所述髒資料轉換成正確的新資料。如圖3所示,所 述資料轉換字典21中第一列的資料為髒資料,第二列的資 料為新資料。透過所述哈希函數的運算,可以將資料庫5 中的髒資料轉換成對應的新資料。假設所述哈希函數為Η (Χ) = Υ,其中Η代表轉換規則(即某一種資料轉換函數), X代表舊資料,Υ代表新資料。如果用戶在資料轉換字典 21中定義了 cde為髒資料,並且定義該髒資料的轉換規則 Η為將所述髒資料cde修改為dec,透過該哈希函數的轉 換,所述資料庫5中所有的髒資料cde將修改為dec。 如果所述資料中含有髒資料,步驟S406,髒資料修改 模組212根據元素化分離識別演鼻法’將所述貧料進行分 解,按照所述髒資料的修改規則,將其中的髒資料修改為 對應的新數。 步驟S407,資料標準化模組213對所述資料讀取模組 210讀取的資料進行標準化處理,將經過標準化處理後的 新資料導入所述資料庫5中。 具體而言,如果所述資料讀取模組210讀取資料為字 元,則資料標準化模組213判斷所述字元的格式是否符合 標準。如果所述字元的格式不符合標準,則將該字元格式 變更為標準的字元格式。所述字元格式的標準可以由用戶 11 200844774 . 自己定義,例如用戶定義字元格式的標準為8個位元組 長,如果資料讀取模組210讀取的資料字元格式不是8個 位元組長,則資料標準化模組213會將該資料的字元格式 變更為8個位元組長。如果所述資料讀取模組210讀取資 料為整數,則資料標準化模組213判斷所述整數的範圍是 否符合標準。如果所述整數的範圍不符合標準,則將該整 數變更為符合標準範圍的整數。所述整數範圍的標準可以 由用戶自己定義,例如用戶定義整數範圍的標準為0-255, 如果資料讀取模組210讀取的整數為-1,則資料標準化模 組213會將該整數-1變更為+1或其他用戶定義的數值。然 後,資料標準化模組213對修改後的新資料進行一致性檢 測與修復,並對所述修改後的新資料進行壓縮,將壓縮後 的新資料導入所述資料庫5中。 步驟S408,資料標準化模組213判斷所述資料庫5中 的資料是否讀取完畢,如果沒有讀取完畢,則流程回到步 驟S402。如果所述資料庫5中的資料讀取完畢,則流程結 束。 本發明資料庫清洗系統及方法,雖以較佳實施例揭露 如上,然其並非用以限定本發明。任何熟悉此項技藝之人 士,在不脫離本發明之精神及範圍内,當可做更動與潤飾, 因此本發明之保護範圍當視後附之申請專利範圍所界定者 為準。 【圖式簡單說明】 圖1係本發明資料庫清洗系統較佳實施例的硬體架構 12 200844774 . 圖。 圖2係圖1所示主機的功能模組圖。 圖3係圖1所示資料轉換字典的示意圖。 圖4係本發明資料庫清洗方法較佳實施例的流程圖。 【主要元件符號說明】 顯示器 1 主機 2 鍵盤 3 滑鼠 4 資料庫 5 資料轉換字典 21 資料讀取模組 210 編碼格式變更模組 211 髒資料修改模組 212 資料標準化模組 213 13200844774 IX. Description of the Invention: [Technical Field of the Invention] The present invention relates to a database cleaning system and method. [Prior Art] In the rapid development of today's enterprise informationization process, the amount of information in the enterprise information system database has exploded. If there is no scientific and reasonable planning and design in the early design of the database, and the large amount of artificial input and data in the later stage, many core data in the database are often defective, the coding format is confusing, garbled and data redundancy. . When the database is upgraded, a large amount of data errors occur and are lost, which affects the enterprise informationization process. Modern enterprise information systems are being upgraded more frequently, and the requirements for the correctness and integrity of the data in the database are becoming stricter. Database cleaning refers to the process of standardizing data in order to ensure that the data in the database conforms to the standard format, with the aim of reducing errors and inconsistencies in the data stored in the database. In the past, designers had to manually clean the poor storage and perform an overall conversion of the poor storage format. Such people are cumbersome, error-prone, and inefficient. SUMMARY OF THE INVENTION In view of the above, it is necessary to provide a database cleaning system and method that can automatically standardize data in a database. A poor storage cleaning system includes a host and a poor storage library, wherein the tributary library stores data and attributes of the data, the data attributes including a data encoding format, a character format, and a character length, The host device is provided with a data conversion dictionary for defining the dirty data and the modification rule of the dirty data, and the host further includes: a data reading module, configured to access the database, and obtain the database. Attribute, read a line of data from the database, and obtain the attributes of the line tribute, the encoding format change core group 'for the encoding format of the data read by the tributary reading module does not meet the set standard, will The encoding format of the data is changed to a standard encoding format; the dirty data modifying module is configured to: when the data read by the data reading module contains dirty data, according to the setting of the data conversion dictionary, according to the dirty The data modification rule modifies the dirty data in the data to the corresponding new data; the data standardization module is configured to standardize the data read by the data reading module, The newly processed data is imported into the database. A database cleaning method includes the following steps: setting a data conversion dictionary, defining dirty data in the database and modifying rules of the dirty data; accessing the database, obtaining attributes of the database, and reading from the database Taking a row of data and obtaining attributes of the data of the row, the data attribute includes a coding format of the data, a character format, and a character length; if the encoded format of the read data does not meet the set standard, the data is The encoding format is changed to a more standard encoding format; if the read data contains dirty data, according to the setting of the data conversion dictionary, the dirty data in the data is modified according to the modification rule of the dirty data. Corresponding new data; normalizing the read data, importing the standardized new data into the database; repeating the reading data, changing the data encoding format, modifying the dirty data, and standardizing the data The steps until all the data in the database has been processed. Compared with the prior art, the database cleaning system and method can automatically convert the metric coding format in the billet library into a standard encoding format to modify the metrics into new materials, and The data in the database is standardized, which avoids the cumbersome steps of manually cleaning the database and improves the efficiency of database cleaning. [Embodiment] Fig. 1 is a system architecture diagram of a preferred embodiment of the database cleaning system of the present invention. The system mainly includes a display i, a host 2, a keyboard 3 and a mouse 4. The host 2 is connected with a database 5, and the database 5 is a database for data cleaning, and the deposit (4) and the f material are The attribute attributes of the attribute include: data encoding format, character format and character length. A data conversion dictionary 21 is provided in the host 2, and the data = dictionary 21 is used to define a modification rule of the riding material and the data. It: Incomplete and erroneous data stored in the database 5, the 癖 1 = type can be defined by the user himself. The modification rule may be that the material can be converted into a new material by the hash function. See Figure 3 for the data conversion word shown in Figure 1. The data in the first column of the data conversion dictionary 21 is 笞, and the poor materials in the column are new data. Through the operation of the hash function: the dirty data in the = Qing 5 is converted into corresponding new data. False 二 2 "The function is Η(Χ) = Υ, where Η represents the conversion rule (ie, the material conversion function), the meaning represents the old data, and the ¥ represents the new data. For example, f is defined in the bedding conversion dictionary 21 (for dirty) The data 'and define the KM P in the _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ @b will be modified to ab. The host 2 is used to convert the data encoding format in the database 5 into a standard encoding format, modify the dirty data into new data, and standardize the data in the database 5. The host 2 is connected with a display 1, a keyboard 3 and a mouse 4, and is used as an input and output device in the cleaning process of the database 5. Referring to FIG. 2, it is a functional module diagram of the host 2 shown in FIG. The host 2 includes a data reading module 210, an encoding format changing module 211, a dirty data modifying module 212, and a data standardizing module 213. The module referred to in the present invention is a computer program segment that performs a specific function. More suitable than the program to describe the execution process of the software in the computer Therefore, in the following description of the software in the present invention, the module description is used. The data reading module 210 is used to access the database 5, and obtain the attributes (database version, type, etc.) of the database 5, Reading a row of data from the database 5 (reading down from the first row), and obtaining attributes of the data of the row, wherein the data attributes include: data encoding format, character format, character length, etc. The encoding format change module 211 is configured to determine whether the encoding format of the data read by the data reading module 210 conforms to the standard. If the encoding format of the data does not conform to the standard, the encoding format of the poor material is changed to The standard encoding format. The standard of the data encoding format can be defined by the user. For example, the standard of the user-defined data encoding format is UTF-8. If the metric reading module 210 Ί buys the poor encoding format is not UTF-8. The code format change module 211 changes the code format of the data to 200844774. UTF-8. The dirty data modification module 212 is configured to determine according to the setting of the data conversion dictionary 21. Destroying whether the data read by the data reading module 210 contains dirty data. If the data contains dirty data, the data is decomposed according to the elemental separation and recognition algorithm, according to the modification of the dirty data. The rule is to modify the dirty data into the corresponding new data. The data standardization module 213 is configured to perform normalization processing on the data read by the data reading module 210, and import the newly processed data into the data. In the library 5. Specifically, if the data reading module 210 reads the data as a character, the data normalization module 213 determines whether the format of the character conforms to the standard. If the format of the character does not match Standard, the character format is changed to the standard character format. The standard of the character format can be defined by the user. For example, the standard of the user-defined character format is 8 bytes long. If the data character format read by the data reading module 210 is not 8 bytes long, then The data standardization module 213 changes the character format of the data to 8 bytes. If the data reading module 210 reads the data as an integer, the data normalization module 213 determines whether the range of the integer conforms to the standard. If the range of the integer does not conform to the standard, the integer is changed to an integer that conforms to the standard range. The standard of the integer range may be defined by the user. For example, the standard of the user-defined integer range is 0-255. If the integer read by the data reading module 210 is -1, the data normalization module 213 will use the integer- 1 Change to +1 or other user-defined values. Then, the data standardization module 213 performs consistency detection and repair on the modified new data, and compresses the modified new data 9 200844774 • and imports the compressed new data into the database 5. The data standardization module 213 is further configured to determine whether the data in the database 5 has been read. If the reading is not completed, the data reading module 210 reads the data of the next row in the database 5. Referring to Figure 4, there is shown a flow chart of a preferred embodiment of the database cleaning method of the present invention. First, in step S401, the data reading module 210 accesses the database 5 to acquire the attributes (database version, type, etc.) of the database 5. In step S402, the data reading module 210 reads a line of data from the database 5 (reads down from the first line) and acquires attributes of the line data. Among them, the data attributes include the tribute coding format, the character format and the length of the characters. In step S403, the encoding format changing module 211 determines whether the encoding format of the data read by the data reading module 210 conforms to the standard. If the encoding format of the material conforms to the standard, the flow proceeds directly to step S405. If the encoding format of the data does not conform to the standard, in step S404, the encoding format changing module 211 changes the encoding format of the data to a standard encoding format. The standard of the data encoding format can be defined by the user. For example, the standard of the user-defined data encoding format is UTF-8. If the data encoding format read by the data reading module 210 is not UTF-8, the encoding format changing module 211 will change the encoding format of the material to UTF-8. In step S405, the dirty data modification module 212 determines whether the data read by the data reading module 210 contains dirty data according to the setting of the data conversion dictionary 21. If the data does not contain dirty data, the flow proceeds directly to 10 200844774 to step S407. The data conversion dictionary 21 is used to define dirty data and modification rules of the dirty data. The dirty data is incomplete and erroneous data stored in the database 5, and the type of the dirty data can be defined by the user himself. The modification rule may be a hash function through which the dirty data can be converted into correct new data. As shown in Fig. 3, the data in the first column of the data conversion dictionary 21 is dirty data, and the data in the second column is new data. Through the operation of the hash function, the dirty data in the database 5 can be converted into corresponding new data. Suppose the hash function is Η (Χ) = Υ, where Η represents the conversion rule (ie, a data conversion function), X represents the old data, and Υ represents the new data. If the user defines cde as dirty data in the data conversion dictionary 21, and the conversion rule for defining the dirty data is to modify the dirty data cde to dec, through the conversion of the hash function, all the data in the database 5 The dirty data cde will be modified to dec. If the data contains dirty data, in step S406, the dirty data modification module 212 decomposes the poor material according to the elemental separation and recognition method, and modifies the dirty data according to the modification rule of the dirty data. For the corresponding new number. In step S407, the data standardization module 213 normalizes the data read by the data reading module 210, and introduces the standardized new data into the database 5. Specifically, if the data reading module 210 reads the data as a character, the data normalization module 213 determines whether the format of the character conforms to the standard. If the format of the character does not conform to the standard, the character format is changed to the standard character format. The standard of the character format can be defined by the user 11 200844774. The standard of the user-defined character format is 8 bytes long. If the data character format read by the data reading module 210 is not 8 bits. The leader of the data standardization module 213 will change the character format of the data to 8 bytes. If the data reading module 210 reads the data as an integer, the data normalization module 213 determines whether the range of the integers conforms to the standard. If the range of integers does not meet the criteria, the integer is changed to an integer that conforms to the standard range. The standard of the integer range can be defined by the user. For example, the standard of the user-defined integer range is 0-255. If the integer read by the data reading module 210 is -1, the data normalization module 213 will use the integer- 1 Change to +1 or other user-defined values. Then, the data standardization module 213 performs consistency detection and repair on the modified new data, compresses the modified new data, and imports the compressed new data into the database 5. In step S408, the data standardization module 213 determines whether the data in the database 5 has been read. If the reading is not completed, the flow returns to step S402. If the data in the database 5 is read, the flow ends. The database cleaning system and method of the present invention are disclosed above in the preferred embodiments, and are not intended to limit the present invention. Any person skilled in the art will be able to make modifications and refinements without departing from the spirit and scope of the invention, and the scope of the invention is defined by the scope of the appended claims. BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a hardware architecture of a preferred embodiment of the database cleaning system of the present invention. FIG. 2 is a functional module diagram of the host shown in FIG. 1. FIG. 3 is a schematic diagram of the data conversion dictionary shown in FIG. 1. 4 is a flow chart of a preferred embodiment of the database cleaning method of the present invention. [Main component symbol description] Display 1 Host 2 Keyboard 3 Mouse 4 Database 5 Data conversion dictionary 21 Data reading module 210 Encoding format change module 211 Dirty data modification module 212 Data standardization module 213 13

Claims (1)

200844774 . 十、申請專利範圍: 1 · 一種貧料庫清洗系統,包括主機及資料庫, 所述資料庫中存儲有資料及該資料的屬 性’所述資料屬性包括資料的編碼格式、字 元格式及字凡長短,其中,所述主機設置有 資料轉換字典,用於定義髒資料及所述髒資 料的修改規則,該主機還包括: , 資料讀取模組,用於訪問所述資料庫,獲取 所述資料庫的屬性,從資料庫中讀取一行資 料,並獲取該行資料的屬性; 編碼格式變更模組,用於在所述資料讀取模 t讀取的資料的編碼格式不符合設定標準 時,將該貧料的編碼格式變更為標準的編碼 格式; ,貝料修改模組,用於在所述資料讀取模組 σ貝取的貝料含有髒資料時,根據所述資料轉 換子典的a又置’按照所述髒資料的修改規 則’將所述資料中的髒資料修改為對應的新 資料;及 f料標準化模組’用於對所述資料讀取模組 貝取的資料進行標準化處理,將經過標準化 處理後的新資料導入所述資料庫中。 t申明專利範圍第1項所述的資料庫清洗系 、统’其中’所述髒資料是所述資料庫中存儲 200844774 • 的不完整、錯誤的資料。 3. 如申請專利範圍第2項所述的資料庫清 統,其中,所述修改規則是-種將所述髓:! 料轉換成正確的新資料的哈希函數。 貝 4. 如申請專利範圍第i項所述的資料庫、、土 、统,其中,所述標準化處理是指將資::土 元格式及範圍變更為標準的丰 予 圍。 軌 5. —種資料庫清洗方法,該方法包括如下牛 驟: V 設置資料轉換字典,定義資料庫中的髒資料 及所述髒資料的修改規則; ' / 訪問所述資料庫’獲取所述資料庫的屬性, 從資料庫中讀取一行資料,並獲取該行資料 的屬性’所述資料屬性包括資料的編碼格 式、字元格式及字元長短; 若所述項取的資料的編碼格式不符合設定標 準,則將該資料的編碼格式變更為標準的編 碼格式; 若所述讀取的資料含有髒資料,則根據所述 資料轉換字典的設置’按照所述髒資料的修 改規則,將所述資料中的髒資料修改為對應 的新資料; 對所述讀取的資料進行標準化處理,將經過 15 料導入所述資 變更資料編碼 化的步驟,直 理完畢。 200844774 標準化處理後的新資 重複所述讀取資料、 改髒資料及資料標準 料庫中的所有資料處 6.如申請專利範圍帛5項所述的資料 法,其中,所述髒資料是所述資料 的不元整、錯誤的資料。 7·如申請專利範圍第6項所述的資料 法,其中,所述修改規則是一種將 料轉換成正確的新資料的哈希函數 8·如申請專利範圍第5項所述的資料 法’其中’所述步驟對所述讀取的 標準化處理是指將所述讀取的資 格式及範圍變更為標準的字元格式 料庫.中, 格式、修 到所述資 庫清洗方 庫中存儲 庫清洗方 所述髒資 〇 庫清洗方 資料進行 料的字元 和範圍。 16200844774 . X. Patent application scope: 1 · A poor repository cleaning system, including a host and a database, wherein the database stores data and attributes of the data. The data attribute includes a data encoding format and a character format. And the length of the word, wherein the host is provided with a data conversion dictionary for defining the dirty data and the modified rules of the dirty data, the host further includes: a data reading module for accessing the database, Obtaining the attribute of the database, reading a row of data from the database, and obtaining attributes of the data of the line; and an encoding format changing module, wherein the encoding format of the data read by the data reading module t does not match When the standard is set, the coding format of the poor material is changed to a standard coding format; and the bedding modification module is configured to convert the data according to the data conversion when the material of the data reading module σ is dirty. The code a further sets the 'dirty data in the data to the corresponding new data according to the modification rule of the dirty data; and the f standardization module' is used for Said information read data module shell taken for the normalization, the new data after the introduction of the normalized library. The database cleaning system described in item 1 of the patent scope is the incomplete and erroneous data stored in the database. 3. The database clearing as described in claim 2, wherein the modifying rule is a hash function that converts the marrow material into a correct new material. 4. For the database, soil, and system mentioned in item i of the scope of patent application, the standardization process refers to changing the format and scope of the capital: soil element to the standard. Track 5. A database cleaning method, the method includes the following steps: V setting a data conversion dictionary, defining dirty data in the database and modifying rules of the dirty data; ' / accessing the database' to obtain the The attribute of the database, reading a line of data from the database, and obtaining the attribute of the line data. The data attribute includes the encoding format of the data, the character format and the length of the character; if the encoding format of the data taken by the item If the setting standard is not met, the encoding format of the data is changed to a standard encoding format; if the read data contains dirty data, according to the setting of the data conversion dictionary, according to the modification rule of the dirty data, The dirty data in the data is modified into corresponding new data; the standardized processing is performed on the read data, and the step of encoding the capital change data after 15 materials is directly processed. 200844774 Standardized and processed new capital repeats all the data in the reading data, the dirty data and the data standard library. 6. For the data method described in the patent application scope, the dirty data is Unclear and erroneous information about the information. 7. The data method as claimed in claim 6, wherein the modification rule is a hash function for converting the material into a correct new material. 8. The data method as described in claim 5 of the patent scope. Wherein the step of standardizing the reading refers to changing the read qualification and scope into a standard character format library. The format is repaired in the repository cleaning library. The characters and ranges of the materials used by the library cleaning party to clean the data. 16
TW96116802A 2007-05-11 2007-05-11 System and method for updating a database TWI402697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW96116802A TWI402697B (en) 2007-05-11 2007-05-11 System and method for updating a database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW96116802A TWI402697B (en) 2007-05-11 2007-05-11 System and method for updating a database

Publications (2)

Publication Number Publication Date
TW200844774A true TW200844774A (en) 2008-11-16
TWI402697B TWI402697B (en) 2013-07-21

Family

ID=44822713

Family Applications (1)

Application Number Title Priority Date Filing Date
TW96116802A TWI402697B (en) 2007-05-11 2007-05-11 System and method for updating a database

Country Status (1)

Country Link
TW (1) TWI402697B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970758A (en) * 2013-01-29 2014-08-06 鸿富锦精密工业(深圳)有限公司 Database accessing system and method
TWI695274B (en) * 2019-06-17 2020-06-01 建國科技大學 Method for checking input data with expert system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073141A (en) * 1997-11-25 2000-06-06 International Business Machine Corporation System and method for synchronizing local versions of database
US20050289132A1 (en) * 2004-06-24 2005-12-29 Brian Lee Method and system for converting encoding character set

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970758A (en) * 2013-01-29 2014-08-06 鸿富锦精密工业(深圳)有限公司 Database accessing system and method
TWI695274B (en) * 2019-06-17 2020-06-01 建國科技大學 Method for checking input data with expert system

Also Published As

Publication number Publication date
TWI402697B (en) 2013-07-21

Similar Documents

Publication Publication Date Title
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN110991170B (en) Chinese disease name intelligent standardization method and system based on electronic medical record information
WO2020253399A1 (en) Log classification rule generation method, device, apparatus, and readable storage medium
WO2021051867A1 (en) Asset information identification method and apparatus, computer device and storage medium
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
CN110147433B (en) Text template extraction method based on dictionary tree
US8316041B1 (en) Generation and processing of numerical identifiers
CN111159770B (en) Text data desensitization method, device, medium and electronic equipment
CN112541338A (en) Similar text matching method and device, electronic equipment and computer storage medium
CN108334609B (en) Method, device, equipment and storage medium for realizing JSON format data access in Oracle
CN112035355A (en) Data processing method, data processing device, computer equipment and storage medium
CN111370132A (en) Electronic file analysis method and device, computer equipment and storage medium
CN111949550B (en) Method, device, equipment and storage medium for automatically generating test data
TW202123026A (en) Data archiving method, device, computer device and storage medium
CN112328631A (en) Production fault analysis method and device, electronic equipment and storage medium
CN109545341A (en) Medical consumable management method and relevant device based on data analysis
CN115658837A (en) Address data processing method and device, electronic equipment and storage medium
CN113254457B (en) Account checking method, account checking system and computer readable storage medium
CN110647577A (en) Data cube partitioning method and device, computer equipment and storage medium
US20240028608A1 (en) Schema validation with support for ordering
TW200844774A (en) System and method for updating a database
CN108804561B (en) Data synchronization method and device
CN110704635A (en) Conversion method and device for ternary group data in knowledge graph
CN115017868A (en) Clinical diagnosis data standardization method and device
CN110781647B (en) Method for realizing data format verification based on Flink

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees