TWI755995B

TWI755995B - A method and a system for screening engineering data to obtain features, a method for screening engineering data repeatedly to obtain features, a method for generating predictive models, and a system for characterizing engineering data online

Info

Publication number: TWI755995B
Application number: TW109145986A
Authority: TW
Inventors: 顏均泰; 高志強; 蔡紹軍
Original assignee: 科智企業股份有限公司
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2022-02-21
Also published as: TW202226007A

Abstract

The present invention provides a method and a system for screening engineering data to obtain features, a method for screening engineering data multiple times to obtain features, a method for generating a prediction model, and a system for characterizing engineering data online. The system for screening engineering data to obtain features includes a judging unit, a statistical unit, and a processing unit. The judging unit judges whether a set of engineering data is numerical data or categorical data. If the judging unit judges that the engineering data is numerical data, the statistical unit performs a Kolmogorov-Smirnov test on each category in the engineering data and its corresponding fields to confirm whether the engineering data is normally distributed. The processing unit checks whether each of the categories reaches the corresponding threshold value in the first, second, third, and fourth threshold value groups, and defines each category that reaches its threshold value as a valid feature.

Description

A method and system for screening engineering data to obtain features, a method for screening engineering data multiple times to obtain features, a method for generating predictive models, and a system for online characterization of engineering data

The present invention relates to a method for screening engineering data to obtain features, a system for screening engineering data to obtain features, a method for generating a prediction model, and a system for characterizing engineering data online, and in particular to methods and systems for screening and processing engineering data efficiently and quickly in various engineering fields.

Generally speaking, the data generated in fields such as medical engineering, electrical engineering, and mechanical engineering must go through the steps of data acquisition, data processing, model selection, training, evaluation, and hyperparameter tuning before a prediction model can be produced, and a data scientist conventionally needs a great deal of time to construct a prediction model correctly. In other words, in the conventional technology, cleaning and organizing the data takes so long that machine learning cannot be applied effectively. Moreover, feature extraction is a lengthy process that conventionally relies on human domain knowledge, experience, and complicated data manipulation, and the resulting features are limited by human subjectivity. Although machine learning has many proven benefits, using it successfully requires enormous human effort, because no single algorithm or model can handle every possible situation. For example, although researchers in medical engineering are familiar with clinical data, they still lack the machine learning expertise needed to apply that clinical data to big data sources.

When facing a supervised machine learning problem, data scientists are usually responsible for creating explanatory variables (also known as features) that can predict the outcome of interest. Ideal feature engineering constructs features that not only provide useful insight into the data itself but also take into account any limitations of the learning algorithm used. This is not a trivial process, because the performance of a given machine learning algorithm depends heavily on the quality of the input data. That is, constructing features from raw data usually requires extensive domain knowledge and is therefore usually performed manually by human experts through trial and error. This makes feature engineering a crucial and time-consuming step in the machine learning pipeline. Feature engineering, also called feature construction, is the process of constructing new features from existing data in order to train a machine learning model. Feature construction matters more than the model actually used, because a machine learning algorithm can only learn from the data it is given, so constructing features relevant to the desired target is critical.

Furthermore, in the field of data analysis, researchers often use the relational database in feature tools to find potential features automatically through the relationships between data tables, achieving automated results close to what a data scientist produces by hand. However, analysis based on the relationships between tables in a relational database can only be applied to the data held in that database, that is, numerical data, and cannot be used on data with categorical features.

Furthermore, in the conventional technology, various calculations are exhaustively enumerated for numerical features, and a model is then used to verify whether the result improves; if it does, the new features are included in the starting features of the next generation, and this continues until the result no longer improves. However, this approach is only applicable to numerical data and cannot be used on categorical data.

The present invention was therefore developed to overcome the aforementioned problems.

To overcome the aforementioned technical problems, the present invention generates new features in several directions: time field data groups, associated field data groups, and domain field data groups. Compared with the conventional technology, it provides feature generation tailored to different technical fields. It uses statistical distributions to compare the similarity of each feature across two data sets and removes dissimilar features, supporting both numerical and categorical features. It therefore does not need the complex algorithms, relationship depths, or model-based validation of the conventional technology to find and evaluate features, nor does it need a trained algorithm to check the features found. This greatly and comprehensively improves the efficiency of selecting valid features, allows the processing to be automated, and achieves better accuracy than a data scientist.

To achieve the foregoing purpose, the present invention provides a method for screening engineering data to obtain features, comprising:

A: judging whether the engineering data is numerical data or categorical data; if the engineering data is numerical data, performing step A1; if the engineering data is judged to be categorical data, performing step A2;

wherein steps A1 and A2 are as follows:

A1: performing a Kolmogorov-Smirnov test on each category in the numerical data and its corresponding fields to confirm whether the engineering data is normally distributed; if the test result is a normal distribution, performing step B1; if the test result is a non-normal distribution, performing step B2;

A2: performing a Cramér's V coefficient test on each category in the categorical data and its corresponding fields to obtain a first threshold value group composed of a plurality of threshold values, and then performing step C1;

wherein steps B1 and B2 are as follows:

B1: performing a t-test on each category in the numerical data and its corresponding fields to obtain a second threshold value group composed of a plurality of threshold values, and then performing step C2;

B2: testing the dispersion of the fields corresponding to each category in the numerical data; if the dispersion of the fields corresponding to each category exceeds a preset value, performing a Kullback-Leibler (K-L) divergence test to obtain a third threshold value group composed of a plurality of threshold values, and then performing step C3; if the dispersion of the fields corresponding to each category in the numerical data does not exceed the preset value, performing a Mann-Whitney U test to obtain a fourth threshold value group composed of a plurality of threshold values, and then performing step C4;

wherein steps C1, C2, C3, and C4 are as follows:

C1: checking whether each of the categories reaches the corresponding threshold value in the first threshold value group, and defining each category that reaches its threshold value as a feature;

C2: checking whether each of the categories reaches the corresponding threshold value in the second threshold value group, and defining each category that reaches its threshold value as a feature;

C3: checking whether each of the categories reaches the corresponding threshold value in the third threshold value group, and defining each category that reaches its threshold value as a feature;

C4: checking whether each of the categories reaches the corresponding threshold value in the fourth threshold value group, and defining each category that reaches its threshold value as a feature.
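The branching among steps A, A1, A2, B1, and B2 can be summarized as a small routing function. This is only a sketch of the decision flow, not the patented implementation: the function name `select_test`, its parameters, and the default `dispersion_limit` are hypothetical, and the 0.05 significance level is the one the description gives for the KS-test.

```python
def select_test(is_numeric, ks_p_value=None, dispersion=None,
                dispersion_limit=1.0, alpha=0.05):
    """Return (test name, threshold group, checking step) for one data set.

    dispersion_limit stands in for the unspecified "preset value" of step B2
    and is a hypothetical default.
    """
    if not is_numeric:
        # Step A2: categorical data goes to the Cramér's V test.
        return ("Cramér's V", "first", "C1")
    if ks_p_value is None:
        raise ValueError("numerical data needs a KS-test p-value (step A1)")
    if ks_p_value > alpha:
        # Step A1 found a normal distribution, so step B1 applies.
        return ("t-test", "second", "C2")
    if dispersion is None:
        raise ValueError("non-normal data needs a dispersion value (step B2)")
    if dispersion > dispersion_limit:
        # Step B2, dispersion above the preset value.
        return ("Kullback-Leibler divergence", "third", "C3")
    # Step B2, dispersion at or below the preset value.
    return ("Mann-Whitney U test", "fourth", "C4")
```

For example, `select_test(True, ks_p_value=0.01, dispersion=0.5)` routes to the Mann-Whitney U test and the fourth threshold value group.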

In implementation, step A further includes: judging whether the engineering data is numerical data or categorical data; if the engineering data is numerical data, dividing the numerical data into two groups, performing a similarity test on the two groups, and retaining the categories in the numerical data that are significantly similar, together with their corresponding fields, before performing step A1; if the engineering data is judged to be categorical data, dividing the categorical data into two groups, performing a similarity test on the two groups, and retaining the significant categories in the categorical data, together with their corresponding fields, before performing step A2.

The present invention further provides a method for screening engineering data multiple times to obtain features, comprising:

a. performing data cleaning on the engineering data, including compensating for missing field values in the engineering data;

b. generating a plurality of field data groups from the engineering data, wherein the field data groups include a time field data group, a domain field data group, or an associated field data group; wherein the step of generating the domain field data group includes disassembling and combining a numerical field data group and a categorical field data group to produce the domain field data group; and wherein the step of generating the associated field data group includes identifying, by a statistical correlation test, the members of the numerical field data group and the categorical field data group that have a significant positive or negative correlation, and performing operations on and combining those members to produce the associated field data group;

c. screening the time field data group, the domain field data group, or the associated field data group by the method described above to obtain screened features.

In implementation, the method further includes, after step c:

d1: performing adversarial learning on the screened features with at least one machine learning algorithm to verify the similarity of the screened features; and

d2: for the screened features whose similarity is not significant, adjusting the similarity verification threshold by grid search and performing step A1 again; and retaining the screened features whose similarity is significant.

In implementation, in step a, the step of compensating for missing field values in the engineering data includes: filling with the median, filling with the minimum, filling with the maximum, filling with the mode, filling with a quartile, or filling from other similar rows; step a further includes removing, from the different fields corresponding to the categories, those fields that show no variation.
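The filling strategies of step a can be illustrated with the Python standard library. This is a minimal sketch under stated assumptions: missing values are represented as `None`, the function names and strategy keywords are hypothetical, the quartile option uses the first quartile (the text does not say which quartile is meant), and filling from similar rows is left out.

```python
import statistics

def fill_missing(values, strategy="median"):
    """Fill None entries of one field using a single summary statistic."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("field has no observed values to fill from")
    if strategy == "median":
        fill = statistics.median(present)
    elif strategy == "min":
        fill = min(present)
    elif strategy == "max":
        fill = max(present)
    elif strategy == "mode":
        fill = statistics.mode(present)
    elif strategy == "q1":
        # First quartile; an assumption, since the quartile is unspecified.
        fill = statistics.quantiles(present, n=4)[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def drop_constant_fields(table):
    """Remove fields whose values never vary (table maps name -> list)."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}
```

A field such as `[1, None, 3, 5]` becomes `[1, 3, 3, 5]` under the median strategy.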

In implementation, step b further includes a step of generating a categorical field data group, which includes: merging the first record in the categorical field data group, merging the last record in the categorical field data group, merging the middle record in the categorical field data group, merging the most frequently occurring record in the categorical field data group, or merging the records in the categorical field data group that vary.

In implementation, step b further includes a step of generating the time field data group, which includes extracting, from the time corresponding to the fields: year, month, day, day of the week, hour, minute, second, or 15-minute interval.
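The time field data group can be sketched by decomposing a timestamp into the components listed above. The function name `time_features`, the dictionary keys, and the encoding of the 15-minute interval as a slot index within the day are assumptions made for illustration.

```python
from datetime import datetime

def time_features(ts):
    """Decompose one timestamp into candidate time fields."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        # Index of the 15-minute slot within the day, 0 through 95.
        "quarter_hour": ts.hour * 4 + ts.minute // 15,
    }

features = time_features(datetime(2020, 12, 24, 13, 47, 5))
```

Each key then becomes one field of the time field data group, to be screened like any other field.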

In implementation, the step of generating the domain field data group further includes the following. The step of disassembling and combining the numerical field data group includes generating binary single-digit domain field data or decimal single-digit domain field data from the numerical field data group. The step of disassembling and combining the categorical field data group includes: splitting the strings in the categorical field data group to produce a plurality of split strings; counting the number of occurrences of each split string and retaining the split strings whose count is at or above a count threshold; and encoding each retained split string, wherein each code is distinct from the others.
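Both halves of the domain field generation can be sketched briefly. The helper names, the `-` separator, the default count threshold of 2, and the use of consecutive integers as the distinct codes are all assumptions; the text only requires that numbers be broken into binary or decimal single digits and that frequent split strings receive distinct codes.

```python
from collections import Counter

def digit_fields(value, base=10):
    """Split a non-negative integer into single digits, most significant first."""
    if value < 0 or base < 2:
        raise ValueError("expected a non-negative value and a base of at least 2")
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    return digits[::-1]

def encode_tokens(strings, sep="-", min_count=2):
    """Split strings, keep tokens seen at least min_count times, assign codes."""
    tokens = [tok for s in strings for tok in s.split(sep) if tok]
    counts = Counter(tokens)
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    # Enumerating the sorted tokens guarantees every code is distinct.
    return {tok: code for code, tok in enumerate(kept)}
```

With `["A-01-X", "A-02-X", "B-01-Y"]`, only `A`, `01`, and `X` occur twice, so only those three tokens receive codes.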

In implementation, the step of generating the associated field data group further includes:

performing a correlation test on the numerical field data group and applying the following operations to numerical field data having a significant positive or negative correlation to produce the associated field data group:

addition, subtraction, multiplication, division, taking the logarithm, or taking the angle of a trigonometric function; or

performing a correlation test on the categorical field data group and applying the following operations to categorical field data having a significant positive or negative correlation to produce the associated field data group: rearranging and recombining the strings; or encoding each of the rearranged and recombined strings to produce the associated field data group, wherein each code is distinct from the others.
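For the numerical case, the generation of associated fields can be sketched as below. The text calls for a statistical correlation test of significance; this sketch substitutes a plain cutoff on the absolute Pearson coefficient (`min_abs_r`, a hypothetical parameter), and it shows only the four arithmetic operations. The function names and the naming scheme for the new fields are also assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0  # constant fields carry no correlation

def associated_fields(table, min_abs_r=0.8):
    """Combine strongly correlated numerical fields into new candidate fields."""
    out = {}
    for (na, a), (nb, b) in combinations(table.items(), 2):
        if abs(pearson(a, b)) < min_abs_r:
            continue  # stand-in for "not significantly correlated"
        out[f"{na}+{nb}"] = [p + q for p, q in zip(a, b)]
        out[f"{na}-{nb}"] = [p - q for p, q in zip(a, b)]
        out[f"{na}*{nb}"] = [p * q for p, q in zip(a, b)]
        if all(q != 0 for q in b):
            out[f"{na}/{nb}"] = [p / q for p, q in zip(a, b)]
    return out
```

The generated fields are only candidates; they would still pass through the screening of step c.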

The present invention further provides a method for generating a prediction model, comprising:

X: labeling the screened features produced by the aforementioned method as a training group or a test group;

Y: mixing the screened features;

Z: distinguishing the training group from the test group with at least one machine learning algorithm, thereby establishing a prediction model.
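Steps X, Y, and Z resemble what is often called adversarial validation: if a classifier cannot tell the training group from the test group, the feature is distributed similarly in both. The sketch below is an assumption-laden illustration, not the claimed method. It uses a one-feature threshold classifier (a decision stump) as a stand-in for "at least one machine learning algorithm", reports in-sample accuracy only, and the function name `stump_accuracy` is hypothetical.

```python
import random

def stump_accuracy(train_values, test_values, seed=0):
    """Best accuracy of a single-threshold classifier at telling groups apart."""
    # Step X: label each value with its group (0 = training, 1 = test).
    rows = [(v, 0) for v in train_values] + [(v, 1) for v in test_values]
    # Step Y: mix the labelled rows.
    random.Random(seed).shuffle(rows)
    # Step Z: try every observed value as a cut point, in either direction.
    best = 0.0
    for cut, _ in rows:
        hits = sum((v > cut) == bool(label) for v, label in rows)
        best = max(best, hits / len(rows), (len(rows) - hits) / len(rows))
    return best
```

An accuracy near 0.5 suggests the two groups are hard to distinguish on this feature, while an accuracy near 1.0 suggests the feature differs between them.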

The present invention further provides a system for screening engineering data to obtain features, comprising a processor that includes a judging unit, a statistical unit, and a processing unit. The judging unit judges whether the engineering data is numerical data or categorical data. If the judging unit judges that the engineering data is numerical data, the statistical unit performs a Kolmogorov-Smirnov test on each category in the engineering data and its corresponding fields to confirm whether the engineering data is normally distributed. If the judging unit judges that the engineering data is not numerical data, the statistical unit performs a Cramér's V coefficient test on each category in the engineering data and its corresponding fields to obtain a first threshold value group composed of a plurality of threshold values. If the test result of the statistical unit is a normal distribution, the statistical unit performs a t-test on each category in the engineering data and its corresponding fields to obtain a second threshold value group composed of a plurality of threshold values. If the test result of the statistical unit is a non-normal distribution, the statistical unit tests the dispersion of the fields corresponding to each category; if the statistical unit judges that the dispersion of the fields corresponding to each category exceeds a preset value, it performs a Kullback-Leibler (K-L) divergence test to obtain a third threshold value group composed of a plurality of threshold values. If the statistical unit judges that the dispersion of the fields corresponding to each category does not exceed the preset value, it performs a Mann-Whitney U test to obtain a fourth threshold value group composed of a plurality of threshold values. The processing unit checks whether each of the categories reaches the corresponding threshold value in the first, second, third, and fourth threshold value groups, and defines each category that reaches its threshold value as a feature.

The present invention further provides a system for characterizing engineering data online, comprising a server that includes a processor, the processor including a storage unit and a processing unit. The storage unit receives original engineering data input from a client and stores it; the processing unit reads the original engineering data from the storage unit and processes it by the aforementioned method to obtain a plurality of screened features.

For a further understanding of the present invention, preferred embodiments are given below, and the specific composition of the present invention and the effects it achieves are described in detail with reference to the drawings and reference numerals.

A, A1, A2, B1, B2, C1, C2, C3, C4, a, b, c, d, d1, d2: steps

1: processor

11: judging unit

12: statistical unit

13: processing unit

2: system for characterizing engineering data online

FIG. 1 is a flowchart of an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 2 is a flowchart of an embodiment of the method of the present invention for screening engineering data multiple times to obtain features.

FIG. 3 is a schematic diagram of an embodiment of the system of the present invention for screening features in engineering data.

FIG. 4 is a schematic diagram of an embodiment of the system of the present invention for characterizing engineering data online.

FIG. 5 shows the original engineering data table in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 6 shows the engineering data table after time-series sampling in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 7 shows time feature generation applied to the engineering data in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIGS. 8 and 9 show an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 10 shows feature correlation statistics for the engineering data in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 11 shows associated feature generation applied to the engineering data in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIGS. 12 to 15 show feature filtering applied to the engineering data in an embodiment of the method of the present invention for screening engineering data to obtain features.

FIG. 16 shows the features retained after screening the engineering data in an embodiment of the method of the present invention for screening engineering data to obtain features.

Referring to FIG. 1, the present invention provides a method for screening engineering data to obtain features, comprising:

A: judging whether the engineering data is numerical data or categorical data; if the engineering data is numerical data, performing step A1; if the engineering data is judged to be categorical data, performing step A2;

wherein steps A1 and A2 are as follows:

A1: performing a Kolmogorov-Smirnov test on each category in the numerical data and its corresponding fields to confirm whether the engineering data is normally distributed; if the test result is a normal distribution, performing step B1; if the test result is a non-normal distribution, performing step B2;

A2: performing a Cramér's V coefficient test on each category in the categorical data and its corresponding fields to obtain a first threshold value group composed of a plurality of threshold values, and then performing step C1;

wherein steps B1 and B2 are as follows:

B1: performing a t-test on each category in the numerical data and its corresponding fields to obtain a second threshold value group composed of a plurality of threshold values, and then performing step C2;

B2: testing the dispersion of the fields corresponding to each category in the numerical data; if the dispersion of the fields corresponding to each category exceeds a preset value, performing a Kullback-Leibler (K-L) divergence test to obtain a third threshold value group composed of a plurality of threshold values, and then performing step C3; if the dispersion of the fields corresponding to each category in the numerical data does not exceed the preset value, performing a Mann-Whitney U test to obtain a fourth threshold value group composed of a plurality of threshold values, and then performing step C4;

wherein steps C1, C2, C3, and C4 are as follows:

C1: checking whether each of the categories reaches the corresponding threshold value in the first threshold value group, and defining each category that reaches its threshold value as a feature;

C2: checking whether each of the categories reaches the corresponding threshold value in the second threshold value group, and defining each category that reaches its threshold value as a feature;

C3: checking whether each of the categories reaches the corresponding threshold value in the third threshold value group, and defining each category that reaches its threshold value as a feature;

C4: checking whether each of the categories reaches the corresponding threshold value in the fourth threshold value group, and defining each category that reaches its threshold value as a feature.

Referring to FIG. 3, the present invention further provides a system for screening engineering data to obtain features, comprising a processor 1 that includes a judging unit 11, a statistical unit 12, and a processing unit 13. The judging unit 11 judges whether a set of engineering data is numerical data or categorical data. If the judging unit 11 judges that the engineering data is numerical data, the statistical unit 12 performs a Kolmogorov-Smirnov test on each category in the engineering data and its corresponding fields to confirm whether the engineering data is normally distributed. If the judging unit 11 judges that the engineering data is not numerical data, the statistical unit 12 performs a Cramér's V coefficient test on each category in the engineering data and its corresponding fields to obtain a first threshold value group composed of a plurality of threshold values. If the test result of the statistical unit 12 is a normal distribution, the statistical unit 12 performs a t-test on each category in the engineering data and its corresponding fields to obtain a second threshold value group composed of a plurality of threshold values. If the test result of the statistical unit 12 is a non-normal distribution, the statistical unit 12 tests the dispersion of the fields corresponding to each category; if the statistical unit 12 judges that the dispersion of the fields corresponding to each category exceeds a preset value, it performs a Kullback-Leibler (K-L) divergence test to obtain a third threshold value group composed of a plurality of threshold values. If the statistical unit 12 judges that the dispersion of the fields corresponding to each category does not exceed the preset value, it performs a Mann-Whitney U test to obtain a fourth threshold value group composed of a plurality of threshold values. The processing unit 13 checks whether each of the categories reaches the corresponding threshold value in the first, second, third, and fourth threshold value groups, and defines each category that reaches its threshold value as a feature.

The method and system of the present invention are described in detail below. The method and system for screening engineering data to obtain features process engineering data of many kinds, including but not limited to engineering data from financial engineering, chemical engineering, mechanical engineering, and biomedical engineering. First, in step A, the judging unit 11 judges whether the engineering data is numerical data or categorical data. If the engineering data is numerical data, step A1 is performed; if the judging unit 11 judges that the engineering data is categorical data, step A2 is performed. Steps A1 and A2 are described as follows. In step A1, the statistical unit 12 performs the Kolmogorov-Smirnov test on each category in the numerical data and its corresponding fields to confirm whether the engineering data is normally distributed. The Kolmogorov-Smirnov test (hereinafter, KS-test) is a statistical data analysis method designed for testing distributed numerical data (a distributed data set) rather than discrete data. The KS-test makes no assumption about the distribution of the numerical data and is highly sensitive to both the shape and the location of the cumulative distribution function (CDF) of the data, so it can accurately assess the relative distribution of numerical data. If the test result is a normal distribution (p-value > 0.05), step B1 is performed; if the test result is a non-normal distribution (p-value < 0.05), step B2 is performed.
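The normality check of step A1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `ks_normality` and `next_step` are hypothetical, the normal distribution is fitted to the sample's own mean and standard deviation (an assumption the text does not state), and the p-value comes from the asymptotic Kolmogorov series with a small-sample correction, so it is only approximate.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normality(values):
    """Return (D, approximate p-value) for a one-sample KS test against a
    normal distribution fitted to the sample (assumption: fitted parameters)."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Largest gap between the fitted CDF and the empirical step function.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        # The alternating series converges poorly here; the true value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def next_step(values, alpha=0.05):
    """Route to step B1 (normal) or B2 (non-normal) using the 0.05 cut-off."""
    _, p = ks_normality(values)
    return "B1" if p > alpha else "B2"


# Illustrative data only: evenly spaced normal quantiles and a skewed series.
bell_shaped = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(i / 5) for i in range(50)]
```

Because the parameters are estimated from the same sample, the p-value tends to be conservative; a production system would more likely use a statistics library's normality test.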

Next, in step A2, the statistical unit 12 performs a Cramér's V coefficient test on each category in the categorical data and its corresponding fields to obtain a first threshold group consisting of a plurality of thresholds, and then step C1 is performed. Cramér's V is the strength-of-association measure used with statistical tests related to the chi-square test. The Cramér's V test produces a first threshold for each category in the categorical data, and these first thresholds measure the degree of association between the fields of at least two categories.
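For reference, Cramér's V can be computed from the chi-square statistic of a contingency table as V = sqrt(chi2 / (n * (min(r, c) - 1))). The sketch below assumes two categorical fields given as equal-length lists; the function name `cramers_v` is illustrative, and how the patent turns V into a threshold group is not specified in the text.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two categorical fields (0 = none, 1 = perfect)."""
    n = len(xs)
    rows, cols = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(rows), len(cols)) - 1
    if n == 0 or k == 0:
        return 0.0  # a constant field carries no association
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))
```

A value near 1 indicates that one field almost determines the other, while a value near 0 indicates that the two fields are close to independent.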

Steps B1 and B2 are described as follows. In step B1, the statistical unit 12 performs a t-test on each category in the numerical data and its corresponding fields to obtain a second threshold group consisting of a plurality of thresholds, and then step C2 is performed. The t-test assumes that the field (dependent variable) corresponding to a category is a continuous variable, that the field is randomly sampled from a population, and that the population is normally distributed. Because the numerical data at this step has already passed the KS-test, it satisfies these assumptions. In step B2, the statistical unit 12 checks the dispersion of the plurality of fields corresponding to each category in the numerical data. If the dispersion exceeds a preset value, the statistical unit 12 performs a K-L divergence test to obtain a third threshold group consisting of a plurality of thresholds, and then step C3 is performed. The K-L divergence test assesses the amount of information lost when one hypothesized distribution is used to approximate another. If the statistical unit 12 determines that the dispersion does not exceed the preset value, the Mann-Whitney U test is performed to obtain a fourth threshold group consisting of a plurality of thresholds, and then step C4 is performed. The purpose of the Mann-Whitney U test is to compare at least two random samples and to infer from their difference the difference between the two populations. The inference is based on the sampling distribution formed by the engineering data: the test statistic U is calculated from the ranks of the variable scores in the samples.
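The branch in step B2 can be sketched as below. Several choices here are assumptions, since the text does not fix them: dispersion is measured by the coefficient of variation of the pooled values, the preset value is 1.0, and the K-L divergence is estimated from equal-width histograms. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Dispersion measure (assumption): standard deviation over |mean|."""
    m = mean(values)
    return pstdev(values) / abs(m) if m else math.inf


def kl_divergence(p_sample, q_sample, bins=10, eps=1e-9):
    """Histogram estimate of D_KL(P || Q) for two numeric samples."""
    lo = min(min(p_sample), min(q_sample))
    hi = max(max(p_sample), max(q_sample))
    width = (hi - lo) / bins or 1.0

    def hist(sample):
        counts = [0] * bins
        for v in sample:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [(c + eps) / (len(sample) + eps * bins) for c in counts]

    p, q = hist(p_sample), hist(q_sample)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mann_whitney_u(a, b):
    """Mann-Whitney U statistic (smaller of the two U values), ties count 0.5."""
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return min(u_a, len(a) * len(b) - u_a)


def step_b2(group_a, group_b, dispersion_limit=1.0):
    """Pick the K-L branch (C3) or the Mann-Whitney branch (C4)."""
    pooled = list(group_a) + list(group_b)
    if coefficient_of_variation(pooled) > dispersion_limit:
        return "C3", kl_divergence(group_a, group_b)
    return "C4", mann_whitney_u(group_a, group_b)
```

The pairwise U computation is quadratic in the sample size; it is kept for readability, and a rank-based version would be preferable for large groups.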

Steps C1, C2, C3, and C4 are as follows. In step C1, the processing unit 13 checks whether each of the categories reaches the corresponding threshold in the first threshold group, and defines each category that reaches its threshold as a feature. In step C2, the processing unit 13 checks whether each of the categories reaches the corresponding threshold in the second threshold group, and defines each category that reaches its threshold as a feature. In step C3, the processing unit 13 checks whether each of the categories reaches the corresponding threshold in the third threshold group, and defines each category that reaches its threshold as a feature. In step C4, the processing unit 13 checks whether each of the categories reaches the corresponding threshold in the fourth threshold group, and defines each category that reaches its threshold as a feature.
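Steps C1 to C4 share one shape: compare each category's score with its own threshold and keep the categories that reach it. A minimal sketch, assuming scores and thresholds are held in dictionaries keyed by category name (the names and numbers below are illustrative, not taken from the figures):

```python
def select_features(scores, thresholds):
    """Keep every category whose score reaches its corresponding threshold."""
    return [name for name, score in scores.items() if score >= thresholds[name]]


# Hypothetical scores and per-category thresholds for illustration only.
example_scores = {"actual_speed_max": 0.8, "actual_speed_min": 0.3}
example_thresholds = {"actual_speed_max": 0.5, "actual_speed_min": 0.5}
selected = select_features(example_scores, example_thresholds)
```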

The foregoing technical solution uses statistical analysis to compare the similarity of features across multiple data sets and removes dissimilar engineering data, and it applies to both numerical and categorical categories in engineering data. As a result, without relying on the algorithms of the prior art, such as algorithms trained on large amounts of data, to find and evaluate features in engineering data, the invention achieves an efficiency comparable to that of algorithm-based feature search. In other embodiments of the present invention, the U-test is commonly used for engineering data in mechanical manufacturing (where data values from the same machine or the same type of equipment are similar) and for engineering data in the medical field (where similar instruments yield similar readings), whereas the K-L test is commonly used for engineering data in the financial field, because the values in financial engineering vary across dates and regions.

In another embodiment, step A further includes judging whether the engineering data is numerical data or categorical data. If the engineering data is numerical data, the numerical data is divided into two groups, a similarity test is performed on the two groups, and the significant categories in the numerical data and their corresponding fields are retained before step A1 is performed. The similarity analysis of the present invention is used in data processing flows for classification and clustering. Similarity is analyzed according to the attribute values of the engineering data itself, which involves attribute value processing, similarity metrics, and so on. If the engineering data is judged to be categorical data, the categorical data is divided into two groups, a similarity test is performed on the two groups, and the significant categories in the categorical data and their corresponding fields are retained before step A2 is performed. The similarity metrics of the present invention include Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, Mahalanobis distance, cosine similarity, Hamming distance, Jaccard distance and the Jaccard similarity coefficient, the correlation coefficient and correlation distance, information entropy, and others, which are not described further here.
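Several of the listed metrics have short closed forms. The sketch below gives standard textbook definitions for a subset of them, assuming equal-length numeric vectors (or sequences, for the Hamming distance); it does not show which metric the invention applies in which case.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def manhattan(a, b):
    return minkowski(a, b, 1)


def chebyshev(a, b):
    """Largest coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two collections."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```

The Jaccard distance is one minus the Jaccard similarity, so both quantities named in the text follow from the last function.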

Referring to FIG. 2, the present invention further provides a method for screening engineering data multiple times to obtain features, which includes:

a. performing data cleaning on engineering data, which includes:

compensating for missing field values in the engineering data;

b. generating a plurality of field data groups from the engineering data, wherein the field data groups include a time field data group, a domain field data group, or an associated field data group; wherein the step of generating the domain field data group includes disassembling and combining a numerical field data group or a categorical field data group to generate the domain field data group;

wherein the step of generating the associated field data group includes: identifying, by a statistical test of correlation, the fields in the numerical field data group and the categorical field data group that have a significant positive or negative correlation; and performing operations on and combining the fields having a significant positive or negative correlation to generate the associated field data group;

c. screening the time field data group, the domain field data group, or the associated field data group by the foregoing embodiment of the method for screening engineering data to obtain features, thereby obtaining features that have been screened multiple times.

The method for screening engineering data multiple times to obtain features is described in detail below. First, in step a, data cleaning is performed on engineering data, including engineering data from fields such as financial engineering, chemical engineering, mechanical engineering, and biomedical engineering. Data cleaning is the first step in data science (DS) or machine learning (ML), so that the truly critical features can be found progressively in the subsequent steps. In another embodiment, the data cleaning step includes compensating for missing field values in the engineering data, removing categories whose fields do not vary, checking for outlier data groups, and so on. In general, engineering data in any field is rarely complete, so missing field values must be compensated for, or categories whose fields do not vary must be removed, so as not to affect the construction of subsequent training models or to pick up unhelpful features that lead to overfitting.

Furthermore, in some cases the engineering data conceals other data groups that differ greatly from the main data, so in other embodiments step a also identifies outlier data groups so that they can be handled individually or separately. In step a, the step of compensating for missing field values in the engineering data includes imputing with the median, the minimum, the maximum, the mode, a quartile, or values from other similar rows. In another embodiment, step a further includes removing categories whose fields do not vary, for example categories whose values do not change across their different corresponding fields, so that such categories do not affect the analysis and prediction results of subsequent data processing.
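The cleaning operations of step a can be sketched as follows, assuming each field is a list in which missing values are represented by `None`. The function names, the strategy labels, and the choice of the first quartile for quartile imputation are assumptions made for illustration; imputation from similar rows is not shown.

```python
from statistics import median, multimode, quantiles


def impute(column, strategy="median"):
    """Fill None entries of one field using the chosen summary statistic."""
    present = [v for v in column if v is not None]
    strategies = {
        "median": median,
        "min": min,
        "max": max,
        "mode": lambda vs: multimode(vs)[0],
        "q1": lambda vs: quantiles(vs, n=4)[0],  # assumption: first quartile
    }
    fill = strategies[strategy](present)
    return [fill if v is None else v for v in column]


def drop_constant_fields(table):
    """Remove fields whose values never change (table: name -> list)."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}
```

A typical order would be to impute first and then drop constant fields, so that a field consisting of one value plus gaps is also recognized as constant.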

Next, in step b, a plurality of field data groups are derived from the engineering data, wherein the field data groups include a time field data group, a domain field data group, or an associated field data group. The step of generating the domain field data group includes disassembling and combining a numerical field data group or a categorical field data group to generate the domain field data group. In one embodiment, the step of generating the associated field data group includes identifying, by a statistical test of correlation, the fields in the numerical field data group and the categorical field data group that have a significant positive or negative correlation, and performing operations on and combining those fields to generate the associated field data group. In another embodiment, the step of generating the numerical field data group includes merging the values of the fields corresponding to a category in the engineering data by their maximum, minimum, mean, median, or mode. Step b further includes a step of generating the categorical field data group, which includes merging by the first record, the last record, the middle record, the most frequent value, or the values that change in the categorical field data group. The step of generating the time field data group includes taking, from the time corresponding to the fields, the year, month, day, day of the week, hour, minute, second, or 15-minute interval.
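The merging rules for numerical and categorical fields can be sketched as two small summarizers applied to the values that fall into one merged record. The function names are hypothetical, and representing "the values that change" as a boolean flag is only one possible reading of the text.

```python
from collections import Counter
from statistics import mean, median


def merge_numeric(values):
    """Summarize the numeric values of one field within one merged record."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean(values),
        "median": median(values),
        "mode": Counter(values).most_common(1)[0][0],
    }


def merge_categorical(values):
    """Summarize the categorical values of one field within one merged record."""
    return {
        "first": values[0],
        "last": values[-1],
        "middle": values[len(values) // 2],
        "most_common": Counter(values).most_common(1)[0][0],
        "changed": len(set(values)) > 1,  # assumption: flag whether it varied
    }
```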

Referring to FIG. 4, the present invention further provides a system for automatically characterizing engineering data. The system 2 includes a server. The server includes a processor, and the processor includes a storage unit and a processing unit. The storage unit receives raw engineering data input from a client and stores the raw engineering data. The processing unit reads the raw engineering data from the storage unit and processes it by the method described in the foregoing embodiments to obtain a plurality of screened features. In other embodiments, the claimed method can be deployed in a cloud system, allowing a client to input engineering data remotely over the Internet or a local area network; the raw engineering data input by the client is then screened for important features by the method described in the foregoing embodiments.

The method and system of the present invention are illustrated below with engineering data from the field of mechanical engineering. Referring to FIG. 5, the table in FIG. 5 lists raw engineering data from a tool machining process, including the timestamp, commanded speed, actual speed, spindle current, tool compensation ratio, idle reason, and so on. Referring to FIG. 6, FIG. 6 shows the raw tool machining data after a suitable time frequency has been derived automatically from the sampling rate of the data and the data has been merged over time. For example, data sampled once per second is automatically merged into one record per 10 seconds, because according to sampling theory the sampling rate must be at least 10 times higher than the phenomenon to be observed before that phenomenon can be observed.
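The time merging step can be sketched as bucketing rows into windows ten sampling periods long. The sketch assumes rows of `(epoch_seconds, value)` with integer timestamps and summarizes each window by maximum, minimum, and mean; the function name and the output layout are illustrative, not the layout of FIG. 6.

```python
from collections import defaultdict
from statistics import mean


def merge_by_time(rows, sampling_period_s=1, factor=10):
    """Merge (timestamp, value) rows into windows of factor * sampling period.

    Returns a list of (window_start, max, min, mean) sorted by window start.
    """
    window = sampling_period_s * factor
    buckets = defaultdict(list)
    for t, value in rows:
        buckets[t // window * window].append(value)
    return [(start, max(vs), min(vs), mean(vs))
            for start, vs in sorted(buckets.items())]
```

With a 1-second sampling period and the factor of 10 given in the text, twenty consecutive samples collapse into two merged records.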

Referring to FIG. 7, FIG. 7 shows the time features generated from the timestamp in the raw tool machining data. The step of generating the time features includes taking, from the time corresponding to the fields, the year, month, day, day of the week, hour, minute, second, or 15-minute interval.
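Extracting these time features from a timestamp is straightforward with the standard `datetime` type. The key names below are hypothetical, and the 15-minute interval is encoded as the index of the quarter hour within the hour, which is an assumption about how the interval is represented.

```python
from datetime import datetime


def time_features(ts):
    """Split one timestamp into the calendar and clock fields named in the text."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "weekday": ts.isoweekday(),       # Monday = 1 ... Sunday = 7
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        "quarter_hour": ts.minute // 15,  # 0..3 within the hour
    }


features = time_features(datetime(2020, 12, 24, 13, 47, 5))
```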

In another embodiment, the method further includes, after step c: in step d1, performing adversarial learning on the screened features with at least one machine learning algorithm to verify the similarity of the screened features; and in step d2, for screened features whose similarity is not significant, adjusting the similarity verification threshold by grid search and performing step A1 again, while retaining the screened features whose similarity is significant. Machine learning offers many classification models, such as LR, SVM, and XGBoost, as well as deep learning models such as CNN and LSTM, and each model has its own parameter settings that must be tuned and selected. When the volume of engineering data is too large, grid search easily consumes substantial system resources, so the present invention applies grid search only after the preceding steps, which greatly improves the efficiency of the grid search.
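The idea behind the adversarial check in step d1 is that a classifier should not be able to tell the two groups apart when a feature is similar in both, which shows up as an AUC near 0.5. The sketch below is a deliberately simplified stand-in: instead of training LR, SVM, or XGBoost, it uses the raw feature value as the classifier score, one feature at a time. The function names and the `max_auc` cut-off are assumptions.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive scores above a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def adversarial_auc(train_values, test_values):
    """AUC for telling the groups apart; 0.5 = indistinguishable, 1.0 = separable."""
    a = auc(test_values, train_values)
    return max(a, 1.0 - a)


def split_by_similarity(features, max_auc=0.7):
    """features: name -> (train_values, test_values). Returns (kept, rejected)."""
    kept, rejected = [], []
    for name, (train_values, test_values) in features.items():
        target = kept if adversarial_auc(train_values, test_values) <= max_auc else rejected
        target.append(name)
    return kept, rejected
```

A real implementation would train a classifier on all screened features jointly and evaluate it on held-out rows; the single-feature score only illustrates how the AUC is read.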

Furthermore, in other embodiments, the step of generating the domain field data group further includes the following. The step of disassembling and combining the numerical field data group to generate the domain field data group includes generating, from the numerical field data group, binary single-digit domain field data groups or decimal single-digit domain field data groups. The step of disassembling and combining the categorical field data group includes: splitting the strings in the categorical field data group to produce a plurality of split strings; counting the number of occurrences of each split string and keeping the split strings whose count is at or above a count threshold; and encoding each of the remaining split strings, with every code distinct from the others so that each code is independent. The present invention thereby generates new, meaningful categorical field data from the categorical field data group.
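Both decompositions can be sketched briefly. The sketch assumes non-negative integer values for the digit split and an underscore separator for the string split; the function names, the separator, and the default count threshold are hypothetical.

```python
from collections import Counter


def digit_fields(value, base=10):
    """Split a non-negative integer into single digits, least significant first."""
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            return digits


def encode_tokens(strings, sep="_", min_count=2):
    """Split strings, keep tokens seen at least min_count times, assign codes."""
    counts = Counter(token for s in strings for token in s.split(sep))
    kept = sorted(token for token, c in counts.items() if c >= min_count)
    # enumerate() guarantees that every kept token receives a distinct code.
    return {token: code for code, token in enumerate(kept)}
```

Calling `digit_fields` with `base=2` yields the binary single-digit fields, and with `base=10` the decimal ones.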

Referring to FIG. 8, FIG. 8 shows the aforementioned step of generating the domain field data group, in which binary single-digit numerical field data groups are generated from the numerical field data group. FIG. 9 shows an embodiment in which the numerical field data group is disassembled and combined to generate the domain field data group. FIG. 10 shows the step of generating the associated field data group: after a correlation test is performed on the numerical data, the significantly correlated numerical data (p-value < 0.05) undergoes operations such as addition, subtraction, multiplication, division, taking the logarithm, and taking trigonometric functions of angles, producing the result shown in FIG. 11. For example, in FIG. 10 the correlated categories maximum actual speed and maximum spindle current are multiplied to generate a domain field data group.
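Generating product fields from correlated numeric fields can be sketched as below. The text selects pairs by a significance test (p-value < 0.05); because a p-value needs the t distribution, which the standard library does not provide, this sketch substitutes a cut-off on the absolute Pearson correlation. That substitution, the function names, and the sample numbers are all assumptions, and fields are assumed to be non-constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric fields."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def associated_fields(table, min_abs_r=0.8):
    """Multiply every pair of strongly correlated fields (table: name -> list)."""
    names = sorted(table)
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson_r(table[a], table[b])) >= min_abs_r:
                out[f"{a}*{b}"] = [x * y for x, y in zip(table[a], table[b])]
    return out


# Invented sample values for illustration only.
sample = {
    "actual_speed_max": [100, 200, 300, 400],
    "spindle_current_max": [1.0, 2.0, 3.0, 4.0],
    "tool_offset": [5, 1, 4, 2],
}
products = associated_fields(sample)
```

The other operations named in the text (sum, difference, quotient, logarithm, trigonometric functions) would follow the same pattern with a different element-wise expression.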

In another embodiment, the step of generating the associated field data group further includes performing a correlation test on the numerical data and applying the following operations to the numerical data having a significant positive or negative correlation to generate a numerical field data group: addition, subtraction, multiplication, division, taking the logarithm, and taking various trigonometric functions at various angles. Alternatively, a correlation test, such as a test from the chi-square family, is performed on the categorical data group, and the following operations are applied to the categorical field data groups having a significant positive or negative correlation to generate the associated field data group: rearranging and recombining strings, for example recombining the letters A, B, and C into ABC, ACB, BCA, and so on; or encoding each of the rearranged strings, with every code distinct from the others, to generate the associated field data group. For example, the strings AB, AC, … are mapped to 110, 101, and 111, respectively.
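The string rearrangement and encoding can be sketched with `itertools.permutations`. The sequential integer codes used here are an assumption chosen only to guarantee distinct codes; they are not the bit-pattern codes of the example in the text, and the function name is hypothetical.

```python
from itertools import permutations


def rearranged_codes(symbols):
    """Enumerate every ordering of the symbols and give each a distinct code."""
    strings = ["".join(p) for p in permutations(symbols)]
    return {s: code for code, s in enumerate(strings)}


codes = rearranged_codes("ABC")
```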

The present invention further provides a method for generating a predictive model, which includes the following steps. In step X, the screened features produced by the method of the foregoing embodiments are labeled as a training group or a test group, where the training group serves as training data and the test group serves as test data, in order to determine whether the features found by the method and system of the present invention are accurate. In step Y, the screened features are mixed. In step Z, at least one machine learning algorithm is used to distinguish the training group from the test group, and a predictive model is established on that basis.

FIGS. 12 to 16 show the process from step A through steps C1, C2, C3, and C4 in the foregoing embodiment. The processing unit 13 checks whether each of the categories reaches the corresponding threshold in the first, second, third, or fourth threshold group; a category that reaches its threshold is defined as a feature, and a category that does not is removed. For example, as shown in FIG. 12, the category maximum actual speed, which exceeds the set similarity threshold (0.5) after the KS-test and U-test, is retained, while minimum actual speed, which does not reach the set threshold, is removed. FIGS. 12 to 16 also show the results of steps d1 and d2 of the embodiment, in which the screened features undergo adversarial learning with at least one machine learning algorithm to verify their similarity. For example, as shown in FIG. 12, the adversarial validation yields an AUC of 0.98, after which the similarity threshold is adjusted by grid search. Referring to FIG. 13, categories are then kept or removed according to the adjusted similarity threshold; for example, the category commanded speed is removed in this step. Finally, referring to FIG. 16, once the AUC value has stabilized, the five features finally selected in this embodiment are obtained.
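The threshold adjustment loop can be sketched as a grid search over candidate similarity thresholds, scoring each candidate by how close the adversarial AUC of the kept features is to 0.5. The selection criterion, the function name, and the callback `auc_for` (which stands for whatever adversarial validation routine is in use) are assumptions; the numbers in the example are invented apart from the 0.98 AUC and the 0.5 starting threshold mentioned in the text.

```python
def grid_search_threshold(similarity, auc_for, grid):
    """Evaluate every threshold in the grid and return (threshold, kept features)
    for the candidate whose adversarial AUC is closest to 0.5."""
    best = None
    for threshold in sorted(grid):
        kept = sorted(name for name, s in similarity.items() if s >= threshold)
        if not kept:
            continue  # nothing left to validate at this threshold
        gap = abs(auc_for(kept) - 0.5)
        if best is None or gap < best[0]:
            best = (gap, threshold, kept)
    return best[1:] if best else (None, [])


# Invented similarity scores and per-feature AUC values for illustration.
similarity = {"actual_speed_max": 0.9, "spindle_current_max": 0.7,
              "command_speed": 0.55}
toy_auc = {"actual_speed_max": 0.52, "spindle_current_max": 0.58,
           "command_speed": 0.98}
result = grid_search_threshold(
    similarity,
    auc_for=lambda kept: max(toy_auc[name] for name in kept),
    grid=[0.5, 0.6, 0.7, 0.8],
)
```

In this toy run the starting threshold of 0.5 keeps all three features and gives an AUC of 0.98, and raising the threshold removes the commanded speed first, which mirrors the sequence described for FIGS. 12 and 13. Scoring by closeness to 0.5 alone favors small feature sets, so a practical criterion would also weigh how many features survive.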

Therefore, the present invention has the following advantages:

1. Compared with existing feature screening methods, the present invention adds time merging and domain feature decomposition of the raw engineering data.

2. The present invention uses a combination of multiple statistical tests to screen and evaluate features quickly, so effective features can be screened out without time-consuming algorithms or conventional correlation layers or model checks.

3. Selecting features by similarity effectively removes features that are highly correlated but have low similarity, reduces overfitting in subsequent models, and improves model accuracy.

4. The present invention automates the processing, achieves accuracy close to that of a data scientist, supports both numerical and categorical features, and improves the efficiency of finding effective features.

The foregoing describes specific embodiments of the present invention and the technical means employed. Many changes and modifications can be derived from the disclosure or teaching herein; these are to be regarded as equivalent changes made under the concept of the present invention, and where their effect does not go beyond the substantive spirit covered by the description and drawings, they shall be regarded as falling within the technical scope of the present invention.

In summary, according to the content disclosed above, the present invention achieves its intended purpose by providing a method and system for screening engineering data to obtain features, a method for screening engineering data multiple times to obtain features, a method for generating predictive models, and a system for characterizing engineering data online. Effective features can be evaluated quickly without complex algorithms, which gives the invention considerable industrial value, and an invention patent application is therefore filed in accordance with the law.

A, A1, A2, B1, B2, C1, C2, C3, C4: steps

Claims

A method for screening engineering data to obtain features, comprising:

A: Judge whether engineering data is numerical data or categorical data; if the engineering data is numerical data, go to a step A1; if the engineering data is judged to be categorical data, go to a step A2;

Wherein the steps A1 and A2 are as follows:

A1: Perform the Kolmogorov-Smirnov test on each category in the numerical data and its corresponding fields to confirm whether the engineering data is normally distributed; if the test result is a normal distribution, perform a step B1, and if the test result is a non-normal distribution, perform a step B2;

A2: Perform the Cramér's V coefficient test on each category in the categorical data and its corresponding fields to obtain a first threshold group composed of multiple thresholds, and then perform a step C1;

The steps B1 and B2 are as follows:

B1: Perform a t-test on each category in the numerical data and its corresponding fields to obtain a second threshold group composed of multiple thresholds, and then perform a step C2;

B2: Check the dispersion of the multiple fields corresponding to each category in the numerical data; if the dispersion of the multiple fields corresponding to each category is judged to exceed a preset value, perform the Kullback-Leibler (K-L) divergence test to obtain a third threshold group composed of multiple thresholds, and then perform a step C3; if the dispersion of the multiple fields corresponding to each category in the numerical data is judged not to exceed the preset value, perform the Mann-Whitney U test to obtain a fourth threshold group composed of multiple thresholds, and then perform a step C4;

The steps C1, C2, C3, and C4 are as follows:

C1: Check whether each of the categories reaches the corresponding threshold value in the first threshold value group, and define a category that has reached the threshold value as a feature;

C2: Check whether each of the categories reaches the corresponding threshold value in the second threshold value group, and define a category that has reached the threshold value as a feature;

C3: Check whether each of the categories reaches the corresponding threshold value in the third threshold value group, and define a category that has reached the threshold value as a feature;

C4: Check whether each of the categories reaches the corresponding threshold value in the fourth threshold value group, and define a category that meets the threshold value as a feature.

The method of claim 1, wherein the step A further comprises:

Judging whether engineering data is numerical data or categorical data; if the engineering data is numerical data, dividing the numerical data into two groups, performing a similarity test on the two groups, and retaining the significant categories in the numerical data and their corresponding fields before going to step A1;

If the engineering data is judged to be categorical data, dividing the categorical data into two groups, performing a similarity test on the two groups, and retaining the significant categories in the categorical data and their corresponding fields before going to step A2.

A method for performing multiple screening on engineering data to obtain features, comprising:

a. Perform data cleaning on engineering data, including:

Compensate for missing values in the fields in the engineering data;

b. Generate a plurality of field data groups from the engineering data, wherein the field data groups include: a time field data group, a domain field data group or an associated field data group;

Wherein the step of generating the domain field data group includes: disassembling and combining a numerical field data group or a categorical field data group to generate the domain field data group;

The step of generating the associated field data group includes:

Identifying, by a statistical test of association, those of the numerical field data group and the category field data group that have a significant positive or negative correlation; and

Performing operations and combinations on those with significant positive or negative correlations to generate the associated field data group;

c. Screen the time field data groups, domain field data groups or associated field data groups by the method described in claim 1 to obtain features that have been screened multiple times.

The method according to claim 3, further comprising after step c:

d1: perform adversarial learning on the screened features with at least one machine learning algorithm to verify the similarity of the screened features; and

d2: for the screened features whose similarity is not significant, adjust the threshold value of the similarity verification by grid search and perform step A1 again; retain the screened features whose similarity is significant.

The method according to claim 3, wherein in the step a, the step of compensating for the missing values of the fields in the engineering data comprises: filling with the median, filling with the minimum value, filling with the maximum value, filling with the mode, filling with a quartile, or filling with the value of another similar field; wherein the step a further includes removing those of the different fields corresponding to the categories whose values do not change.
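
The filling strategies listed above can be sketched with the Python standard library. This is only an illustration under stated assumptions: missing values are represented as `None`, the quartile strategy uses the first quartile, and the names `impute_missing` and `drop_constant_fields` are hypothetical. Filling from another similar field is not shown.

```python
import statistics


def impute_missing(values, strategy="median"):
    """Fill None entries in one field using a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from; leave the field as-is
    fillers = {
        "median": statistics.median,
        "min": min,
        "max": max,
        "mode": statistics.mode,
        # quantiles(n=4) returns the three quartile cut points; Q1 is an
        # arbitrary choice here and needs at least two observed values.
        "first_quartile": lambda xs: statistics.quantiles(xs, n=4)[0],
    }
    fill = fillers[strategy](observed)
    return [fill if v is None else v for v in values]


def drop_constant_fields(table):
    """Remove fields whose observed values never change (table: name -> list)."""
    return {
        name: column
        for name, column in table.items()
        if len({v for v in column if v is not None}) > 1
    }
```

For example, `impute_missing([1, None, 3, 5], "median")` fills the gap with 3, the median of the observed values.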

The method of claim 3, wherein the step b further comprises the step of generating a category field data group, which includes: merging the first record in the category field data group, merging the last record in the category field data group, merging the middle record in the category field data group, merging the most frequently occurring record in the category field data group, or merging the changes in the category field data group.
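
One way to read the merging options above is as a reduction of several records of one category field to a single representative. The sketch below assumes that reading; the function name `merge_category_records` and the option names are hypothetical, and "merging the changes" is interpreted as keeping each value at the point where it differs from the previous record.

```python
from collections import Counter


def merge_category_records(records, how="first"):
    """Reduce a non-empty list of category records according to `how`."""
    if not records:
        raise ValueError("records must not be empty")
    if how == "first":
        return records[0]
    if how == "last":
        return records[-1]
    if how == "middle":
        return records[len(records) // 2]
    if how == "most_frequent":
        return Counter(records).most_common(1)[0][0]
    if how == "changes":
        # Assumed reading: collapse consecutive repeats, keep every change.
        changes = [records[0]]
        for record in records[1:]:
            if record != changes[-1]:
                changes.append(record)
        return changes
    raise ValueError(f"unknown merge option: {how}")
```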

The method of claim 3, wherein the step b further comprises the step of generating the time field data group, which includes: taking, from the time corresponding to the fields, one of the following: year, month, day, week, hour, minute, second or every 15 minutes.
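
The time components named above can be extracted from a timestamp as follows. This is a sketch, assuming that "week" means the ISO week number and that "every 15 minutes" means the index of the 15-minute slot within the day; both readings, and the function name `time_fields`, are assumptions rather than details from the claims.

```python
from datetime import datetime


def time_fields(ts):
    """Split one timestamp into the time components used as time fields."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "week": ts.isocalendar()[1],  # ISO week number (assumed meaning)
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        # Index of the 15-minute slot within the day, 0 to 95 (assumed meaning).
        "quarter_hour": ts.hour * 4 + ts.minute // 15,
    }


fields = time_fields(datetime(2020, 12, 24, 13, 47, 5))
```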

The method of claim 3, wherein the step of generating the domain field data group further comprises:

The step of disassembling and combining the numerical field data group to generate the domain field data group includes:

Generating binary single-digit domain field data or decimal single-digit domain field data from the numerical field data group;

The steps of dismantling and combining the category field data group include:

splitting the strings in the category field data group to generate a plurality of split strings;

counting the number of occurrences of each of the split strings, leaving the split strings whose count is above a count threshold; and

Each of the remaining split strings is encoded, each of the encodings being distinct from each other.
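
The two disassembly steps above can be sketched as follows. For numerical fields, the sketch assumes that "single-digit" data means the individual digits of a value in base 2 or base 10, which is only one possible reading. For category fields, it assumes a fixed separator and integer codes. The names `to_digits` and `encode_split_strings`, the separator and the default threshold are all hypothetical.

```python
from collections import Counter


def to_digits(value, base):
    """Decompose a non-negative integer into its digits in the given base."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, digit = divmod(value, base)
        digits.append(digit)
    return digits[::-1]  # most significant digit first


def encode_split_strings(strings, sep="-", count_threshold=1):
    """Split strings, keep tokens seen more than count_threshold times, encode them."""
    tokens_per_row = [s.split(sep) for s in strings]
    counts = Counter(tok for row in tokens_per_row for tok in row)
    kept = sorted(tok for tok, n in counts.items() if n > count_threshold)
    # Each kept token gets its own integer, so the encodings are distinct.
    codes = {tok: i for i, tok in enumerate(kept)}
    encoded = [[codes[tok] for tok in row if tok in codes] for row in tokens_per_row]
    return codes, encoded


codes, encoded = encode_split_strings(["A-X1", "A-X2", "B-X1"])
```

In this example only `A` and `X1` occur more than once, so they are the only tokens that receive a code.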

The method of claim 3, wherein the step of generating the associated field data group further comprises:

Performing the correlation test on the numerical field data group, and performing the following operations on the numerical field data with significant positive or negative correlation to generate the associated field data group:

Adding, subtracting, multiplying, dividing, taking the logarithm, taking the angle of a trigonometric function; or

Performing the correlation test on the category field data group, and performing the following operations on the category field data groups with significant positive or negative correlation to generate the associated field data group:

Rearranging the strings; or encoding each of the rearranged strings to generate a group of associated field data, wherein each of the encodings are distinct from each other.
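
For the numerical branch above, the sketch below computes a Pearson correlation and, when it is strong enough, derives new fields by arithmetic. It makes several assumptions that the claims do not state: significance is approximated by a cutoff on the absolute correlation (`min_abs_r`) instead of a statistical test, the "angle of a trigonometric function" is read as `atan2`, both fields are non-constant, and the divisor and logarithm inputs are valid. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def associated_fields(x, y, min_abs_r=0.8):
    """Derive associated fields from x and y if they are strongly correlated."""
    # Assumption: a fixed |r| cutoff stands in for a significance test.
    if abs(pearson_r(x, y)) < min_abs_r:
        return {}
    return {
        "sum": [a + b for a, b in zip(x, y)],
        "difference": [a - b for a, b in zip(x, y)],
        "product": [a * b for a, b in zip(x, y)],
        "ratio": [a / b for a, b in zip(x, y)],  # assumes no zero in y
        "log_x": [math.log(a) for a in x],  # assumes x is strictly positive
        "angle": [math.atan2(a, b) for a, b in zip(x, y)],  # assumed reading
    }
```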

A method of producing a predictive model comprising:

X: mark the features generated by the method described in any one of claims 1 to 9 as a training group or a test group;

Y: mix the filtered features;

Z: Distinguish the training group or the test group by at least one machine learning algorithm, thereby establishing a prediction model.
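
Steps X, Y and Z can be sketched as labelling, shuffling and fitting a classifier that tries to tell the two groups apart. The claims do not name a specific algorithm, so the sketch substitutes a nearest-centroid classifier as a stand-in; the function name `build_group_classifier`, the label values and the fixed seed are assumptions. The reported accuracy is measured on the same rows used for fitting.

```python
import random


def build_group_classifier(train_rows, test_rows, seed=0):
    """Label, mix and fit a nearest-centroid model separating the two groups."""
    # X: mark each feature row as training group (0) or test group (1).
    labeled = [(row, 0) for row in train_rows] + [(row, 1) for row in test_rows]
    # Y: mix the labelled rows.
    random.Random(seed).shuffle(labeled)
    # Z: fit one centroid per group (stand-in for "at least one ML algorithm").
    centroids = {}
    for label in (0, 1):
        rows = [row for row, row_label in labeled if row_label == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        # Assign the row to the group whose centroid is closest.
        return min(
            centroids,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])),
        )

    accuracy = sum(predict(row) == label for row, label in labeled) / len(labeled)
    return predict, accuracy
```

An accuracy near 0.5 would suggest the two groups are hard to tell apart, while an accuracy near 1.0 would suggest their features differ clearly.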

A system for screening engineering data to obtain features, comprising:

a processor, the processor includes a judging unit, a statistical unit and a processing unit;

The judging unit is used for judging whether an engineering data is numerical data or category data;

Wherein, if the judging unit judges that the engineering data is numerical data, the statistical unit performs the Kolmogorov-Smirnov test on each category of the engineering data and its corresponding fields to confirm whether the engineering data is normally distributed;

If the judging unit judges that the engineering data is not numerical data, the statistical unit performs the Cramér's V coefficient test on each category of the engineering data and its corresponding fields to obtain a first threshold value group composed of a plurality of threshold values;

If the test result of the statistical unit is a normal distribution, the statistical unit performs a t-test on each category in the engineering data and its corresponding fields to obtain a second threshold value group composed of a plurality of threshold values;

Wherein, if the test result of the statistical unit is a non-normal distribution, the statistical unit tests the degree of dispersion of the fields corresponding to each category; if the statistical unit determines that the degree of dispersion of the fields corresponding to each category exceeds a preset value, the KL divergence (Kullback-Leibler divergence) test is performed to obtain a third threshold value group composed of a plurality of threshold values; if the statistical unit determines that the degree of dispersion of the fields corresponding to each category does not exceed the preset value, the Mann-Whitney U test is performed to obtain a fourth threshold value group composed of a plurality of threshold values;

Wherein the processing unit is used to separately check whether each of the categories has reached the corresponding threshold value in the first, second, third and fourth threshold value groups, and to define a category that has reached the threshold value as a feature.
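
The branching performed by the judging unit and the statistical unit can be summarised as a small routing function, shown below together with a discrete KL divergence. This is a sketch only: the function names, the string labels and the argument layout are hypothetical, the normality and dispersion results are assumed to be computed elsewhere, and the KL formula assumes two discrete probability distributions with `q[i] > 0` wherever `p[i] > 0`.

```python
import math


def choose_test(is_numeric, is_normal=False, dispersion=0.0, preset=0.0):
    """Return (test name, threshold value group) for one engineering data set."""
    if not is_numeric:
        return ("cramers_v", "first")
    if is_normal:  # outcome of the Kolmogorov-Smirnov normality check
        return ("t_test", "second")
    if dispersion > preset:
        return ("kl_divergence", "third")
    return ("mann_whitney_u", "fourth")


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    # Terms with p_i == 0 contribute nothing and are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


route = choose_test(is_numeric=True, is_normal=False, dispersion=2.5, preset=1.0)
```

Here `route` selects the KL divergence branch because the data is numerical, not normally distributed, and its dispersion exceeds the preset value.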

A system for characterizing engineering data online, comprising:

A server including a processor, the processor including a storage unit and a processing unit, wherein the storage unit is used to receive an original engineering data input from a client and to store the original engineering data; wherein the processing unit obtains features by reading the original engineering data from the storage unit and processing the original engineering data with the method described in any one of claims 1 to 9.