WO2018161824A1 - Abnormal data detection method and apparatus - Google Patents

Abnormal data detection method and apparatus

Info

Publication number
WO2018161824A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
piece
attribute values
text
pieces
Prior art date
Application number
PCT/CN2018/077507
Other languages
English (en)
French (fr)
Inventor
李刚毅
赵小光
Original Assignee
博彦科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 博彦科技股份有限公司
Publication of WO2018161824A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Definitions

  • the present invention relates to the field of data detection technology, and in particular to an abnormal data detection method and apparatus.
  • the training results are prone to overfitting, that is, they fit the characteristics of the original training data set more closely than those of the target data set.
  • the present invention provides an abnormal data detecting method and apparatus to solve the problems caused by the large data dimension for training existing in the prior art.
  • an abnormal data detecting method includes: acquiring text; extracting a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same; merging the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging; and performing machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data.
  • obtaining the text includes: obtaining data expressed in a natural language in the text.
  • extracting the plurality of pieces of data from the text comprises: converting the text data into a plurality of pieces of data for machine learning.
  • converting the text data into a plurality of pieces of data for machine learning comprises: normalizing the text data, wherein the normalization removes special characters from the text data, and/or converts uppercase letters in the text data to lowercase letters, and/or extracts the plurality of attribute values from the text data.
  • extracting the plurality of attribute values from the text data comprises: extracting a plurality of attribute values from the plurality of pieces of data for machine learning by word segmentation analysis, or extracting a plurality of attribute values from the plurality of pieces of data for machine learning by word frequency analysis.
  • merging the attribute values of each piece of data to obtain the new attribute values of each piece of data includes: merging the attribute values of each piece of data by principal component analysis to obtain the new attribute values of each piece of data.
  • merging the attribute values of each piece of data to obtain the new attribute values of each piece of data includes: directly combining the attribute values of each piece of data to obtain the new attribute values of each piece of data.
  • after merging the attribute values of each piece of data to obtain the new attribute values of each piece of data, the method includes: obtaining a priority for each new attribute value of each piece of data; selecting one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and performing machine learning on the selected one or more new attribute values to obtain a data model.
  • performing machine learning using the new attribute values of each piece of data to obtain a data model comprises: classifying each piece of data according to its new attribute values; and learning from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for distinguishing abnormal data.
  • An abnormal data detecting apparatus includes: an obtaining unit configured to acquire text; an extracting unit configured to extract a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same; a merging unit configured to merge the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging; and a learning unit configured to perform machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data.
  • the obtaining unit includes: a first acquiring module, configured to acquire data expressed in a natural language in the text.
  • the extracting unit comprises: a conversion module configured to convert the text data into a plurality of pieces of data for machine learning.
  • the extracting unit includes: an extracting module configured to, after the text data has been converted into a plurality of pieces of data for machine learning, extract a plurality of attribute values from the plurality of pieces of data for machine learning by word segmentation analysis or by word frequency analysis.
  • the merging unit comprises: an analysing module, configured to converge the attribute values of each piece of data by principal component analysis to obtain new attribute values of each piece of data.
  • the merging unit comprises: a merging module, configured to directly merge attribute values of each piece of data to obtain new attribute values of each piece of data.
  • the merging unit further includes: a second acquiring module configured to obtain a priority for each new attribute value of each piece of data after the attribute values of each piece of data have been merged to obtain the new attribute values; a filtering module configured to select one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and a learning module configured to perform machine learning on the selected one or more new attribute values to obtain a data model.
  • the learning unit includes: a classification module configured to classify each piece of data according to its new attribute values; and an obtaining module configured to learn from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for distinguishing abnormal data.
  • a storage medium including a stored program, wherein a device in which the storage medium is located is controlled to execute the above method while the program is running.
  • a processor for running a program wherein the program executes the above method while it is running.
  • according to an embodiment of the invention, an abnormal data detecting method acquires text; extracts a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same; merges the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging; and performs machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data.
  • FIG. 1 is a flow chart of an abnormal data detecting method according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram of an abnormal data detecting apparatus according to an embodiment of the present invention.
  • the embodiment of the invention provides an abnormal data detecting method.
  • FIG. 1 is a flow chart of an abnormal data detecting method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step S102: acquiring text;
  • Step S104: extracting a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same;
  • Step S106: merging the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging;
  • Step S108: performing machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data.
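The four steps can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the whitespace-separated record layout, the numeric/non-numeric merge, and the signature-count "model" are all hypothetical stand-ins chosen to keep the example short.

```python
from collections import Counter


def extract_records(text: str, width: int) -> list[list[str]]:
    """S104: one record per non-empty line, each with the same `width` fields."""
    records = [line.split() for line in text.splitlines() if line.strip()]
    # Keep only records that share the same attributes (same field count).
    return [record for record in records if len(record) == width]


def merge_attributes(record: list[str]) -> tuple[int, int]:
    """S106: reduce the attribute values to two new values.

    Hypothetical merge rule: count the numeric fields and the other fields.
    """
    numeric = sum(field.isdigit() for field in record)
    return numeric, len(record) - numeric


def learn_model(records: list[list[str]]) -> Counter:
    """S108: the toy 'model' counts how often each merged signature was seen."""
    return Counter(merge_attributes(record) for record in records)


def is_abnormal(model: Counter, record: list[str]) -> bool:
    """A record is flagged when its merged signature never occurred in training."""
    return model[merge_attributes(record)] == 0


# S102: the acquired text is assumed to be a plain string of log-like lines.
training_text = "S 21 3 ok\nT 22 4 ok\n"
model = learn_model(extract_records(training_text, width=4))
```

A real system would replace the signature counter with an actual learning algorithm; the sketch only shows how the data flows from text to records to merged values to a model.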
  • For example, with thousands of records, the method of this embodiment first sorts the attribute values of the records into different categories, that is, different attributes: the attribute corresponding to "S" is a letter, the attribute corresponding to a date value is a date, and the attribute corresponding to "11" is a number. The many different dimensions (each attribute adds one dimension) are then reduced to a few dimensions; that is, each data record is assigned new attributes, and each record has new attribute values under the new attribute division.
  • Because the attribute values of the data are reduced in dimension before training, this differs from the prior art, in which the data is used directly for machine learning training to obtain a detection model. This addresses the problems caused by the high dimensionality of training data in the prior art, improves training efficiency and the accuracy of the training result, and allows abnormal data to be detected more accurately even with no domain knowledge or only limited domain knowledge.
  • the data obtained in the above text may be in various forms of data.
  • the obtained data form may be data expressed in a natural language or data expressed in other languages.
  • Any data close to natural language can be checked for anomalies. For example, anomaly detection can be performed on a set of statistical data in table form and on data in the form of machine logs, which increases the generality of the anomaly detection and makes the method of this embodiment suitable for a variety of situations.
  • the object to be tested can be converted into data that is easy for machine learning.
  • the text data is converted into pieces of data for machine learning.
  • Text data is converted into data for machine learning mainly by normalizing the text data. The normalization has three optional implementations, which can be combined arbitrarily: the first removes special characters from the text data; the second changes uppercase letters in the text data to lowercase letters; the third extracts the plurality of attribute values from the text data.
  • The plurality of attribute values needs to be extracted from the text data because each piece of data may consist of a continuous run of digits and letters that cannot be interpreted as it stands.
  • Attribute values can also be extracted from multiple pieces of data.
  • There are two ways to extract attribute values.
  • multiple attribute values can be extracted from multiple pieces of data used for machine learning by means of word segmentation analysis.
  • Word segmentation analysis uses a hybrid of rule-based segmentation and statistical segmentation: a piece of data is treated as a sentence and split into multiple tokens.
  • A piece of data can be split by statistical word segmentation. For example, if a piece of data is "date21date3monthxyz", statistical segmentation divides it into "21", "3", "xyz", "date", and "month".
  • As another example, if the data is "GetAndPublishWebService@fail.", the text data is first normalized into data for machine learning: the uppercase letters are changed to lowercase and special characters such as "@" are removed, so the data becomes "getandpublishwebservicefail". Statistical word segmentation, without domain knowledge, then splits the data into "get", "and", "publish", "web", "service", and "fail".
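The normalization and splitting in this example can be sketched as follows. Note the swap: the source describes statistical word segmentation but gives no algorithm, so this sketch substitutes dictionary-based forward maximum matching, and the vocabulary passed in is a hypothetical word list, not something the source provides.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def segment(text: str, vocabulary: set[str]) -> list[str]:
    """Forward maximum matching: at each position take the longest known word.

    A character that starts no known word is emitted as a one-character token.
    """
    longest = max((len(word) for word in vocabulary), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary; a statistical segmenter would learn this from data.
VOCABULARY = {"get", "and", "publish", "web", "service", "fail"}
tokens = segment(normalize("GetAndPublishWebService@fail."), VOCABULARY)
```

With this vocabulary the call reproduces the split given in the text; a statistical segmenter would instead infer likely word boundaries from the corpus itself.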
  • the method for statistical word segmentation in this embodiment can support Chinese or English.
  • The original data can be split into one or more tokens or phrases. For example, the sentence "I like apples" can be segmented into "I", "like", and "apples", or statistically split into the phrases "I like" and "like apples".
  • Besides extracting attribute values by word segmentation, a plurality of attribute values can also be extracted from the pieces of data for machine learning by word frequency analysis: the words that repeat within each piece of data are counted, and a word with a high probability of repeated occurrence is extracted as an attribute value.
  • For example, if a piece of data is "date21date3monthxyz", word frequency analysis extracts "date", the word with the highest frequency of occurrence. This makes machine learning more convenient and increases its accuracy and efficiency.
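Word frequency analysis on this example can be sketched as below. The tokenization rule (maximal runs of letters or of digits) is an assumption made for illustration; the source does not say how words are delimited before counting.

```python
import re
from collections import Counter


def most_frequent_word(data: str) -> str:
    """Return the token that occurs most often in one piece of data."""
    # Assumed tokenization: maximal runs of letters or maximal runs of digits.
    tokens = re.findall(r"[a-z]+|[0-9]+", data.lower())
    if not tokens:
        raise ValueError("no tokens found in data")
    word, _count = Counter(tokens).most_common(1)[0]
    return word


top_word = most_frequent_word("date21date3monthxyz")
```

Here `top_word` is "date", since it is the only token that appears twice.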
  • In the first implementation, the attribute values of each piece of data are merged by principal component analysis to obtain the new attribute values of each piece of data.
  • Principal component analysis is a dimensionality reduction method for multidimensional data: it transforms many indicators into a few composite indicators.
  • The total variance of the variables is kept constant during the transformation. The first new variable has the largest variance and is called the first principal component; the second has the second largest variance, is uncorrelated with the first, and is called the second principal component.
  • Continuing in this way, p original variables yield p principal components.
  • Principal component analysis thus converts the original indicators into new indicators: the attributes of each piece of data become new attributes, fewer in number than the original attributes, and the attribute values of each piece of data become new attribute values.
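The variance statements above correspond to the standard formulation of principal component analysis, written out here as a sketch. The symbols $X$, $C$, $w_j$, $\lambda_j$, and $k$ are notation introduced for this note and do not appear in the source, which also does not say whether the columns are centred or scaled first.

```latex
\begin{aligned}
C &= \frac{1}{n-1} X^{\top} X, \qquad X \in \mathbb{R}^{n \times p} \text{ with centred columns},\\
C\, w_j &= \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \qquad w_i^{\top} w_j = \delta_{ij},\\
\mathrm{PC}_j &= X w_j, \qquad \operatorname{Var}(\mathrm{PC}_j) = \lambda_j, \qquad \operatorname{Cov}(\mathrm{PC}_i, \mathrm{PC}_j) = 0 \ (i \ne j),\\
\sum_{j=1}^{p} \lambda_j &= \operatorname{tr}(C) \quad \text{(the total variance is preserved)}.
\end{aligned}
```

Keeping only the first $k < p$ components gives the reduced data $X\,[w_1, \dots, w_k]$, which is how the number of attributes decreases.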
  • the second implementation is to directly combine the attribute values of each piece of data to obtain new attribute values for each piece of data.
  • Direct merging means merging similar attributes directly. For example, attributes whose values are numeric can be treated as one group of similar attributes, attributes whose values are times as another, and attributes whose values are text as a third; the attribute values belonging to each group are then combined, which reduces the dimension.
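Direct merging by value type can be sketched as follows. The three kinds (number, time, text) come from the text; the function names, the `YYYY-MM-DD` date format, and the choice to keep each group as a tuple are assumptions made for this illustration.

```python
from datetime import datetime


def kind_of(value: str) -> str:
    """Classify one attribute value as 'number', 'time', or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "time"
    except ValueError:
        return "text"


def merge_directly(values: list[str]) -> dict[str, tuple[str, ...]]:
    """Combine attribute values of the same kind into one new attribute each."""
    merged: dict[str, list[str]] = {}
    for value in values:
        merged.setdefault(kind_of(value), []).append(value)
    return {kind: tuple(group) for kind, group in merged.items()}


# Five original attribute values collapse into three new ones.
new_values = merge_directly(["S", "2017-03-10", "11", "165", "error"])
```

How the grouped values are then encoded for learning (concatenated, hashed, summed) is left open here, as the source does not specify it.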
  • Lossless feature merging effectively reduces the data dimension without degrading machine learning, and ensures that the retained dimensions remain as representative as possible, thereby increasing the accuracy of anomaly detection.
  • The new attribute values can also be filtered by priority before the data model is trained.
  • First, the priority of each new attribute value of each piece of data is obtained; then one or more new attribute values are selected from all the new attribute values according to their priorities; finally, machine learning is performed on the selected new attribute values to obtain the data model.
  • The priority of a new attribute value may reflect how well that value represents the characteristics of the data, or priorities may be assigned according to the situation. For example, if the analysis of a set of data is mainly concerned with data containing the word "error", then data whose word attribute contains "error" can be given the highest priority, and machine learning is performed on that attribute value to obtain the data model.
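Priority-based selection can be sketched as a simple ranking. The attribute names and priority numbers below are hypothetical; the source does not define how priorities are scored, only that higher-priority values are kept.

```python
def select_by_priority(priorities: dict[str, float], keep: int) -> list[str]:
    """Return the names of the `keep` highest-priority new attributes."""
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    return ranked[:keep]


# Hypothetical priorities: the word attribute (where "error" would appear)
# is ranked highest, as in the example in the text.
selected = select_by_priority({"word": 3.0, "number": 1.0, "time": 2.0}, keep=2)
```

Only the attributes in `selected` would then be passed on to training.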
  • the time and frequency of occurrence of the abnormal data can also be used as a criterion for screening the abnormal data.
  • each piece of data can be classified according to the new attribute value of each piece of data.
  • Data of the same class is learned according to its occurrence time and frequency, and the occurrence time and frequency serve as one of the bases for distinguishing abnormal data.
  • The following example illustrates occurrence time as a basis for distinguishing abnormal data: when a set of data appears repeatedly within a certain period of time, a mathematical model of that behaviour can be learned, and when the set of data suddenly stops appearing it can immediately be judged abnormal.
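The "sudden disappearance" check can be sketched with a gap threshold. This is a deliberately simple stand-in, not the mathematical model the source refers to: the function names, the use of the longest observed gap, and the `tolerance` factor are all assumptions.

```python
def learn_max_gap(timestamps: list[float]) -> float:
    """Learn the longest gap (in seconds) seen between consecutive occurrences."""
    if len(timestamps) < 2:
        raise ValueError("need at least two occurrences to learn a gap")
    ordered = sorted(timestamps)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def has_disappeared(last_seen: float, now: float, max_gap: float,
                    tolerance: float = 2.0) -> bool:
    """Flag the class as abnormal once it has been silent for too long."""
    return now - last_seen > tolerance * max_gap


# A class of messages that normally recurs about once a minute.
max_gap = learn_max_gap([0.0, 60.0, 120.0, 185.0])
```

With these observations the learned gap is 65 seconds, so the class is flagged only after more than 130 seconds of silence.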
  • Table 1 is the data table to be detected by an abnormal data detecting method according to an embodiment of the present invention. As shown in Table 1, each row represents one piece of data. Each piece of data has many columns, that is, many attributes, such as Gender and Height; each attribute has a corresponding attribute value, so each piece of data is composed of multiple attribute values.
  • The attribute values of the first piece of data, with ID 1, are 1, 165, 55, 1, and 1, and the corresponding attributes are Gender, Height, Age, City, and Occupation.
  • The City and Occupation columns of the data table in Table 1 may be replaced with numbers.
  • The attribute values of each piece of data in Table 1 are reduced in dimension by principal component analysis to obtain the new attribute values (new features) of each piece of data. Table 2 is the data table after dimensionality reduction by an abnormal data detecting method according to an embodiment of the present invention, as shown in Table 2:
  • this embodiment changes 5 columns into 3 columns, which are PC1, PC2, and PC3, respectively.
  • the new feature (new attribute value) is a linear transformation of the original column.
  • the specific transformation formula is as follows:
  • PC1 = -0.3085328*Gender + 0.3260416*Height + 0.5555709*Age + 0.5013550*City - 0.4883529*Occupation;
  • PC2 = 0.3574484*Gender - 0.5767465*Height + 0.4192386*Age - 0.3488463*City - 0.4920766*Occupation;
  • PC3 = -0.87057667*Gender - 0.43415427*Height - 0.09021272*Age - 0.20623074*City - 0.05419287*Occupation;
  • The attributes of each piece of data become PC1, PC2, and PC3. Because each new attribute value is obtained by transforming the original attribute values, that is, it is composed from the original attributes, the new attributes retain the information of the original attributes.
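The transformation formulas can be applied to a record as in the sketch below. The coefficients are copied from the formulas in the text; the names `LOADINGS` and `project` are hypothetical, and the formulas are applied to the raw values exactly as written, since the source does not state whether the columns were centred or scaled beforehand.

```python
# Coefficients of the three linear combinations given in the text.
LOADINGS = {
    "PC1": {"Gender": -0.3085328, "Height": 0.3260416, "Age": 0.5555709,
            "City": 0.5013550, "Occupation": -0.4883529},
    "PC2": {"Gender": 0.3574484, "Height": -0.5767465, "Age": 0.4192386,
            "City": -0.3488463, "Occupation": -0.4920766},
    "PC3": {"Gender": -0.87057667, "Height": -0.43415427, "Age": -0.09021272,
            "City": -0.20623074, "Occupation": -0.05419287},
}


def project(record: dict[str, float]) -> dict[str, float]:
    """Map five original attribute values to the three new ones (PC1..PC3)."""
    return {
        component: sum(weight * record[name] for name, weight in weights.items())
        for component, weights in LOADINGS.items()
    }


# The first piece of data (ID 1) as described in the text.
first_record = {"Gender": 1, "Height": 165, "Age": 55, "City": 1, "Occupation": 1}
new_values = project(first_record)
```

The three coefficient vectors have unit length and are mutually orthogonal (to the precision printed), which is what one expects of principal component directions.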
  • the new attribute value of each piece of data in the changed data table is machine-learned to obtain a data model, and the data model is used to distinguish whether the data extracted from the text is abnormal data.
  • Table 4 is a data table to be detected according to an abnormal data detecting method according to an embodiment of the present invention. First, it is determined whether the attribute value of each group of data in the data table to be detected in Table 4 is text, number or time. The text, number, and time columns of each set of data are directly combined to obtain new attribute values for each set of data. The combined data table is shown in Table 5.
  • the new attribute value of each piece of data in the changed data table is machine-learned to obtain a data model, and the data model is used to distinguish whether the data extracted from the text is abnormal data.
  • the embodiment of the invention further provides an abnormal data detecting device.
  • the device can realize its function through an acquisition unit, an extraction unit, a convergence unit, and a learning unit.
  • It should be noted that the abnormal data detecting apparatus of this embodiment may be used to perform the abnormal data detecting method provided by the embodiments of the present invention, and the method may likewise be performed by the apparatus.
  • an abnormal data detecting apparatus includes:
  • the obtaining unit 22 is configured to acquire text
  • the extracting unit 24 is configured to extract a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to an attribute of the data, and the attributes of each piece of data are the same;
  • the merging unit 26 is configured to merge the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging;
  • the learning unit 28 is arranged to perform machine learning using a new attribute value of each piece of data to obtain a data model, wherein the data model is used to distinguish whether the data extracted from the text is abnormal data.
  • the obtaining unit comprises: a first obtaining module configured to acquire data expressed in a natural language in the text.
  • the extraction unit comprises a conversion module configured to convert the text data into a plurality of pieces of data for machine learning.
  • the extracting unit comprises: an extracting module configured to, after the text data has been converted into a plurality of pieces of data for machine learning, extract a plurality of attribute values from the plurality of pieces of data for machine learning by word segmentation analysis or by word frequency analysis.
  • the merging unit comprises: an analysing module configured to merge the attribute values of each piece of data by principal component analysis to obtain new attribute values of each piece of data.
  • the merging unit comprises: a merging module, configured to directly merge attribute values of each piece of data to obtain new attribute values of each piece of data.
  • the merging unit further includes: a second acquiring module configured to obtain a priority for each new attribute value of each piece of data after the attribute values of each piece of data have been merged to obtain the new attribute values; a screening module configured to select one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and a learning module configured to perform machine learning on the selected one or more new attribute values to obtain the data model.
  • the learning unit includes: a classification module configured to classify each piece of data according to its new attribute values; and an obtaining module configured to learn from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for distinguishing abnormal data.
  • the abnormal data detecting device corresponds to an abnormal data detecting method, so the beneficial effects will not be described again.
  • some optional embodiments in the foregoing embodiments have the following technical effects as compared with the prior art detection:
  • Training results are prone to overfitting (ie, training results are closer to the characteristics of the training dataset than to the characteristics of the target dataset).
  • An embodiment of the present invention provides a storage medium that includes a stored program, wherein, when the program runs, the device in which the storage medium is located is controlled to execute the foregoing method.
  • Simply cutting data dimensions reduces the amount of computation, but if a dimension that is representative of the training target is removed, the accuracy or reliability of the training result also decreases. An effective method is therefore needed that reduces the dimensionality of the data while ensuring that the retained dimensions remain as representative as possible (that is, lossless dimensionality reduction).
  • This embodiment utilizes lossless feature merging to reduce the data dimension while not reducing the effects of machine learning.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a logical functional division. In actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • The technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes a number of instructions that cause a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • The foregoing storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.
  • the solution provided by the embodiment of the present invention can be applied to the process of detecting data.
  • the embodiment of the present invention solves the problems caused by the large data dimension for training existing in the prior art, improves the training efficiency, and improves the accuracy of the training result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

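The present invention discloses an abnormal data detection method and apparatus. The method includes: acquiring text; extracting a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same; merging the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging; and performing machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data. The invention solves the problems caused in the prior art by the high dimensionality of training data, and improves training efficiency while improving the accuracy of the training result.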

Description

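Abnormal data detection method and apparatus
This application claims priority to Chinese patent application No. 201710145015.0, entitled "Abnormal data detection method and apparatus", filed with the Chinese Patent Office on March 10, 2017, the entire contents of which are incorporated herein by reference.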
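Technical Field
The present invention relates to the field of data detection technology, and in particular to an abnormal data detection method and apparatus.
Background
In the prior art, detecting anomalies in near-natural-language text without domain knowledge, or with only limited domain knowledge, is usually subject to limitations. For machine logs, for example, the aim is that when an anomaly appears in a machine log, a model obtained by machine learning can be used to detect it. For machine learning, if the dimensionality of the training data is too large, the following adverse effects arise:
1) the amount of computation rises significantly, the computation cost increases, and the training efficiency of machine learning decreases;
2) the training result is prone to overfitting, that is, it fits the characteristics of the original training data set more closely than those of the target data set.
3) the prior art simply cuts data dimensions; although this reduces the amount of computation, it sometimes removes dimensions that are representative of the training target, which lowers the accuracy or reliability of the training result.
No effective solution has yet been proposed for the problems caused by the high dimensionality of training data in the prior art.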
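Summary of the Invention
The present invention provides an abnormal data detection method and apparatus to solve the problems caused in the prior art by the high dimensionality of training data.
According to one aspect of the embodiments of the present invention, an abnormal data detection method is provided, including: acquiring text; extracting a plurality of pieces of data from the text, wherein each of the plurality of pieces of data is composed of a plurality of attribute values, each attribute value corresponds to one attribute of the data, and the attributes of each piece of data are the same; merging the attribute values of each piece of data to obtain new attribute values of each piece of data, wherein the number of new attribute values of each piece of data is less than the number of attribute values of that piece of data before merging; and performing machine learning using the new attribute values of each piece of data to obtain a data model, wherein the data model is used to distinguish whether data extracted from text is abnormal data.
Optionally, acquiring the text includes: acquiring data expressed in natural language in the text.
Optionally, extracting a plurality of pieces of data from the text includes: converting the text data into a plurality of pieces of data for machine learning.
Optionally, converting the text data into a plurality of pieces of data for machine learning includes: normalizing the text data, wherein the normalization removes special characters from the text data, and/or changes uppercase letters in the text data to lowercase letters, and/or extracts the plurality of attribute values from the text data.
Optionally, extracting the plurality of attribute values from the text data includes: extracting a plurality of attribute values from the plurality of pieces of data for machine learning by word segmentation analysis, or extracting a plurality of attribute values from the plurality of pieces of data for machine learning by word frequency analysis.
Optionally, merging the attribute values of each piece of data to obtain the new attribute values of each piece of data includes: merging the attribute values of each piece of data by principal component analysis to obtain the new attribute values of each piece of data.
Optionally, merging the attribute values of each piece of data to obtain the new attribute values of each piece of data includes: directly combining the attribute values of each piece of data to obtain the new attribute values of each piece of data.
Optionally, after merging the attribute values of each piece of data to obtain the new attribute values of each piece of data, the method includes: obtaining a priority for each new attribute value of each piece of data; selecting one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and performing machine learning on the selected one or more new attribute values to obtain a data model.
Optionally, performing machine learning using the new attribute values of each piece of data to obtain a data model includes: classifying each piece of data according to its new attribute values; and learning from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for distinguishing abnormal data.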
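According to another aspect of the embodiments of the present invention, an abnormal data detection apparatus is provided. The apparatus includes: an acquisition unit, configured to acquire text; an extraction unit, configured to extract multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes; an aggregation unit, configured to aggregate the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation; and a learning unit, configured to perform machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data.
Optionally, the acquisition unit includes: a first acquisition module, configured to acquire data expressed in natural language in the text.
Optionally, the extraction unit includes: a conversion module, configured to convert the text data into multiple pieces of data for machine learning.
Optionally, the extraction unit includes: an extraction module, configured to, after the text data has been converted into multiple pieces of data for machine learning, extract multiple attribute values from those pieces of data by word segmentation analysis or by word frequency analysis.
Optionally, the aggregation unit includes: an analysis module, configured to aggregate the attribute values of each piece of data by principal component analysis to obtain the new attribute values of that piece of data.
Optionally, the aggregation unit includes: a merging module, configured to directly merge the attribute values of each piece of data to obtain the new attribute values of that piece of data.
Optionally, the aggregation unit further includes: a second acquisition module, configured to obtain the priority of the new attribute values of each piece of data after the attribute values of each piece of data have been aggregated into new attribute values; a selection module, configured to select one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and a learning module, configured to perform machine learning on the selected one or more new attribute values to obtain the data model.
Optionally, the learning unit includes: a classification module, configured to classify each piece of data according to its new attribute values; and an acquisition module, configured to learn from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for identifying abnormal data.
According to another aspect of the embodiments of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the above method.
According to another aspect of the embodiments of the present invention, a processor is provided. The processor is used to run a program, wherein the program performs the above method when it runs.
According to the embodiments of the invention, an abnormal data detection method acquires text; extracts multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes; aggregates the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation; and performs machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data. The invention addresses the problems caused in the prior art by training data with too many dimensions, improving training efficiency while also improving the accuracy of the training result.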
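Brief Description of the Drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of an abnormal data detection method according to an embodiment of the present invention;
FIG. 2 is a structural diagram of an abnormal data detection apparatus according to an embodiment of the present invention.
Detailed Description
It should be noted that, where no conflict arises, the embodiments of this application and the features of those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, for the purposes of the embodiments of the present invention described herein. In addition, the terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.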
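An embodiment of the present invention provides an abnormal data detection method. FIG. 1 is a flowchart of an abnormal data detection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102: acquire text;
Step S104: extract multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes;
Step S106: aggregate the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation;
Step S108: perform machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data.
For example, when a machine log contains several thousand records, running machine learning training on them directly raises two problems. One is that the amount of computation is too large. The other is that, because there are so many records, the relatively rare unusual records are easily excluded during training, and these are often exactly the problematic records in which anomalies need to be detected. Each record is made up of different attribute values. If a record is, say, S=F(x)/date xyz11…, its attribute values include S, date, 11, =F(x)/, xyz, and so on. With the method of this embodiment, the attribute values of the several thousand records are divided into different categories, that is, different attributes: for example, S corresponds to the attribute "letter", date to the attribute "date", and 11 to the attribute "number". The many dimensions (each additional attribute is one additional dimension) are then reduced to a few dimensions; in other words, each record is assigned new attributes and has new attribute values under them. The new attribute values may be, for example, S=F(x)/xyz and date11. Machine learning is performed on the new attribute values to obtain a data model, and the data model is used to judge whether each record is abnormal data.
The above steps reduce the dimensionality of the attribute values of the data, which differs from the prior-art approach of training a detection model directly on the data. This addresses the problems caused in the prior art by training data with too many dimensions and improves training efficiency while also improving the accuracy of the training result. The embodiment can also be used to detect abnormal data fairly accurately when there is no domain knowledge or only limited domain knowledge.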
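The data acquired from the text may take many forms. In an optional implementation, the acquired data may be data expressed in natural language, or data expressed in another language.
In this way, anomaly detection can be performed on any natural-language-based data, such as a set of statistical tabular data. It can be applied to data in table form and to data in machine-log form, which makes the detection more general and lets the method of this embodiment fit many situations.
After text data expressed in natural language has been acquired, the object under inspection can be converted into data suited to machine learning; that is, in an optional implementation, the text data is converted into multiple pieces of data for machine learning.
Once the text data has been converted into data suited to machine learning by the above process, training a model becomes easier, which increases the efficiency of machine learning.
The text data is converted into data suited to machine learning mainly by normalizing it. Normalization has three optional implementations, which may be combined freely: the first removes special characters from the text data; the second converts uppercase letters in the text data to lowercase letters; the third extracts the multiple attribute values from the text data.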
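The first two normalization options can be sketched as follows. This is only a minimal illustration: the function name `normalize` is hypothetical, and treating every character other than an ASCII letter or digit as a "special character" is an assumption, since the description does not define the term.

```python
import re


def normalize(record: str) -> str:
    """Lowercase a record and drop special characters.

    Assumption: a "special character" is anything that is not an
    ASCII letter or digit (this also removes whitespace).
    """
    lowered = record.lower()                    # uppercase -> lowercase
    return re.sub(r"[^a-z0-9]", "", lowered)    # strip special characters


print(normalize("GetAndPublishWebService@fail."))
# -> getandpublishwebservicefail
```

A real deployment would likely need a wider character class, for example to keep Chinese text, which the description also says is supported.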
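When the text data is converted into multiple pieces of data for machine learning, the multiple attribute values need to be extracted from the text data. A piece of data may consist of an unbroken run of digits and letters, in which case its attribute values cannot be read off directly; and where there is no domain knowledge, or only limited domain knowledge, the data text obtained often has no attribute values at all. In these cases attribute values can still be extracted from the pieces of data, in one of two ways.
In an optional implementation, multiple attribute values are extracted from the multiple pieces of data for machine learning by word segmentation analysis. Word segmentation analysis treats a piece of data as a sentence and splits it into multiple tokens using rule-based segmentation, statistical segmentation, or hybrid segmentation. Rule-based segmentation is explained first. Given a piece of data such as "error=21date3monthxyz", matching words are looked up in a predefined segmentation dictionary; if the dictionary contains "error", "date", and "month", these words are cut out and extracted as features, that is, as attribute values of the data.
Rule-based segmentation applies when a segmentation dictionary already exists. Sometimes a group of data contains words that are not in the dictionary, which is the case of having no domain knowledge at all; statistical segmentation can then be used to split a piece of data. For example, the piece of data "date21date3monthxyz" is split by statistical segmentation into "21", "3", "xyz", "date", "month", and so on. As another example, take the piece of data "GetAndPublishWebService@fail.". The text data is first normalized into data for machine learning, becoming "getandpublishwebservicefail": the uppercase letters have become lowercase and the special character @ has been removed. Then, with no domain knowledge, statistical segmentation splits it into "get", "and", "publish", "web", "service", "fail".
The statistical segmentation method of this embodiment supports Chinese and English. During statistical segmentation, the original data may be split into groups of one or more words. For example, 我喜欢苹果 ("I like apples") may be segmented into 我, 喜欢, 苹果, or may be decomposed by statistical segmentation into 我喜欢 and 喜欢苹果.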
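One common way to realize dictionary-based segmentation is forward maximum matching, sketched below. The description does not name a specific matching algorithm, so this choice, the function name `dictionary_segment`, and the decision to keep unmatched characters as leftover tokens are all assumptions made for illustration.

```python
from typing import List, Set


def dictionary_segment(record: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching against a predefined dictionary.

    At each position the longest dictionary word is cut out; characters
    that match no dictionary word are collected into leftover tokens.
    """
    max_len = max((len(word) for word in dictionary), default=0)
    tokens: List[str] = []
    leftover = ""
    i = 0
    while i < len(record):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(record) - i), 0, -1):
            candidate = record[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            if leftover:                 # flush unmatched characters
                tokens.append(leftover)
                leftover = ""
            tokens.append(match)
            i += len(match)
        else:
            leftover += record[i]
            i += 1
    if leftover:
        tokens.append(leftover)
    return tokens


print(dictionary_segment("error=21date3monthxyz", {"error", "date", "month"}))
# -> ['error', '=21', 'date', '3', 'month', 'xyz']
```

The dictionary words come out as attribute values, while the leftover tokens ("=21", "3", "xyz") can be kept as further values or handed to a statistical segmenter.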
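Besides word segmentation analysis, multiple attribute values can also be extracted from the multiple pieces of data for machine learning by word frequency analysis: the words that recur in each piece of data are counted, and words with a high probability of recurring are extracted as attribute values. For example, for the piece of data "date21date3monthxyz", word frequency analysis extracts "date", the word with the highest frequency. This makes machine learning easier and increases its accuracy and efficiency.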
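Word frequency analysis can be sketched with a simple counter. The tokenization used here (runs of letters or runs of digits) and the threshold `min_count` are assumptions for illustration; the description does not specify how candidate words are found before they are counted.

```python
import re
from collections import Counter
from typing import List


def frequent_tokens(record: str, min_count: int = 2) -> List[str]:
    """Return tokens that occur at least `min_count` times in a record.

    Assumption: candidate tokens are maximal runs of letters or of digits.
    """
    tokens = re.findall(r"[a-z]+|[0-9]+", record.lower())
    counts = Counter(tokens)
    return [token for token, n in counts.items() if n >= min_count]


print(frequent_tokens("date21date3monthxyz"))
# -> ['date']
```

With this naive tokenization "monthxyz" stays in one piece; separating it would need the dictionary or statistical segmentation described earlier.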
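In the above steps, there are two ways to aggregate the attribute values of each piece of data into new attribute values. The first is to aggregate the attribute values of each piece of data by principal component analysis. Principal component analysis is a dimensionality reduction method that converts many indicators into a few composite indicators. The mathematical transformation keeps the total variance of the variables unchanged: the first new variable has the largest variance and is called the first principal component; the second has the next largest variance, is uncorrelated with the first, and is called the second principal component; and so on, so that n variables yield n principal components. In this embodiment, principal component analysis converts the original indicators of each piece of data into new indicators, that is, the attributes of each piece of data become new attributes. The number of new attributes is smaller than the number of original attributes, and the attribute values of each piece of data become new attribute values.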
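In standard notation, the properties just described can be written as below. The symbols (X for the centered data matrix with n records and p attributes, S for its covariance matrix, w_k for the k-th unit eigenvector, z_k for the k-th principal component) are introduced here for illustration and do not appear in the original text.

```latex
% X: n-by-p data matrix whose columns (attributes) have been centered
S = \frac{1}{n-1} X^{\top} X, \qquad
S\,w_k = \lambda_k w_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \qquad
w_j^{\top} w_k = \delta_{jk}

% k-th principal component (new attribute) and its properties
z_k = X w_k, \qquad
\operatorname{Var}(z_k) = \lambda_k, \qquad
\operatorname{Cov}(z_j, z_k) = 0 \ (j \ne k), \qquad
\sum_{k=1}^{p} \lambda_k = \operatorname{tr}(S)
```

Dimensionality reduction keeps only the first m < p components z_1, ..., z_m, which together retain the largest possible share of the total variance tr(S).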
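The second way is to directly merge the attribute values of each piece of data to obtain its new attribute values. Direct merging combines similar attributes: for example, attributes in numeric form can be treated as one kind of similar attribute, attributes in time form as another, and the attributes of text-form values as another. The attribute values of these similar attributes are then merged, which achieves dimensionality reduction.
With this merging approach, lossless feature merging effectively reduces the data dimensions without weakening the effect of machine learning, and the retained dimensions remain as representative as possible, which increases the accuracy of anomaly detection.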
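Direct merging can be sketched as grouping attribute values by kind and joining each group. The helper names, the ISO date pattern used to recognize time values, and joining values with a space are assumptions; the description only states that similar attributes (text, number, time) are merged.

```python
import re
from typing import Dict, List


def kind_of(value: str) -> str:
    """Classify a value as 'time', 'number', or 'text' (assumed patterns)."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):    # hypothetical date format
        return "time"
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return "number"
    return "text"


def merge_by_kind(values: List[str]) -> Dict[str, str]:
    """Merge all values of the same kind into one new attribute value."""
    groups: Dict[str, List[str]] = {}
    for value in values:
        groups.setdefault(kind_of(value), []).append(value)
    # Joining keeps every original value, so no information is dropped.
    return {kind: " ".join(group) for kind, group in groups.items()}


print(merge_by_kind(["login", "failed", "3", "17", "2017-03-10"]))
# -> {'text': 'login failed', 'number': '3 17', 'time': '2017-03-10'}
```

Five original attribute values become three new ones, one per kind.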
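After the attribute values of each piece of data have been aggregated into new attribute values, the new attribute values can additionally be filtered by priority before the data model is built. In an optional implementation, the priority of the new attribute values of each piece of data is obtained first; one or more new attribute values are then selected from all the new attribute values according to the priority of each; finally, machine learning is performed on the selected one or more new attribute values to obtain the data model.
The priority may favor the new attribute values that best represent the characteristics of the data, or a priority may be assigned according to the situation. For example, if anomalies in data containing the word "error" are of particular interest when a group of data is analyzed, data whose word attribute begins with "error" can be given the highest priority, and machine learning is performed on those attribute values to obtain the data model.
Selecting by priority in this way filters out some attribute values and reduces the dimensionality of the data. Performing machine learning on the selected new attribute values to obtain the data model speeds up computation and lowers its cost.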
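Priority-based selection can be sketched as ranking and truncating. The scoring rule below (values starting with "error" rank highest) follows the example in the text, while the function names and the `keep` parameter are hypothetical.

```python
from typing import List


def priority(value: str) -> int:
    """Hypothetical rule from the example: 'error...' values rank highest."""
    return 1 if value.startswith("error") else 0


def select_by_priority(values: List[str], keep: int) -> List[str]:
    """Keep the `keep` highest-priority new attribute values."""
    # sorted() is stable, so equal-priority values keep their original order.
    ranked = sorted(values, key=priority, reverse=True)
    return ranked[:keep]


print(select_by_priority(["date11", "error21", "xyz", "errorcode"], keep=2))
# -> ['error21', 'errorcode']
```

Only the selected values would then be passed to the learning step.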
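When machine learning is performed on the new attribute values of each piece of data to obtain the data model, the occurrence time and frequency of data can also serve as criteria for identifying abnormal data. For example, each piece of data can be classified according to its new attribute values, and data of the same class can be learned according to occurrence time and frequency to obtain the data model, with the occurrence time and the frequency serving as one of the bases for identifying abnormal data. Occurrence time as a criterion: when a group of data recurs over a certain period, a mathematical model can be learned from it, and when that group of data suddenly stops appearing, it can immediately be judged abnormal. Frequency as a criterion: a problem in a machine log sometimes shows up as the same data recurring repeatedly, so a mathematical model for identifying abnormal data is learned from the frequency with which a group of data appears, and when that frequency suddenly changes, the data can be judged abnormal according to the model.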
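The frequency criterion can be sketched with a very simple model: learn the mean and spread of how often one class of data appears per time window, then flag a window whose count deviates strongly. The description does not specify the learning algorithm, so this mean-and-deviation model, the function names, and the threshold of 3 deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List


def fit_frequency_model(counts_per_window: List[int]) -> Dict[str, float]:
    """Learn the usual per-window frequency of one class of data."""
    return {"mean": mean(counts_per_window), "std": pstdev(counts_per_window)}


def is_anomalous(model: Dict[str, float], count: int, threshold: float = 3.0) -> bool:
    """Flag a window whose count is far from the learned frequency."""
    spread = max(model["std"], 1e-9)     # avoid division by zero
    return abs(count - model["mean"]) / spread > threshold


model = fit_frequency_model([10, 11, 9, 10, 10])
print(is_anomalous(model, 10))   # -> False (usual frequency)
print(is_anomalous(model, 0))    # -> True  (data suddenly stops appearing)
print(is_anomalous(model, 30))   # -> True  (frequency suddenly jumps)
```

A count of zero covers the occurrence-time example, where data that used to recur suddenly no longer appears.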
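An optional embodiment is described below.
Table 1 is a table of data to be checked in an abnormal data detection method according to an embodiment of the present invention, as shown in Table 1:
Figure PCTCN2018077507-appb-000001
Table 1
In this data table, each row represents one group of data. Each group of data has many columns, that is, many attributes, such as Gender and Height. Every attribute of the data has a corresponding attribute value, and each piece of data consists of multiple attribute values. For example, the first piece of data, with ID 1, has the attribute values 1, 165, 55, 1, 1, which correspond to the attributes Gender, Height, Age, City, and Occupation respectively. Columns such as City and Occupation in Table 1 can be encoded as numbers.
The attribute values of each group of data in Table 1 are reduced in dimension by principal component analysis to obtain the new attribute values (new features) of each piece of data. Table 2 is a dimension-reduced data table in an abnormal data detection method according to an embodiment of the present invention, as shown in Table 2:
Figure PCTCN2018077507-appb-000002
Table 2
Through dimensionality reduction, this embodiment turns 5 columns into 3 columns, namely PC1, PC2, and PC3. The new features (new attribute values) are linear transformations of the original columns, with the following formulas: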
PC1=-0.3085328*Gender+0.3260416*Height+0.5555709*Age+0.5013550*City-0.4883529*Occupation;
PC2=0.3574484*Gender-0.5767465*Height+0.4192386*Age-0.3488463*City-0.4920766*Occupation;
PC3=-0.87057667*Gender-0.43415427*Height-0.09021272*Age-0.20623074*City-0.05419287*Occupation;
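The three formulas are plain linear combinations, so they can be applied with a few lines of code. The sketch below copies the coefficients from the formulas; the names `LOADINGS`, `project`, and `dot` are hypothetical. The coefficient vectors have unit length and are mutually orthogonal, as expected of principal component directions. Whether the tabulated results were computed from raw or from centered and scaled attribute values is not stated in the text, so no expected output is shown.

```python
ATTRIBUTES = ("Gender", "Height", "Age", "City", "Occupation")

# Coefficients copied from the PC1, PC2, and PC3 formulas.
LOADINGS = {
    "PC1": (-0.3085328, 0.3260416, 0.5555709, 0.5013550, -0.4883529),
    "PC2": (0.3574484, -0.5767465, 0.4192386, -0.3488463, -0.4920766),
    "PC3": (-0.87057667, -0.43415427, -0.09021272, -0.20623074, -0.05419287),
}


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project(record):
    """Map a dict of the five original attribute values to PC1..PC3."""
    values = [record[name] for name in ATTRIBUTES]
    return {pc: dot(weights, values) for pc, weights in LOADINGS.items()}


# The first record described for Table 1 (raw values, no scaling applied).
first = {"Gender": 1, "Height": 165, "Age": 55, "City": 1, "Occupation": 1}
new_values = project(first)          # three new attribute values
print(sorted(new_values))            # -> ['PC1', 'PC2', 'PC3']
```

If the original analysis standardized the columns first, the same standardization would have to be applied to `first` before calling `project`.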
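The transformed data table is shown in Table 3:
Figure PCTCN2018077507-appb-000003
Table 3
The attributes of each group of data become PC1, PC2, and PC3. Because the new attribute values are obtained by transforming the original attribute values, that is, they are composed from the original attributes, the new attributes retain the informational characteristics of the original attributes.
Machine learning is performed on the new attribute values of each piece of data in the transformed data table to obtain a data model, and the data model is used to determine whether data extracted from text is abnormal data.
Another optional embodiment is described below.
Take Table 4 as an example. Table 4 is a table of data to be checked in an abnormal data detection method according to an embodiment of the present invention. First, it is determined whether each attribute value of each group of data in Table 4 is text, a number, or a time. The text columns, the number columns, and the time columns of each group of data are then directly merged, kind by kind, to obtain the new attribute values of each group of data. The merged data table is shown in Table 5.
Figure PCTCN2018077507-appb-000004
Table 4
Figure PCTCN2018077507-appb-000005
Table 5
Machine learning is performed on the new attribute values of each piece of data in the transformed data table to obtain a data model, and the data model is used to determine whether data extracted from text is abnormal data.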
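An embodiment of the present invention further provides an abnormal data detection apparatus. The apparatus can perform its functions through an acquisition unit, an extraction unit, an aggregation unit, and a learning unit. It should be noted that the abnormal data detection apparatus of the embodiment of the present invention can be configured to perform the abnormal data detection method provided by the embodiment of the present invention, and the abnormal data detection method of the embodiment of the present invention can also be performed by the abnormal data detection apparatus provided by the embodiment of the present invention.
FIG. 2 is a schematic diagram of an abnormal data detection apparatus according to an embodiment of the present invention. As shown in FIG. 2, the abnormal data detection apparatus includes:
an acquisition unit 22, configured to acquire text;
an extraction unit 24, configured to extract multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes;
an aggregation unit 26, configured to aggregate the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation;
a learning unit 28, configured to perform machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data.
In an optional implementation, the acquisition unit includes: a first acquisition module, configured to acquire data expressed in natural language in the text.
In an optional implementation, the extraction unit includes: a conversion module, configured to convert the text data into multiple pieces of data for machine learning.
In an optional implementation, the extraction unit includes: an extraction module, configured to, after the text data has been converted into multiple pieces of data for machine learning, extract multiple attribute values from those pieces of data by word segmentation analysis or by word frequency analysis.
In an optional implementation, the aggregation unit includes: an analysis module, configured to aggregate the attribute values of each piece of data by principal component analysis to obtain the new attribute values of that piece of data.
In an optional implementation, the aggregation unit includes: a merging module, configured to directly merge the attribute values of each piece of data to obtain the new attribute values of that piece of data.
In an optional implementation, the aggregation unit further includes: a second acquisition module, configured to obtain the priority of the new attribute values of each piece of data after the attribute values of each piece of data have been aggregated into new attribute values; a selection module, configured to select one or more new attribute values from all the new attribute values according to the priority of each new attribute value; and a learning module, configured to perform machine learning on the selected one or more new attribute values to obtain the data model.
In an optional implementation, the learning unit includes: a classification module, configured to classify each piece of data according to its new attribute values; and an acquisition module, configured to learn from data of the same class according to occurrence time and frequency to obtain the data model, wherein the occurrence time and the frequency serve as one of the bases for identifying abnormal data.
The above apparatus embodiment corresponds to the abnormal data detection method, so its beneficial effects are not repeated here. From the analysis of the above embodiments, some of their optional implementations have the following technical effects compared with prior-art detection: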
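For machine learning, if the data used for training has too many dimensions, the following adverse effects arise:
1) the amount of computation rises significantly, the computing cost increases, and the training efficiency drops;
2) the training result is prone to overfitting (that is, it reflects the characteristics of the training data set rather than those of the target data set).
An embodiment of the present invention provides a storage medium. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the above method.
Simply cutting data dimensions reduces the amount of computation, but if the dimensions removed are representative of the training target, the precision or reliability of the training result also drops. An effective method is therefore needed that reduces the dimensionality of the data while keeping the retained dimensions as representative as possible (that is, lossless dimensionality reduction). This embodiment uses lossless feature merging to reduce the data dimensions without weakening the effect of machine learning.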
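It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the related description of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for instance multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial Applicability
The solution provided by the embodiments of the present invention can be applied to the process of detecting data. The embodiments of the present invention address the problems caused in the prior art by training data with too many dimensions, improving training efficiency while also improving the accuracy of the training result.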

Claims (10)

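  1. An abnormal data detection method, including:
    acquiring text;
    extracting multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes;
    aggregating the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation;
    performing machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data.
  2. The method according to claim 1, wherein acquiring the text includes:
    acquiring data expressed in natural language in the text.
  3. The method according to claim 2, wherein extracting multiple pieces of data from the text includes:
    converting the text data into multiple pieces of data for machine learning.
  4. The method according to claim 3, wherein converting the text data into multiple pieces of data for machine learning includes:
    normalizing the text data, wherein the normalization removes special characters from the text data, and/or converts uppercase letters in the text data to lowercase letters, and/or extracts the multiple attribute values from the text data.
  5. The method according to claim 3, wherein extracting the multiple attribute values from the text data includes:
    extracting multiple attribute values from the multiple pieces of data for machine learning by word segmentation analysis, or
    extracting multiple attribute values from the multiple pieces of data for machine learning by word frequency analysis.
  6. The method according to any one of claims 1 to 5, wherein aggregating the attribute values of each piece of data to obtain the new attribute values of that piece of data includes:
    aggregating the attribute values of each piece of data by principal component analysis to obtain the new attribute values of that piece of data.
  7. The method according to any one of claims 1 to 5, wherein aggregating the attribute values of each piece of data to obtain the new attribute values of that piece of data includes:
    directly merging the attribute values of each piece of data to obtain the new attribute values of that piece of data.
  8. An abnormal data detection apparatus, including:
    an acquisition unit, configured to acquire text;
    an extraction unit, configured to extract multiple pieces of data from the text, wherein each of the multiple pieces of data consists of multiple attribute values, each attribute value corresponds to one attribute of that piece of data, and all pieces of data have the same attributes;
    an aggregation unit, configured to aggregate the attribute values of each piece of data to obtain new attribute values of that piece of data, wherein the number of new attribute values of each piece of data is smaller than the number of its attribute values before aggregation;
    a learning unit, configured to perform machine learning on the new attribute values of each piece of data to obtain a data model, wherein the data model is used to determine whether data extracted from text is abnormal data.
  9. The apparatus according to claim 8, wherein the acquisition unit includes:
    a first acquisition module, configured to acquire data expressed in natural language in the text.
  10. A storage medium, the storage medium including a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to perform the method according to any one of claims 1 to 7.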
PCT/CN2018/077507 2017-03-10 2018-02-28 异常数据检测方法和装置 WO2018161824A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710145015.0 2017-03-10
CN201710145015.0A CN107122394B (zh) 2017-03-10 2017-03-10 异常数据检测方法和装置

Publications (1)

Publication Number Publication Date
WO2018161824A1 true WO2018161824A1 (zh) 2018-09-13

Family

ID=59717930

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077507 WO2018161824A1 (zh) 2017-03-10 2018-02-28 异常数据检测方法和装置

Country Status (2)

Country Link
CN (1) CN107122394B (zh)
WO (1) WO2018161824A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122394B (zh) * 2017-03-10 2020-02-14 博彦科技股份有限公司 异常数据检测方法和装置
CN109657947B (zh) * 2018-12-06 2021-03-16 西安交通大学 一种面向企业行业分类的异常检测方法
CN110225207B (zh) * 2019-04-29 2021-08-06 厦门快商通信息咨询有限公司 一种融合语义理解的防骚扰方法、系统、终端及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023927A (zh) * 2013-01-10 2013-04-03 西南大学 一种稀疏表达下的基于非负矩阵分解的入侵检测方法及系统
CN103235803A (zh) * 2013-04-17 2013-08-07 北京京东尚科信息技术有限公司 一种从文本中获取物品属性值的方法和装置
CN104919458A (zh) * 2013-01-11 2015-09-16 日本电气株式会社 文本挖掘设备、文本挖掘系统、文本挖掘方法和记录介质
CN106447383A (zh) * 2016-08-30 2017-02-22 杭州启冠网络技术有限公司 跨时间、多维度异常数据监测的方法和系统
CN107122394A (zh) * 2017-03-10 2017-09-01 博彦科技股份有限公司 异常数据检测方法和装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7673189B2 (en) * 2006-02-06 2010-03-02 International Business Machines Corporation Technique for mapping goal violations to anamolies within a system
CN105553998B (zh) * 2015-12-23 2019-02-01 中国电子科技集团公司第三十研究所 一种网络攻击异常检测方法
CN105868256A (zh) * 2015-12-28 2016-08-17 乐视网信息技术(北京)股份有限公司 处理用户行为数据的方法和系统

Also Published As

Publication number Publication date
CN107122394A (zh) 2017-09-01
CN107122394B (zh) 2020-02-14

Similar Documents

Publication Publication Date Title
Hill et al. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study
CN108628971B (zh) 不均衡数据集的文本分类方法、文本分类器及存储介质
CN110297988B (zh) 基于加权LDA和改进Single-Pass聚类算法的热点话题检测方法
CN111639177B (zh) 文本提取方法和装置
CN105912576B (zh) 情感分类方法及系统
Vadivukarassi et al. Sentimental analysis of tweets using Naive Bayes algorithm
US9817812B2 (en) Identifying word collocations in natural language texts
US20150205862A1 (en) Method and device for recognizing and labeling peaks, increases, or abnormal or exceptional variations in the throughput of a stream of digital documents
CN110019820B (zh) 一种病历中主诉与现病史症状时间一致性检测方法
WO2018161824A1 (zh) 异常数据检测方法和装置
WO2017096777A1 (zh) 文献归一方法、文献搜索方法及对应装置、设备和存储介质
CN104850617A (zh) 短文本处理方法及装置
CN107515849A (zh) 一种成词判定模型生成方法、新词发现方法及装置
Rigaud et al. What do we expect from comic panel extraction?
CN112183093A (zh) 一种企业舆情分析方法、装置、设备及可读存储介质
CN107122395B (zh) 数据抽样方法和装置
US9594757B2 (en) Document management system, document management method, and document management program
Al-Azani et al. Audio-textual Arabic dialect identification for opinion mining videos
EP3477505B1 (en) Fingerprint clustering for content-based audio recogntion
CN110738047A (zh) 基于图文数据与时间效应的微博用户兴趣挖掘方法及系统
CN108021595B (zh) 检验知识库三元组的方法及装置
CN108475265B (zh) 获取未登录词的方法与装置
JP5463873B2 (ja) マルチメディア分類システム及びマルチメディア検索システム
CN103034657A (zh) 文档摘要生成方法和装置
Zendah et al. Detecting Significant Events in Arabic Microblogs using Soft Frequent Pattern Mining.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18763392

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18763392

Country of ref document: EP

Kind code of ref document: A1