US20220004843A1 - Convolutional neural network (CNN)-based anomaly detection - Google Patents
Convolutional neural network (CNN)-based anomaly detection
- Publication number: US20220004843A1 (application US 17/373,600)
- Authority: US (United States)
- Prior art keywords: similarity, input, field, values, vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/04 — Architecture, e.g. interconnection topology (G—Physics; G06—Computing; G06N—Computing arrangements based on specific computational models; G06N3/00—Biological models; G06N3/02—Neural networks)
- G06N3/045 — Combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
Definitions
- the technology disclosed relates generally to data cleaning apparatus and corresponding methods for the detection of anomalous data stored in a fielded database or as computer files, and in particular relates to implementing a convolutional neural network that identifies anomalous field values and suggests correct field values to replace them.
- Appendix A is a presentation showing examples of data structures, algorithms, neural network architectures, and experimental results at a high-level.
- Appendix B is a report showing examples of data structures, algorithms, neural network architectures, and experimental results in greater detail.
- FIG. 1 shows aspects of a system in which a machine learning system detects anomalous field values and suggests correct field values to replace anomalous field values.
- FIG. 2A depicts one implementation of an input generator for anomaly detection network of FIG. 1 using six similarity calculators.
- FIG. 2B shows calculations for determining factor vectors using two example similarity measures to generate an input matrix.
- FIG. 3 shows processing of the input matrix of FIG. 2B by anomaly detection network of FIG. 1 to identify anomalous field values.
- FIG. 4A shows examples of main components of a one-layer anomaly detection network.
- FIG. 4B shows examples of main components of a two-layer anomaly detection network.
- FIG. 5A is an example of an input generator for suggestion network of FIG. 1 using six similarity calculators.
- FIG. 5B shows calculations for determining factor vectors using two example similarity measures to generate an input matrix.
- FIG. 6 shows processing of the input matrix of FIG. 5B by suggestion network of FIG. 1 to suggest correct field value for the anomalous field value.
- FIG. 7 shows examples of main components of a one-layer suggestion network of FIG. 1 .
- FIG. 8 is a simplified block diagram of a computer system that can be used to implement the machine learning system of FIG. 1 .
- a convolutional neural network is trained to identify anomalous field values in a fielded dataset. The CNN speeds up the data cleaning process by automatically detecting anomalous field values, reducing the effort required to clean the data.
- the CNN uses hints based on six similarity measures, which cover both semantic and syntactic aspects of fielded data, to identify anomalous field values in a particular input field.
- the six similarity measures are semantic similarity, syntactical similarity, soundex similarity, length similarity, frequency of occurrence similarity and format similarity.
- a key element of the technology disclosed is extraction of hints from the fielded data for the six similarity measures and organizing these hints in the form of vectors that can be processed by the CNN.
- the training data for the CNN is generated artificially. Separate datasets are generated for five of the similarity measures listed above, excluding the frequency of occurrence similarity measure. Each dataset contains a separate collection of similar and dissimilar unique field values. A training dataset is generated by randomly picking field values from the similar and dissimilar field values. To generate frequency similarity data, each of the unique field values in the field is multiplied by a randomly generated positive integer. For production, real-world testing data is used from public data sources. The higher the quality of input data for training, the higher the accuracy of the CNN in production.
- Data cleaning of a data field is composed of two processes: anomaly detection and anomaly suggestion.
- a separate CNN is used for each of these two processes: an anomaly detection network and a suggestion network.
- the respective CNNs are trained separately.
- the anomaly detection CNN can reach 97% detection for anomalies and 88% detection for non-anomalies on the training set, and 60% and 83% detection respectively on the testing set.
- the suggestion CNN can reach 60% accuracy on the training set and 70% on the testing set.
- There are two ways to apply the anomaly detection and anomaly suggestion networks to identify anomalous data and clean it.
- the first option is automatic detection-suggestion flow.
- the anomalous field values identified by the anomaly detection network are automatically passed to the suggestion network.
- the suggestion network identifies a correct field value for each anomalous field value and replaces the anomalous field value with the correct field value.
- the second option to combine the anomaly detection and suggestion networks is to get expert feedback after the detection of anomalous field values. This avoids passing false positives to the suggestion network, and helps the user understand the results and provide feedback.
- the second option requires more interaction from an expert, but generally achieves higher accuracy.
- FIG. 1 shows an architectural level schematic of a system in accordance with an implementation. Because FIG. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description.
- FIG. 1 The discussion of FIG. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.
- FIG. 1 includes the system 100 .
- the system 100 includes a machine learning system 110 , a network 120 , a training database 142 , a testing database 144 and an administrative interface 132 .
- the machine learning system 110 includes an input generator 112 , an anomaly detection network 114 and a suggestion network 116 .
- the input generator 112 generates input data for the anomaly detection network 114 and the suggestion network 116 .
- the data that is consumed by the input generator 112 is provided by the training database 142 during training and by the testing database 144 during production.
- the anomaly detection network 114 determines which field values in a set of field values for a particular field in a fielded dataset are anomalous.
- the suggestion network 116 determines that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset.
- Both the anomaly detection network 114 and the suggestion network 116 can run on a variety of hardware processors such as graphics processing units (GPUs).
- Neural network-based models involve computationally intensive methods, such as convolutions and matrix-based operations. GPUs are well suited for these types of computations. Recently, specialized hardware is being developed to efficiently train neural network models.
- the input generator 112 receives data from the training database 142 and processes it to generate input for the anomaly detection network 114 and the suggestion network 116 .
- the input generator 112 receives data from the testing database 144 and processes it to produce input for the anomaly detection network 114 and the suggestion network 116 .
- the administrative interface 132 can be used during training or production to provide input data to the input generator 112 from respective databases.
- the administrative interface 132 can also be used during training to control model parameters of the anomaly detection network 114 and the suggestion network 116 .
- the machine learning system 110 , the training database 142 , the testing database 144 and the administrative interface 132 are in communication with each other via the network(s) 120 .
- FIG. 2A illustrates operation of the input generator 112 to generate input data for the anomaly detection network 114 during training.
- the input generator 112 takes a dataset 202 as input.
- the dataset 202 is generated from the training database 142 and during production, the dataset 202 is generated from the testing database 144 .
- the dataset 202 is composed of field values. In the more common use case, these field values are word tokens. In other implementations, field values can be character tokens or phrase tokens. An example of a word token is “John” and an example of a phrase token is “J. P. Morgan”.
- the dataset 202 is a fielded dataset which means it can have fields such as “First Name”, “Last Name”, “Company Name”, “Country” etc., and data is organized as field values in the fields.
- the input generator 112 generates factor vectors 230 for a plurality of similarity measures (also referred to as linguistic similarity measures).
- the six similarity measures are semantic similarity, syntactical similarity, soundex similarity, length similarity, frequency of occurrence similarity and format similarity. In other implementations fewer than six or more than six similarity measures can be used.
- a factor vector calculator 210 contains similarity measure calculators for each similarity measure.
- FIG. 2A shows six similarity measure calculators corresponding to the six similarity measures listed above: a semantic similarity calculator 212, a syntactical similarity calculator 214, a soundex similarity calculator 216, a length similarity calculator 218, a frequency of occurrence similarity calculator 220, and a format similarity calculator 222.
- Semantic similarity between two field values of a particular input field of the fielded dataset is calculated using a Word2Vec, WordNet or GloVe model.
- Two given input field values of the input field are semantically similar to each other if their meanings are similar in the vocabulary. For example, in a field listing fruit names, a first field value "apple" is semantically similar to a second field value "orange", but not to a third field value "airplane". Every field value of the particular field is compared with every other field value in the same field to determine the semantic similarity measure between them. For this purpose, two field values are passed to one of the selected models listed above.
- the model represents each field value as a word vector in a low dimensional embedding space (typically between 100 and 1000 dimensions).
- An inner product or dot product of two multidimensional vector representations of the field values is calculated by the selected model.
- the results of the calculations are used to fill the n×n similarity matrix for the semantic similarity measure, where "n" is the number of unique field values (also referred to as words, word phrases or tokens) in the field. A minimal sketch of this computation follows.
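- Below is a minimal NumPy sketch of this computation. The embed lookup is an assumption standing in for whichever embedding model (Word2Vec, GloVe, or a WordNet-derived space) is selected; it is not an API named in the patent.

```python
import numpy as np

def semantic_similarity_matrix(unique_values, embed):
    """Fill the n x n semantic similarity matrix, where entry O_ab is the
    inner product of the embedding vectors of the a-th and b-th unique
    field values. `embed(value)` is an assumed lookup returning a word
    vector in a low-dimensional embedding space (100-1000 dimensions)."""
    vectors = [embed(v) for v in unique_values]
    n = len(vectors)
    sim = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            sim[a, b] = float(np.dot(vectors[a], vectors[b]))  # O_ab
    return sim
```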
- Syntactic similarity between two field values of a particular field of the fielded dataset is calculated using a so-called "bag of letters" algorithm.
- the syntactic similarity between two input field values indicates how similar they are in terms of the arrangement of the characters in the field values. For example, a first field value "gardenia" is syntactically similar to a second field value "grandniece" in a particular input field. The first input value "gardenia" is not syntactically similar to a third input value "hoff" in the same input field.
- the syntactic similarity is calculated between every two field values of a given field. For this purpose, two field values are given as input to the "bag of letters" algorithm.
- In a first step of the algorithm, all upper-case letters in the two field values are converted to their respective lower-case letters.
- the characters in both field values are then converted to equivalent American Standard Code for Information Interchange (ASCII) codes.
- the algorithm represents each input field value as a multidimensional vector in a character embedding space.
- the dimensionality of the embedding space is equal to the number of ASCII codes. Multiple occurrences of the same character are represented as magnitude on that character's respective dimension.
- the "bag of letters" algorithm calculates an inner product or dot product of the multidimensional vector representations of the two input field values. The results of the calculations are used to fill the n×n similarity matrix for the syntactic similarity measure, where "n" is the number of unique field values in the particular field. A sketch of the algorithm appears below.
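- A minimal sketch of the "bag of letters" algorithm as described: lower-casing, ASCII encoding, a 128-dimensional character vector with multiplicity as magnitude, and an inner product. Any normalization (such as cosine scaling) would be an added assumption.

```python
import numpy as np

def bag_of_letters(value):
    """Represent a field value as a vector over the 128 ASCII codes;
    repeated characters add magnitude on their respective dimension."""
    vec = np.zeros(128)
    for ch in value.lower():      # step 1: convert to lower case
        code = ord(ch)            # step 2: ASCII code of the character
        if code < 128:
            vec[code] += 1.0      # multiplicity becomes magnitude
    return vec

def syntactic_similarity(v1, v2):
    """Inner product of the bag-of-letters vectors of two field values."""
    return float(np.dot(bag_of_letters(v1), bag_of_letters(v2)))
```

On this measure, "gardenia" and "grandniece" share many letters and score 10 against each other, while "gardenia" and "hoff" share none and score 0, matching the example above.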
- Soundex similarity between two field values of a particular field indicates how similar they are in terms of their respective pronunciations in the English language. For example, a first field value "abel" is soundex similar to a second field value "flail" in a particular field. The first field value "abel" is not soundex similar to a third input field value "rowe" in the same field of the fielded dataset.
- a custom algorithm is used to determine the soundex similarity between two input field values by determining corresponding positions of similar sounding English alphabets in respective field values.
- In other implementations, soundex algorithms such as Metaphone or Double Metaphone can be used to determine soundex similarity between two input field values.
- a soundex similarity score of each field value is calculated with every other field value in a particular input field of the fielded dataset.
- the results of the calculations are used to fill the n×n similarity matrix for the soundex similarity measure, where "n" is the number of unique words or tokens in the input field. A sketch using the classic Soundex coding follows.
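- Since the custom algorithm is not spelled out here, the sketch below uses the classic four-character Soundex code as a stand-in and scores two values by the fraction of matching code positions; the scoring choice is an assumption, not the patented algorithm.

```python
def soundex(value):
    """Classic Soundex: keep the first letter, encode later consonants
    as digits, collapse adjacent duplicate codes, treat h/w as
    transparent, and pad the code to four characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(ch for ch in value.lower() if ch.isalpha())
    if not word:
        return "0000"
    out, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != last:
            out += digit
        if ch not in "hw":        # h and w do not separate equal codes
            last = digit
    return (out + "000")[:4]

def soundex_similarity(v1, v2):
    """Fraction of matching positions in the two Soundex codes."""
    return sum(a == b for a, b in zip(soundex(v1), soundex(v2))) / 4.0
```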
- Format similarity between two input field values indicates how similar they are in terms of their formats. For example, a first field value "0x12345" is similar in format to a second field value "0x32749" in a particular field. The first field value "0x12345" is not format similar to a third field value "1234567" in the same input field of the fielded dataset.
- a custom format matching algorithm is given two field values of a particular field as input. The format matching algorithm calculates a format similarity score for the two field values by comparing characters at corresponding positions in the two field values. A format similarity score for each field value is calculated with every other field value in a particular input field of the fielded dataset. The results of the calculations are used to fill the n×n similarity matrix for the format similarity measure, where "n" is the number of unique field values in the particular field. One plausible scoring scheme is sketched below.
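- One plausible way to score character-by-character format similarity: map each character to a coarse class (digit, letter, or literal symbol) and compare classes at corresponding positions. The class set and the length penalty are assumptions, not details from the patent; a production scorer might, for example, weight literal prefixes such as "0x" more heavily.

```python
def char_class(ch):
    """Map a character to a coarse format class (assumed classes)."""
    if ch.isdigit():
        return "9"
    if ch.isalpha():
        return "A"
    return ch  # punctuation and other symbols are compared literally

def format_similarity(v1, v2):
    """Compare format classes at corresponding positions; mismatched
    lengths reduce the score."""
    if not v1 or not v2:
        return 0.0
    matches = sum(char_class(a) == char_class(b) for a, b in zip(v1, v2))
    return matches / max(len(v1), len(v2))
```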
- the length similarity measure is used to identify field values that are longer or shorter than the average length of field values in a particular field. For example, consider the field values "firm", "lamb", "coon", "key", "septuagenarian", "x" in a particular input field. The length similarity measure will identify the field values "septuagenarian" and "x" as anomalous because their respective lengths are much longer or shorter than the average length of field values in the particular input field.
- the algorithm to calculate the length similarity measure first generates the length of each field value. Following this, the algorithm determines a Z-score (also referred to as a standard score) using the mean and standard deviation. The Z-score is a statistical measure of a score's relationship to the mean in a group of scores.
- Z-scores are normalized using the largest Z-score value.
- the normalized values of Z-scores range between "0" and "1" inclusive.
- a normalized Z-score of zero means the score is the same as the mean, i.e., the corresponding input field value lies at the mean.
- a higher normalized Z-score value implies the score is farther away from the mean, meaning the corresponding field value is either too short or too long compared to other field values in the particular field.
- a normalized Z-score of "1" means the corresponding input field value is farthest from the mean and consequently has the highest inverse similarity measure.
- the field values with the highest inverse similarity measure values are recommended as anomalous. A sketch of this calculation follows.
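- A minimal sketch of the length measure, using the example values from the passage above:

```python
import numpy as np

def length_factor_vector(values):
    """Normalized Z-scores of field-value lengths. A score of 0 means
    the value's length sits at the mean; scores near 1 mark values far
    longer or shorter than average (highest inverse similarity)."""
    lengths = np.array([len(v) for v in values], dtype=float)
    std = lengths.std()
    z = np.abs(lengths - lengths.mean()) / (std if std > 0 else 1.0)
    zmax = z.max()
    return z / (zmax if zmax > 0 else 1.0)   # normalize into [0, 1]

scores = length_factor_vector(["firm", "lamb", "coon", "key",
                               "septuagenarian", "x"])
# "septuagenarian" scores 1.0 and "x" scores next highest,
# matching the example in the text.
```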
- the frequency similarity measure identifies anomalous field values using frequency counts of the unique field values in a particular field of the fielded dataset. Typically, in a field of a fielded dataset, unique values occur multiple times. If a particular field value occurs only once or a very few times, then it is most likely an anomaly. For example, suppose the field values and their corresponding frequencies of occurrence in a particular input field are: "J.P. Morgan" (135), "Goldman Sachs" (183), "Citi" (216), "Morgan Stanley" (126), "City" (1). The anomalous field value is "City", which is the least frequent value with a frequency of occurrence of "1". A sketch mirroring the length measure appears below.
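- The frequency measure can be sketched the same way; per the passage on Z-score-based measures later in this description, the frequency counts are reduced to Z-scores rather than an n×n matrix. Treating the normalized Z-score as the inverse-similarity score is an assumption about the exact scoring.

```python
from collections import Counter
import numpy as np

def frequency_factor_vector(column):
    """One score per unique field value from its frequency of
    occurrence in the column; the lone "City" (frequency 1) among
    frequent company names stands far from the mean frequency and
    therefore scores highest."""
    counts = Counter(column)
    uniques = list(counts)
    freqs = np.array([counts[u] for u in uniques], dtype=float)
    std = freqs.std()
    z = np.abs(freqs - freqs.mean()) / (std if std > 0 else 1.0)
    zmax = z.max()
    return uniques, z / (zmax if zmax > 0 else 1.0)
```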
- the six similarity calculators generate six factor vectors 230 corresponding to the six similarity measures.
- the six factor vectors are denoted by the variables “y”, “p”, “q”, “s”, “t”, and “z” respectively.
- Each factor vector comprises as many elements as there are unique field values in a field of the fielded dataset.
- Each field value is also referred to as a word, word phrase or a token.
- Each of the six factor vectors 230 consists of “n” values or elements.
- the first factor vector consists of the values y_1 to y_n
- the second factor vector consists of the values p_1 to p_n
- the third factor vector consists of the values q_1 to q_n
- the fourth factor vector consists of the values s_1 to s_n
- the fifth factor vector consists of the values t_1 to t_n
- the sixth factor vector consists of the values z_1 to z_n.
- the six example factor vectors 230 shown in FIG. 2A form an input matrix, which is shown in FIG. 2B and referred to by numeral 232.
- the input matrix 232 is given as input to the anomaly detection network 114 by the input generator 112 .
- FIG. 2B shows further details of how factor vectors are calculated for similarity measures listed above.
- factor vector calculations of two similarity measures are shown as an example.
- a person skilled in the art will appreciate that other factor vectors may be calculated similarly, either using matrix multiplications or scalar interactions.
- the examples shown in FIG. 2B rely on matrix multiplications.
- the first similarity matrix 212b is for the semantic similarity measure. It is an n×n matrix where n is the number of unique field values in a field of the fielded dataset.
- An inner product (also referred to as a dot product) is calculated between the word embedding vector of each unique field value in the field and the word embedding vector of every other unique field value in the same field of the fielded dataset.
- These embedding vectors are generated by either Word2Vec word embedding space, GloVe word embedding space, WordNet or any other low dimensional embedding space.
- the inner product between the word embedding vectors of two unique field values produces a scalar value, represented by the variable O_ab, where "a" is the row index and "b" is the column index of the semantic similarity matrix 212b.
- the elements of the factor vector are row averages of corresponding rows of the similarity matrix. For example, the value of the element y_1 of a factor vector FV_1 is calculated by taking the average of all the values in the first row of the semantic similarity matrix 212b, i.e., O_11 to O_1n.
- the factor vector FV_1 is composed of elements y_1 to y_n, each of which is calculated by performing similar row average operations on corresponding rows of the similarity matrix.
- FIG. 2B shows the calculation of factor vector FV_6 using the format similarity matrix 218b.
- For similarity measures such as semantic and syntactical similarity, dot products or inner products of vector representations of the unique field values of each field in the fielded dataset are calculated with every other unique field value in the same field to generate scalars.
- Row averages or weighted row averages of the scalars are used to calculate the values of the elements of the factor vectors.
- For similarity measures such as soundex and format similarity, scalar values are generated by comparing every unique field value in a field with every other unique field value in the same field of the fielded dataset using various underlying algorithms.
- the scalar values are arranged in n×n matrices.
- the factor vector values are calculated in the same manner as above by taking row averages or weighted row averages of the scalars in corresponding rows of the similarity matrix.
- For some similarity measures, n×n matrices are not generated; rather, a Z-score for each unique field value in a particular field is calculated and used as the elements of the respective factor vector. Examples of such similarity measures include the length and frequency of occurrence similarity measures explained above.
- the factor vectors for all similarity measures are arranged column-wise to generate an input matrix 232 .
- the input matrix 232 in FIG. 2B has six factor vectors FV_1 to FV_6 corresponding to the six similarity measures described above.
- Each row of the input matrix 232 corresponds to a unique field value in a particular field of the fielded dataset 202 .
- an element of a factor vector for a given similarity measure specifies a likelihood that a corresponding unique field value in a field of the fielded dataset is anomalous in a context of the given similarity measure and conditioned on respective similarity measure values of other unique field values in the particular field of the fielded dataset for the given similarity measure.
- a row of the input matrix 232 represents a vector that encodes a likelihood that a corresponding unique field value in a particular field of the fielded dataset is anomalous in a context of the plurality of similarity measures and conditionable on respective similarity measure values of other unique field values of the particular field of the fielded dataset for the plurality of linguistic similarity measures. A sketch of the input matrix assembly follows.
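- Putting the pieces together: a sketch, under the assumptions above, of reducing each n×n similarity matrix to a factor vector by row averaging and arranging the factor vectors column-wise into the n×k input matrix (k = 6 here).

```python
import numpy as np

def factor_vector(sim_matrix):
    """Row averages of an n x n similarity matrix: element y_1 is the
    mean of O_11..O_1n, and so on for each row."""
    return sim_matrix.mean(axis=1)

def build_input_matrix(sim_matrices, zscore_vectors):
    """Column-wise arrangement of factor vectors into the n x k input
    matrix. Matrix-based measures (semantic, syntactic, soundex,
    format) are row-averaged; Z-score-based measures (length,
    frequency) contribute their vectors directly."""
    columns = [factor_vector(m) for m in sim_matrices]
    columns += [np.asarray(v, dtype=float) for v in zscore_vectors]
    return np.stack(columns, axis=1)   # shape (n, k)
```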
- FIG. 3 illustrates processing of the input matrix 232 by the convolutional neural network (CNN) of the anomaly detection network 114.
- the CNN can be a one-layer network or a two-layer network depending on the implementation.
- In a one-layer network, one set of filters is applied to the input matrix 232.
- In a two-layer network, two sets of filters are applied to the input matrix 232.
- In other implementations, additional layers can be added to the CNN.
- 64 filters are row-wise convolved over the input matrix 232.
- a row-wise convolution of a filter on the input matrix 232 results in an evaluation vector (also referred to as a feature map). Note that the size of each filter is 1×k, where k is the number of similarity measures.
- filter 1 (312) in FIG. 3 is convolved over the first row (y_1, p_1, q_1, s_1, t_1, z_1) of the input matrix 232 to generate a scalar a_0 at index position "0" of the evaluation vector EV_1 (322).
- Convolving filter 1 (312) over the second row (y_2, p_2, q_2, s_2, t_2, z_2) of the input matrix 232 generates a scalar a_1 at index position "1" of the evaluation vector EV_1 (322).
- This continues until the evaluation vector EV_1 (322) contains one scalar for each of the "n" rows in the input matrix 232.
- a second evaluation vector EV_2 (324) is generated by row-wise convolving filter 2 (314) over the input matrix 232.
- the results of this convolution are the scalars b_0 to b_n in evaluation vector EV_2 (324).
- Sixty-four (64) evaluation vectors (or feature maps), EV_1 (322) to EV_64 (326), are generated by convolving sixty-four filters, filter 1 (312) to filter 64 (316), over the input matrix 232.
- the evaluation vectors EV_1 (322) to EV_64 (326) are provided as input to a fully connected (FC) neural network 332 to accumulate element-wise weighted sums of the evaluation vectors (feature maps) in an output vector 342.
- the first element FC[0] of the output vector 342 is calculated as the weighted sum of corresponding elements in all of the evaluation vectors (feature maps), i.e., W_1·EV_1[0] + W_2·EV_2[0] + … + W_64·EV_64[0].
- the output vector 342 has "n" elements corresponding to the "n" rows of the input matrix 232.
- a nonlinearity function 352 is applied to the output vector 342 to produce a normalized output vector 362 .
- nonlinearity functions include sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU) and leaky ReLU.
- a threshold 372 is applied to the normalized output vector 362 to determine anomalous data values.
- the elements of the output vector 382 represent the inverse similarity of corresponding field values in a given field of the fielded dataset. The higher the value of an element of the output vector 382, the higher the likelihood that the corresponding field value in the particular field is anomalous. For example, as shown in FIG. 3, the second element O[1] (392) in the output vector 382 is above the threshold 372. Therefore, the second field value (x_2) of the particular field in the dataset 202 is identified as anomalous by the anomaly detection network 114. A NumPy sketch of this forward pass follows.
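- Because each filter is 1×k and slides row-wise, the whole convolution stage reduces to one matrix product, as in this NumPy sketch of the forward pass (sigmoid is chosen here as the nonlinearity; the description lists several options):

```python
import numpy as np

def detection_forward(X, filters, fc_weights, threshold=0.5):
    """Forward pass of the one-layer anomaly detection network.

    X          : (n, k) input matrix of factor vectors
    filters    : (64, k) bank of 1 x k convolution filters
    fc_weights : (64,)  FC weights accumulating the feature maps

    Row-wise convolving a 1 x k filter over an n x k matrix is a dot
    product per row, so all 64 evaluation vectors are X @ filters.T.
    """
    ev = X @ filters.T                       # (n, 64) evaluation vectors
    out = ev @ fc_weights                    # element-wise weighted sums
    normalized = 1.0 / (1.0 + np.exp(-out))  # sigmoid nonlinearity
    anomalous_rows = np.where(normalized > threshold)[0]
    return normalized, anomalous_rows
```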
- During training of the FC neural network 332, a forward pass of the anomaly detection network 114 is performed as illustrated above.
- the results of the output vector 382 are compared with ground truth.
- the anomalous field value x_2 (corresponding to O[1] 392) is compared with the correct field value to determine whether the anomaly detection network correctly identified the anomalous word token in the field.
- two cost functions are used to update the weights in the fully connected neural network FC 332.
- the reason for using two cost functions is to prevent the anomaly detection network 114 from converging to an everything-is-non-anomaly solution, since anomalies are a small portion of a typical field of the fielded dataset.
- An additional benefit of using two cost functions is the ability to use different learning rates to achieve a balance between anomaly and non-anomaly detection. This results in more accurate detection of anomalous field values as anomalous, as well as more accurate detection of non-anomalous field values as non-anomalous. A sketch of such a weighted two-term update appears below.
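- The description does not publish the two cost functions or their learning rates; the PyTorch sketch below realizes the idea by weighting a per-row binary cross-entropy differently on anomalous and non-anomalous rows, which under plain SGD is equivalent to giving the two cost terms separate learning rates. The weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dual_cost_step(scores, labels, optimizer, w_anomaly=1.0, w_normal=0.1):
    """One training update with separate cost terms for anomalies
    (labels == 1) and non-anomalies (labels == 0). Weighting anomalies
    more heavily keeps the network from drifting toward an
    everything-is-non-anomaly solution."""
    per_row = F.binary_cross_entropy(scores, labels, reduction="none")
    weights = labels * w_anomaly + (1.0 - labels) * w_normal
    loss = (weights * per_row).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```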
- the training data for the anomaly detection network 114 is automatically generated by constructing positive and negative examples for inclusion in the training database 142 .
- a first set of field values that are similar to each other is identified from a vocabulary.
- a second set of field values is identified from the vocabulary that are dissimilar to each of the field values in the first set and to the field values selected so far in the second set.
- the training dataset for the given linguistic similarity measure is generated by randomly selecting some field values from the first and second sets as positive and negative examples respectively. This process is repeated for the five linguistic similarity measures described above: semantic similarity, syntactical similarity, soundex similarity, length similarity and format similarity.
- For frequency similarity, the system replicates each unique input field value a random number of times to increase its frequency of occurrence in the particular field of the fielded dataset. A sketch of this construction follows.
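- A sketch of this construction; pool contents and counts are illustrative, not taken from the patent:

```python
import random

def make_training_column(similar_pool, dissimilar_pool, n_pos=8, n_neg=2):
    """Build one synthetic training column for a similarity measure:
    positives drawn from mutually similar values, negatives from
    values dissimilar to them. The negatives are the planted anomalies."""
    column = random.sample(similar_pool, n_pos) \
           + random.sample(dissimilar_pool, n_neg)
    random.shuffle(column)
    return column

def inflate_frequencies(column, max_copies=10):
    """Replicate each unique value a random number of times to create
    frequency-of-occurrence training data."""
    inflated = []
    for value in set(column):
        inflated.extend([value] * random.randint(1, max_copies))
    random.shuffle(inflated)
    return inflated
```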
- FIG. 4A shows a high level view of an anomaly detection network 114 with one convolution layer while FIG. 4B shows the same with two convolution layers.
- FIG. 5A illustrates operation of the input generator 112 to generate input data for suggestion network 116 .
- the input generator 112 takes a dataset 502 as input. During training the dataset 502 is generated from the training database 142 and during production, the dataset 502 is generated from the testing database 144 .
- the dataset 502 is composed of field values. In the more common use case, these values are word tokens. In other implementations, these can be character tokens or phrase tokens. An example of a word token is “John” and an example of a phrase token is “J. P. Morgan”.
- the dataset 502 is a fielded dataset which means it can have fields such as "First Name", "Last Name", "Company Name", "Country" etc., and data is organized as field values in the fields.
- the input dataset 502 contains field values x_1 to x_n that are non-anomalous.
- the input dataset 502 contains the field values that have been processed by the anomaly detection network 114.
- the anomalous values have been identified and removed from the input dataset 502.
- the input dataset 502 also contains an input value, also referred to as a target label (TL).
- the suggestion network 116 suggests one or more unique field values from the “n” non-anomalous field values to replace the target label (TL).
- the input generator 112 generates factor vectors 530 for a plurality of similarity measures (also referred to as linguistic similarity measures) for the suggestion network 116 .
- Suggestion network 116 uses the same six similarity measures as anomaly detection network 114 : semantic similarity, syntactic similarity, soundex similarity, format similarity, length similarity, and frequency similarity.
- a factor vector calculator 210 contains similarity measure calculators for each of the similarity measures.
- FIG. 2A shows six similarity measure calculators corresponding to the six similarity measures listed above. The calculations of the similarity measures are the same as explained above for the anomaly detection network 114: the semantic similarity calculator 212 calculates the semantic similarity measure, the syntactical similarity calculator 214 the syntactical similarity measure, the soundex similarity calculator 216 the soundex similarity measure, the length similarity calculator 218 the length similarity measure, the frequency of occurrence similarity calculator 220 the frequency similarity measure, and the format similarity calculator 222 the format similarity measure.
- In the anomaly detection network 114, similarity measures are calculated for every unique field value with every other unique field value in a particular field of the fielded dataset.
- In the suggestion network 116, similarity measures are calculated for each unique input field value with the target label (TL).
- In the anomaly detection network 114, the most dissimilar input field value is recommended as anomalous.
- In the suggestion network 116, the most similar input field value (to the target label) is recommended to replace the target label.
- In some implementations, more than one similar input field value is recommended as a replacement value for the target label. Further evaluation of the recommended input field values is performed by an expert to select one unique field value to replace the target label.
- the six similarity calculators generate six factor vectors 530 corresponding to the six similarity measures.
- the six factor vectors are denoted by the variables “y”, “p”, “q”, “s”, “t”, and “z”.
- Each factor vector comprises as many elements as there are words in a field of the fielded dataset.
- Each word is also referred to as a field value.
- Each of the six factor vectors 530 consists of “n” values.
- the first factor vector consists of the values y_1 to y_n
- the second factor vector consists of the values p_1 to p_n
- the third factor vector consists of the values q_1 to q_n
- the fourth factor vector consists of the values s_1 to s_n
- the fifth factor vector consists of the values t_1 to t_n
- the sixth factor vector consists of the values z_1 to z_n.
- the six example factor vectors 530 shown in FIG. 5A form an input matrix, which is shown in FIG. 5B and referred to by numeral 532.
- the input matrix 532 is given as input to the suggestion network 116 by the input generator 112 .
- FIG. 5B shows further details of how factor vectors are calculated for similarity measures listed above.
- factor vector calculations of two similarity measures are shown as an example.
- the semantic similarity matrix 542 and format similarity matrix 548 compare only the target label (TL) with the input field values x_1 to x_n.
- the examples shown in FIG. 5B rely on matrix multiplications.
- the first similarity matrix 542 is for calculation of the semantic similarity measure. It is an n×1 matrix where n is the number of unique field values in a field in the fielded dataset.
- An inner product (also referred to as a dot product) is calculated between the word embedding vector of each unique field value in the field and the word embedding vector of the target label. These embedding vectors are provided by either the Word2Vec word embedding space, the GloVe word embedding space, WordNet, or any other low dimensional embedding space.
- the inner product between the word embedding vectors of two unique field values produces a scalar value, represented by the variable O_a, where "a" is the row index of the semantic similarity matrix 542.
- the elements of the factor vector correspond to the rows of the similarity matrix.
- For example, the value of the element y_1 of a factor vector FV_1 is equal to O_1.
- the factor vector FV_1 is composed of elements y_1 to y_n, each of which is calculated by performing similar row operations on corresponding rows of the similarity matrix. A sketch of this n×1 computation follows.
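- A sketch of the n×1 computation, reusing the assumed embed lookup from the detection-network sketch earlier:

```python
import numpy as np

def suggestion_semantic_factor_vector(unique_values, target_label, embed):
    """n x 1 semantic factor vector for the suggestion network: one
    inner product per unique field value against the target label,
    so element y_a equals O_a directly (no row averaging needed)."""
    t = embed(target_label)
    return np.array([float(np.dot(embed(v), t)) for v in unique_values])
```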
- similarity matrices are calculated for other similarity measures.
- For similarity measures such as semantic and syntactical similarity, dot products or inner products of vector representations of the field values of each field in the fielded dataset are calculated with the target label to generate scalars.
- Factor vectors are generated using the row values in the similarity matrices.
- For similarity measures such as soundex and format similarity, scalar values are generated by comparing every field value in a field with the target label using various underlying algorithms.
- the scalar values are arranged in n×1 matrices.
- the factor vector values are calculated in the same manner as above by using the values of the scalars in corresponding rows of the similarity matrix.
- For some similarity measures, n×1 matrices are not generated; rather, a Z-score for each field value in a field is calculated and used as the elements of the corresponding factor vector. Examples of such similarity measures include the length and frequency of occurrence similarity measures.
- the factor vectors for all similarity measures are arranged column-wise to generate an input matrix 532 .
- the input matrix 532 in FIG. 5B has six factor vectors FV_1 to FV_6 corresponding to the six similarity measures described above. Each row of the input matrix 532 corresponds to a field value in a field of the fielded dataset 502.
- an element of the factor vector for the given similarity measure specifies a likelihood that a corresponding unique field value in the dataset is similar to the target label in a context of the given similarity measure and conditioned on respective similarity measure values of other unique field values in the dataset for the given similarity measure.
- a row of the input matrix represents a vector that encodes a likelihood that a corresponding unique field value in the dataset is similar to the target label in a context of the plurality of linguistic similarity measures and conditionable on respective similarity measure values of other unique field values in the dataset for the plurality of linguistic similarity measures.
- FIG. 6 illustrates processing of the input matrix 532 by the convolutional neural network (CNN) of the suggestion network 116.
- In this implementation, the CNN is a one-layer network.
- one set of filters is applied to the input matrix 532.
- In other implementations, additional layers can be added to the CNN.
- 64 filters are row-wise convolved over the input matrix 532.
- a row-wise convolution of a filter on the input matrix 532 results in an evaluation vector (also referred to as a feature map).
- filter 1 (612) in FIG. 6 is convolved over the first row (y_1, p_1, q_1, s_1, t_1, z_1) of the input matrix 532 to generate a scalar a_0 at index position "0" of the evaluation vector EV_1 (622).
- Convolving filter 1 (612) over the second row (y_2, p_2, q_2, s_2, t_2, z_2) of the input matrix 532 generates a scalar a_1 at index position "1" of the evaluation vector EV_1.
- a second evaluation vector EV_2 (624) is generated by row-wise convolving filter 2 (614) over the input matrix 532.
- the results of this convolution are the scalars b_0 to b_n in evaluation vector EV_2 (624).
- Sixty-four (64) evaluation vectors (or feature maps), EV_1 (622) to EV_64 (626), are generated by convolving sixty-four filters, filter 1 (612) to filter 64 (616), over the input matrix 532.
- the evaluation vectors EV_1 (622) to EV_64 (626) are provided as input to a fully connected (FC) neural network 632 to accumulate element-wise weighted sums of the evaluation vectors (feature maps) in an output vector 642.
- the first element FC[0] of the output vector 642 is calculated as the weighted sum of corresponding elements in all of the evaluation vectors (feature maps), i.e., W_1·EV_1[0] + W_2·EV_2[0] + … + W_64·EV_64[0].
- the output vector 642 has “n” elements corresponding to the “n” rows of the input matrix 532 .
- a nonlinearity function 652 is applied to the output vector 642 to produce a normalized output vector 662 .
- nonlinearity functions include sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU) and leaky ReLU.
- a threshold 672 is applied to the normalized output vector 662 to determine similar data values.
- the elements of the output vector 682 represent the similarity of corresponding field values in a given input field of the fielded dataset to the target label. The higher the value of an element of the output vector 682, the higher the likelihood that the corresponding field value in the input field is similar to the target label. For example, as shown in FIG. 6, the first element in the output vector is above the threshold 672.
- Therefore, this element's corresponding field value (x_1) of the input field in the dataset 502 is recommended by the suggestion network as a replacement for the target label.
- multiple input field values can be recommended by the suggestion network 116 .
- An expert can select one input field value from the suggested values to replace the target label.
- During training of the FC neural network 632, a forward pass of the suggestion network 116 is performed as described above.
- the results of the output vector 682 are compared with ground truth.
- the suggested field value in the output vector 682 is compared with the correct value to determine whether the suggestion network correctly identified the replacement field value in the field.
- one cost function is used to update the weights in the fully connected neural network FC 632 .
- the first element O[0] (692) in the output vector 682 is above the threshold 672. Therefore, the first field value (x_1) of the particular field in the dataset 502 is used to replace the anomalous input field value, i.e., the target label (TL), by the suggestion network 116.
- the training data for the suggestion network 116 is automatically generated by constructing positive and negative examples for inclusion in the training dataset.
- a first set of field values that are similar to each other is identified from a vocabulary.
- a second set of field values is also identified from the vocabulary that are dissimilar to each of the field values in the first set and to the field values selected so far in the second set.
- the training dataset for the given linguistic similarity measure is generated by randomly selecting some field values from the first and second sets as positive and negative examples respectively. This process is repeated for the five linguistic similarity measures described above: semantic similarity, syntactical similarity, soundex similarity, length similarity and format similarity.
- For frequency similarity, the system replicates each unique field value a random number of times to increase its frequency of occurrence in the particular field of the fielded dataset.
- a target label is randomly selected from a set of anomalous data values.
- FIG. 7 shows a high level view of a suggestion network 116 with one convolution layer.
- FIG. 8 is a simplified block diagram 800 of a computer system 810 that can be used to implement the machine learning system 110 .
- Computer system 810 typically includes at least one processor 814 that communicates with a number of peripheral devices via bus subsystem 812 .
- peripheral devices can include a storage subsystem 824 including, for example, memory devices and a file storage subsystem, user interface input devices 822 , user interface output devices 820 , and a network interface subsystem 816 .
- the input and output devices allow user interaction with computer system 810 .
- Network interface subsystem 816 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
- User interface input devices 822 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
- use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 810 .
- User interface output devices 820 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem can include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem can also provide a non-visual display such as audio output devices.
- use of the term "output device" is intended to include all possible types of devices and ways to output information from computer system 810 to the user or to another machine or computer system.
- Storage subsystem 824 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processor 814 alone or in combination with other processors.
- Memory subsystem 826 used in the storage subsystem can include a number of memories including a main random access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored.
- a file storage subsystem 828 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations can be stored by file storage subsystem 828 in the storage subsystem 824 , or in other machines accessible by the processor.
- Bus subsystem 812 provides a mechanism for letting the various components and subsystems of computer system 810 communicate with each other as intended. Although bus subsystem 812 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
- Computer system 810 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computer system 810 depicted in FIG. 8 is intended only as one example. Many other configurations of computer system 810 are possible having more or fewer components than the computer system depicted in FIG. 8 .
- the technology disclosed relates to detection of anomalous field values for a particular field in a fielded dataset.
- the technology disclosed can be practiced as a system, method, or article of manufacture.
- One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable.
- One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
- a first system implementation of the technology disclosed includes one or more processors coupled to the memory.
- the memory is loaded with computer instructions to detect an anomalous field value.
- the system determines which field values for a particular field in a fielded dataset are anomalous.
- the system compares a particular unique field value to the other unique field values for the particular field by applying a plurality of similarity measures and generates a factor vector that has one scalar for each of the unique field values.
- the system compares the factor vector using convolution filters in a convolutional neural network (abbreviated CNN) to generate evaluation vectors (also referred to as feature maps).
- the system further evaluates the evaluation vectors using a fully connected (abbreviated FC) neural network to produce an anomaly scalar for the particular unique field value.
- a threshold is applied to the anomaly scalar to determine whether the particular unique field value is anomalous.
- System can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- the system uses a plurality of similarity measures including semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
- the system further includes determining that one or more field values in the fielded dataset are similar to an input value (also referred to as a target label) for a particular field.
- the system compares a particular input value to the unique field values for the particular input field by applying a plurality of similarity measures. This results in generation of a factor vector that has one scalar for each of the unique field values.
- the system evaluates the factor vector using the convolution filters in the CNN to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values.
- the system further evaluates the evaluation vectors using the FC neural network to produce suggestion scalars, and uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- the system includes calculating factor vectors for some of the similarity measures by calculating an inner product between respective similarity measure values of the unique field values in the dataset to form a similarity matrix.
- a row-wise average of the inner product results in the similarity matrix is calculated.
- a factor vector is formulated for the given similarity measure by arranging the row-wise averages as elements of the factor vector.
- An element of the factor vector for the given linguistic similarity measure specifies a likelihood that a corresponding unique field value in the dataset is anomalous in a context of the given linguistic similarity measure. Additionally, the element of the factor vector is also conditioned on respective similarity measure values of other unique values in the dataset for the given linguistic similarity measure.
- the system generates an input for the convolutional neural network (abbreviated CNN) by column-wise arranging the factor vectors in an input matrix.
- the convolution filters are applied row-wise on the input matrix.
- a row in the input matrix represents a vector that encodes a likelihood that a corresponding unique field value in the dataset is anomalous in a context of the plurality of linguistic similarity measures and conditionable on respective similarity measure values of other unique field values in the dataset for the plurality of linguistic similarity measures.
- the system automatically constructs positive and negative examples for inclusion in a training dataset. For a given linguistic similarity measure, the system constructs the training dataset by determining a first set of similar field values from a vocabulary and determines a second set of dissimilar field values from the vocabulary. The system then randomly selects some field values from the first set as positive examples and the second set as negative examples respectively. The system repeats the above process for each linguistic similarity measure to determine and select positive and negative training examples. The system stores the randomly selected field values for the plurality of similarity measures as the training dataset.
- the system trains the convolutional neural network (CNN) and the fully connected (FC) neural network using the positive and negative examples in the training dataset.
- the system uses at least two cost functions to evaluate performance of the CNN and the FC neural network during training.
- a first cost function evaluates classification of unique field values as anomalies and a second cost function evaluates classification of unique field values as non-anomalies.
- the system calculates separate gradients for the two cost functions and backpropagates the gradients to the CNN and the FC neural network during training.
- the convolutional neural network is a one-layer CNN. In another implementation of the system, the convolutional neural network is a two-layer CNN.
- a second system implementation of the technology disclosed includes one or more processors coupled to the memory.
- the memory is loaded with computer instructions to detect linguistically anomalous field values in a dataset.
- the system calculates at least one factor vector for each of a plurality of linguistic similarity measures. For each linguistic similarity measure, the system calculates its factor vector by averaging product results and/or distribution values calculated from similarity measure values of unique field values in the dataset for the given linguistic similarity measure.
- the factor vectors are provided as input to a convolutional neural network (abbreviated CNN).
- the system applies convolution filters to the factor vectors to generate evaluation vectors (also referred to as feature maps).
- the system provides the evaluation vectors as input to a fully-connected (abbreviated FC) neural network to accumulate element-wise weighted sums of the evaluation vectors in an output vector.
- the system applies a nonlinearity function to the output vector to produce a normalized output vector.
- the system applies thresholding to the normalized output vector to identify anomalous and similar field values in the dataset.
- implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above.
- implementations may include a method performing the functions of the system described above.
- a first method implementation of the technology disclosed includes detecting anomalous field values.
- the method includes, determining which field values for a particular field in a fielded dataset are anomalous.
- a particular unique field value is compared to other unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values.
- the method compares the factor vector using convolution filters in a convolutional neural network (abbreviated CNN) to generate evaluation vectors (also referred to as feature maps).
- the method further evaluates the evaluation vectors using a fully connected (abbreviated FC) neural network to produce an anomaly scalar for the particular unique field value.
- a threshold is applied to the anomaly scalar to determine whether the particular unique field value is anomalous.
- implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the first method described above.
- implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the first method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
- a second method implementation of the technology disclosed includes detecting linguistically anomalous field values in a dataset.
- the method includes calculating at least one factor vector for each of a plurality of linguistic similarity measures. For a given linguistic similarity measure, the method calculates its factor vector by averaging product results and/or distribution values calculated from similarity measure values of unique field values in the dataset for the given linguistic similarity measure.
- the method includes providing the factor vectors as input to a convolutional neural network (abbreviated CNN) and applying convolution filters to the factor vectors to generate evaluation vectors (also referred to as feature maps).
- the method includes providing the evaluation vectors as input to a fully-connected (abbreviated FC) neural network to accumulate element-wise weighted sums of the evaluation vectors in an output vector.
- the method includes applying a nonlinearity function to the output vector to produce a normalized output vector.
- the method includes thresholding the normalized output vector to identify anomalous and similar field values in the dataset.
- implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above.
- implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the second method described above.
- a third system implementation of the technology disclosed includes one or more processors coupled to the memory.
- the memory is loaded with computer instructions to suggest one or more candidates for a particular input value.
- the system determines that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset.
- the system performs this determination by comparing a particular input value to unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values.
- the system evaluates the factor vector using convolution filters in a convolutional neural network (CNN) to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values.
- the system further evaluates the evaluation vectors using a fully-connected (FC) neural network to produce suggestion scalars for similarity to the particular input value.
- the system uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- The system can also include features described in connection with the methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- the system uses a plurality of similarity measures including semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
- the system constructs an input to the CNN by column-wise arranging one or more factor vectors in an input matrix.
- the convolution filters apply row-wise on the input matrix.
- the system automatically constructs positive and negative examples for inclusion in a training dataset. For a given linguistic similarity measure, the system determines a first set of similar field values from a vocabulary and determines a second set of dissimilar field values from the vocabulary. Following this, the system randomly selects some field values from the first and second sets as positive and negative examples respectively. The system repeats the above process of determining the first set and the second set of field values for a plurality of similarity measures. Finally, the system stores the randomly selected field values for the plurality of similarity measures as the training dataset.
- the system further includes training the CNN and the FC neural network using the positive and negative examples in the training dataset.
- the system uses at least one cost function to evaluate performance of the CNN and the FC neural network during training.
- the convolutional neural network is a one-layer CNN. In another implementation of the system, the convolutional neural network (CNN) is a two-layer CNN.
- implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above.
- implementations may include a method performing the functions of the system described above.
- the system includes determining which field values for a particular field in the fielded dataset are anomalous. The system performs this determination by comparing a particular unique field value to other unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values. Following this, the system evaluates the factor vector using the convolution filters in the CNN to generate evaluation vectors (also referred to as feature maps). The system further evaluates the evaluation vectors using the FC neural network to produce an anomaly scalar for the particular unique field value. Finally, the system applies thresholding to the anomaly scalar to determine whether the particular unique field value is anomalous.
- a third method implementation of the technology disclosed includes suggesting one or more candidates for a particular input value.
- the method includes determining that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset.
- the method performs this determination by comparing a particular input value to unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values.
- the method evaluates the factor vector using convolution filters in a convolutional neural network (CNN) to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values.
- the method further evaluates the evaluation vectors using a fully-connected (FC) neural network to produce suggestion scalars for similarity to the particular input value.
- the method uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the third method described above.
- implementations may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the third method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the third method described above.
- the technology disclosed, and particularly the anomaly detection network 114 and the suggestion network 116, can be implemented in the context of any computer-implemented system including a database system, a multi-tenant environment, or a relational database implementation like an Oracle™ compatible database implementation, an IBM DB2 Enterprise Server™ compatible relational database implementation, a MySQL™ or PostgreSQL™ compatible relational database implementation or a Microsoft SQL Server™ compatible relational database implementation or a NoSQL™ non-relational database implementation such as a Vampire™ compatible non-relational database implementation, an Apache Cassandra™ compatible non-relational database implementation, a BigTable™ compatible non-relational database implementation, or an HBase™ or DynamoDB™ compatible non-relational database implementation.
- the technology disclosed can be implemented using different programming models like MapReduce™, bulk synchronous programming, MPI primitives, etc., or different scalable batch and stream management systems like Amazon Web Services (AWS)™, including Amazon Elasticsearch Service™ and Amazon Kinesis™, Apache Storm™, Apache Spark™, Apache Kafka™, Apache Flink™, Truviso™, IBM Info-Sphere™, Borealis™ and Yahoo! S4™.
- a computer-readable storage medium may be any device or medium that can store code and/or data for use by a computer system, including, but not limited to, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs) and DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable data now known or later developed.
Abstract
The technology disclosed determines which field values in a set of unique field values for a particular field in a fielded dataset are anomalous using six similarity measures. A factor vector is generated per similarity measure; the factor vectors are combined to form an input matrix. A convolutional neural network processes the input matrix to generate evaluation vectors. A fully-connected network evaluates the evaluation vectors to generate an anomaly scalar for a particular unique field value. Thresholding is applied to the anomaly scalar to determine whether the particular unique field value is anomalous.
Description
- This application is a continuation of and claims priority under 35 U.S.C. 120 to co-pending and commonly-owned U.S. nonprovisional application Ser. No. 15/726,267, filed Oct. 5, 2017, which is hereby expressly incorporated herein by reference in its entirety.
- This application is related to U.S. application Ser. No. 15/726,268, filed Oct. 5, 2017. The related application is incorporated by reference herein.
- The technology disclosed relates generally to data cleaning apparatus and corresponding methods for the detection of anomalous data stored in a fielded database or as computer files, and in particular relates to implementing a convolutional neural network that identifies anomalous field values and suggests correct field values to replace them.
- The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
- A large amount of data is being generated by today's business systems. Organizations use this data as input to business intelligence (BI) systems, which require high quality input data. Current data cleaning techniques require significant human intervention to clean data. Moreover, typically only one criterion or a limited number of criteria can be applied to any given input field of a fielded dataset to identify anomalous field values. Therefore, existing data cleaning techniques are neither scalable nor efficient. An opportunity arises to develop a data cleaning technique that does not require significant human intervention and is scalable. Effective and efficient data cleaning may result.
- The following are attached hereto as part of a single invention:
- Appendix A is a presentation showing examples of data structures, algorithms, neural network architectures, and experimental results at a high-level.
- Appendix B is a report showing examples of data structures, algorithms, neural network architectures, and experimental results in greater detail.
- In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:
- FIG. 1 shows aspects of a system in which a machine learning system detects anomalous field values and suggests correct field values to replace anomalous field values.
- FIG. 2A depicts one implementation of an input generator for the anomaly detection network of FIG. 1 using six similarity calculators.
- FIG. 2B shows calculations for determining factor vectors using two example similarity measures to generate an input matrix.
- FIG. 3 shows processing of the input matrix of FIG. 2B by the anomaly detection network of FIG. 1 to identify anomalous field values.
- FIG. 4A shows examples of main components of a one-layer anomaly detection network.
- FIG. 4B shows examples of main components of a two-layer anomaly detection network.
- FIG. 5A is an example of an input generator for the suggestion network of FIG. 1 using six similarity calculators.
- FIG. 5B shows calculations for determining factor vectors using two example similarity measures to generate an input matrix.
- FIG. 6 shows processing of the input matrix of FIG. 5B by the suggestion network of FIG. 1 to suggest a correct field value for the anomalous field value.
- FIG. 7 shows examples of main components of a one-layer suggestion network of FIG. 1.
- FIG. 8 is a simplified block diagram of a computer system that can be used to implement the machine learning system of FIG. 1.
- The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
- More and more companies are using data to help drive their business decisions. Data is collected from a variety of sources including sales processes, accounting systems, call centers, etc. Business intelligence (BI) systems analyze data from a variety of sources. Before data is consumed by BI systems, it needs to be cleaned to make sure any conclusion based on the data is reliable. Data scientists typically spend more than 70% of their time cleaning the data before doing any data analysis. The large amounts of data generated by today's business systems make the manual data cleaning process very expensive. A convolutional neural network (CNN) is trained to identify anomalous field values in a fielded dataset. The CNN speeds up the data cleaning process by automatically detecting anomalous field values, reducing the manual effort required to clean the data.
- The CNN uses hints based on six similarity measures, which cover both semantic and syntactic aspects of fielded data, to identify anomalous field values in a particular input field. The six similarity measures are semantic similarity, syntactical similarity, soundex similarity, length similarity, frequency of occurrence similarity and format similarity. A key element of the technology disclosed is extraction of hints from the fielded data for the six similarity measures and organizing these hints in the form of vectors that can be processed by the CNN.
- The training data for the CNN is generated artificially. Separate datasets are generated for five of the similarity measures listed above, all except the frequency of occurrence similarity measure. Each dataset contains a separate collection of similar and dissimilar unique field values. A training dataset is generated by randomly picking field values from the similar and dissimilar field values. To generate frequency similarity data, each of the unique field values in the field is multiplied by a randomly generated positive integer. For production, real world testing data is used from public data sources. The higher the quality of the input data for training, the higher the accuracy of the CNN in production.
- Data cleaning of a data field is composed of two processes: anomaly detection and anomaly suggestion. A separate CNN is used for each of these two processes: an anomaly detection network and a suggestion network. The respective CNNs are trained separately. The anomaly detection CNN can reach 97% detection for anomalies and 88% detection for non-anomalies on the training set, and 60% and 83% detection respectively on the testing set. The suggestion CNN can reach 60% accuracy on the training set and 70% on the testing set.
- There are two ways to apply the anomaly detection and anomaly suggestion networks to identify anomalous data and clean it. The first option is an automatic detection-suggestion flow, in which the anomalous field values identified by the anomaly detection network are automatically passed to the suggestion network. The suggestion network identifies a correct field value for each anomalous field value and replaces the anomalous field value with the correct field value. The second option is to get expert feedback after the detection of anomalous field values, which avoids passing false positives to the suggestion network and helps the user understand the results and provide feedback. The second option requires more interaction from an expert, but generally gains higher accuracy.
- We describe a system for detecting anomalous field values for a particular field in a fielded dataset. The system and processes are described with reference to FIG. 1 showing an architectural level schematic of a system in accordance with an implementation. Because FIG. 1 is an architectural diagram, certain details are intentionally omitted to improve the clarity of the description.
- The discussion of FIG. 1 is organized as follows. First, the elements of the figure are described, followed by their interconnections. Then, the use of the elements in the system is described in greater detail.
- FIG. 1 includes the system 100. The system 100 includes a machine learning system 110, a network 120, a training database 142, a testing database 144 and an administrator interface 132. The machine learning system 110 includes an input generator 112, an anomaly detection network 114 and a suggestion network 116. At a high level, the input generator 112 generates input data for the anomaly detection network 114 and the suggestion network 116. The data that is consumed by the input generator 112 is provided by the training database 142 during training and by the testing database 144 during production.
- The anomaly detection network 114 determines which field values in a set of field values for a particular field in a fielded dataset are anomalous. The suggestion network 116 determines that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset. Both the anomaly detection network 114 and the suggestion network 116 can run on a variety of hardware processors such as graphic processor units (GPUs). Neural network-based models involve computationally intensive methods, such as convolutions and matrix-based operations. GPUs are well suited for these types of computations. Recently, specialized hardware is being developed to efficiently train neural network models.
- During training of the anomaly detection network 114 and the suggestion network 116, the input generator 112 receives data from the training database 142 and processes it to generate input for the anomaly detection network 114 and the suggestion network 116. During production, the input generator 112 receives data from the testing database 144 and processes it to produce input for the anomaly detection network 114 and the suggestion network 116.
- The administrative interface 132 can be used during training or production to provide input data to the input generator 112 from respective databases. The administrative interface 132 can also be used during training to control model parameters of the anomaly detection network 114 and the suggestion network 116. The machine learning system 110, the training database 142, the testing database 144 and the administrative interface 132 are in communication with each other via the network(s) 120. After presenting a high level description of the system 100, the discussion now turns to detailed description of various components of the system 100.
- FIG. 2A illustrates operation of the input generator 112 to generate input data for the anomaly detection network 114 during training. The input generator 112 takes a dataset 202 as input. During training, the dataset 202 is generated from the training database 142 and during production, the dataset 202 is generated from the testing database 144. The dataset 202 is composed of field values. In the more common use case, these field values are word tokens. In other implementations, field values can be character tokens or phrase tokens. An example of a word token is "John" and an example of a phrase token is "J. P. Morgan". The dataset 202 is a fielded dataset which means it can have fields such as "First Name", "Last Name", "Company Name", "Country" etc., and data is organized as field values in the fields.
- The input generator 112 generates factor vectors 230 for a plurality of similarity measures (also referred to as linguistic similarity measures). In one implementation, six similarity measures are used. The six similarity measures are semantic similarity, syntactical similarity, soundex similarity, length similarity, frequency of occurrence similarity and format similarity. In other implementations, fewer than six or more than six similarity measures can be used. A factor vector calculator 210 contains similarity measure calculators for each similarity measure. FIG. 2A shows six similarity measure calculators corresponding to the six similarity measures listed above. A semantic similarity calculator 212 calculates the semantic similarity measure, a syntactical similarity calculator 214 calculates the syntactical similarity measure, a soundex similarity calculator 216 calculates the soundex similarity measure, a length similarity calculator 218 calculates the length similarity measure, a frequency of occurrence similarity calculator 220 calculates the frequency similarity measure and a format similarity calculator 222 calculates the format similarity measure.
- Semantic similarity between two field values of a particular input field of the fielded dataset is calculated using either a Word2Vec, WordNet or GloVe model. Two given input field values of the input field are semantically similar to each other if their meanings are similar in the vocabulary. For example, consider a field listing fruit names: a first field value "apple" is semantically similar to a second field value "orange". In the same field, the first field value "apple" is not semantically similar to a third field value "airplane". Every field value of the particular field is compared with every other field value in the same field to determine the semantic similarity measure between them. For this purpose, two field values are passed to one of the selected models listed above. The model represents each field value as a word vector in a low dimensional embedding space (typically between 100 and 1000 dimensions). An inner product or dot product of the two multidimensional vector representations of the field values is calculated by the selected model. The results of the calculations are used to fill the n×n similarity matrix for the semantic similarity measure, where "n" is the number of unique field values (also referred to as words, word phrases or tokens) in the field.
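- As a sketch of the semantic calculation only, the snippet below fills a small n×n matrix of inner products; the three-dimensional embeddings are made-up placeholders for vectors that would in practice come from a Word2Vec, GloVe or WordNet embedding space:

```python
import numpy as np

# Hypothetical embeddings; real ones are 100- to 1000-dimensional.
embeddings = {
    "apple":    np.array([0.9, 0.1, 0.0]),
    "orange":   np.array([0.8, 0.2, 0.1]),
    "airplane": np.array([0.0, 0.1, 0.9]),
}

values = ["apple", "orange", "airplane"]
vecs = np.stack([embeddings[v] for v in values])

# n x n semantic similarity matrix: inner product of every pair of word vectors.
semantic_matrix = vecs @ vecs.T
print(semantic_matrix.round(2))   # "apple"/"orange" score high, "airplane" low
```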
- Syntactic similarity between two field values of a particular field of the fielded dataset is calculated using a so-called "bag of letters" algorithm. The syntactic similarity between two input field values indicates how similar they are in terms of the arrangement of the characters in the field values. For example, a first field value "gardenia" is syntactically similar to a second field value "grandniece" in a particular input field. The first input value "gardenia" is not syntactically similar to a third input value "hoff" in the same input field. The syntactic similarity is calculated between every two field values of a given field. For this purpose, two field values are given as input to the "bag of letters" algorithm. In a first step of the algorithm, all upper case letters in the two field values are converted to the respective lower case letters. The characters in both field values are then converted to equivalent American Standard Code for Information Interchange (ASCII) codes. In the next step, the algorithm represents each input field value as a multidimensional vector in a character embedding space. The dimensionality of the embedding space is equal to the number of codes in ASCII. Multiple occurrences of the same character are represented as magnitude on that character's respective dimension. Finally, the "bag of letters" algorithm calculates an inner product or dot product of the multidimensional vector representations of the two input field values. The results of the calculations are used to fill the n×n similarity matrix for the syntactic similarity measure, where "n" is the number of unique field values in the particular field.
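- A minimal sketch of the "bag of letters" steps described above (lowercasing, ASCII-indexed counts, inner product); any normalization the disclosed algorithm may apply is omitted:

```python
import numpy as np

def bag_of_letters(value: str) -> np.ndarray:
    """128-dimensional vector indexed by ASCII code; repeats add magnitude."""
    vec = np.zeros(128)
    for ch in value.lower():              # step 1: convert to lower case
        code = ord(ch)                    # step 2: equivalent ASCII code
        if code < 128:
            vec[code] += 1.0              # multiple occurrences add magnitude
    return vec

def syntactic_similarity(a: str, b: str) -> float:
    return float(bag_of_letters(a) @ bag_of_letters(b))   # inner product

print(syntactic_similarity("gardenia", "grandniece"))  # high character overlap
print(syntactic_similarity("gardenia", "hoff"))        # low character overlap
```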
- Soundex similarity between two field values of a particular field indicates how similar they are in terms of their respective pronunciations in the English language. For example, a first field value "abel" is soundex similar to a second field value "flail" in a particular field. The first field value "abel" is not soundex similar to a third input field value "rowe" in the same field of the fielded dataset. In one implementation, a custom algorithm is used to determine the soundex similarity between two input field values by determining corresponding positions of similar-sounding English letters in the respective field values. In other implementations, soundex algorithms such as Metaphone or Double Metaphone can be used to determine soundex similarity between two input field values. A soundex similarity score of each field value is calculated with every other field value in a particular input field of the fielded dataset. The results of the calculations are used to fill the n×n similarity matrix for the soundex similarity measure, where "n" is the number of unique words or tokens in the input field.
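- The custom position-based algorithm is not spelled out here, so the sketch below substitutes the classic four-character Soundex code as a well-known illustrative stand-in for comparing pronunciations (its groupings differ from the disclosed algorithm):

```python
SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(value: str) -> str:
    """Classic Soundex: first letter plus up to three consonant-class digits."""
    letters = [ch for ch in value.lower() if ch.isalpha()]
    if not letters:
        return "0000"
    code, prev = letters[0].upper(), SOUNDEX_MAP.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_MAP.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":        # 'h' and 'w' do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]

def soundex_similar(a: str, b: str) -> bool:
    return soundex(a) == soundex(b)

print(soundex("robert"), soundex("rupert"))   # R163 R163: similar pronunciation
```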
- Format similarity between two input field values indicates how similar they are in terms of their formats. For example, a first field value “0x12345” is similar in format to a second field value “0x32749” in a particular field. The first field value “0x12345” is not format similar to a third field value “1234567” in the same input field of the fielded dataset. In one implementation, a custom format matching algorithm is given two field values of a particular field as input. The format matching algorithm calculates a format similarity score for the two field values by comparing characters at corresponding positions in the two field values. A format similarity score for each field value is calculated with every other field value in a particular input field of the fielded dataset. The results of the calculations are used to fill the n×n similarity matrix for format similarity measure where “n” is the number of unique field values in the particular field.
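- A sketch of one way to compare formats at corresponding character positions; mapping every digit to "d" and every letter to "a" is an assumption, and the disclosed algorithm computes a score rather than this boolean simplification:

```python
def format_signature(value: str) -> str:
    """Map each character to a class: 'd' digit, 'a' letter, others kept as-is."""
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c for c in value)

def format_similar(a: str, b: str) -> bool:
    return format_signature(a) == format_signature(b)

print(format_similar("0x12345", "0x32749"))  # True: both map to "daddddd"
print(format_similar("0x12345", "1234567"))  # False: letter vs digit at position 1
```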
- The length similarity measure is used to identify field values that are longer or shorter than the average length of field values in a particular field. For example, consider the field values "firm", "lamb", "coon", "key", "septuagenarian", "x" in a particular input field. The length similarity measure will identify the field values "septuagenarian" and "x" as anomalous because their respective lengths are much longer or shorter than the average length of field values in the particular input field. In one implementation, the algorithm to calculate the length similarity measure first generates the length of each field value. Following this, the algorithm determines a Z-score (also referred to as a standard score) using the mean and standard deviation. The Z-score is a statistical measure of a score's relationship to the mean in a group of scores. The values of the Z-scores are normalized using the largest Z-score value. The normalized values of the Z-scores range between "0" and "1" inclusive. A normalized Z-score of zero means the score is the same as the mean, which means the corresponding input field value lies at the mean. A higher normalized Z-score value implies the score is farther away from the mean: the corresponding field value is either too short or too long compared to other field values in the particular field. A normalized Z-score of "1" means the corresponding input field value is farthest from the mean and consequently has the highest inverse similarity measure. Finally, the field values with the highest inverse similarity measure values are recommended as anomalous.
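- A sketch of the Z-score calculation on the example values above; the use of the absolute Z-score for normalization is an assumption consistent with scores ranging from 0 to 1:

```python
import numpy as np

values = ["firm", "lamb", "coon", "key", "septuagenarian", "x"]
lengths = np.array([len(v) for v in values], dtype=float)

z = (lengths - lengths.mean()) / lengths.std()   # standard scores
z_norm = np.abs(z) / np.abs(z).max()             # normalize into [0, 1]

# The highest normalized scores mark lengths farthest from the mean.
for value, score in sorted(zip(values, z_norm), key=lambda t: -t[1]):
    print(f"{value:>16s}  {score:.2f}")          # "septuagenarian" and "x" rank highest
```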
- The frequency similarity measure identifies anomalous field values using frequency counts of the unique field values in a particular field of the fielded dataset. Typically, in a field of a fielded dataset, unique values occur multiple times. If a particular field value occurs only once or very few times, then most likely it is an anomaly. For example, suppose the field values and their corresponding frequencies of occurrence in a particular input field are: "J.P. Morgan" (135), "Goldman Sachs" (183), "Citi" (216), "Morgan Stanley" (126), "City" (1). The anomalous field value is "City", which is the least frequent value with a frequency of occurrence of "1".
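- Frequency counting is straightforward; a sketch with the example counts above:

```python
from collections import Counter

field = (["J.P. Morgan"] * 135 + ["Goldman Sachs"] * 183 + ["Citi"] * 216
         + ["Morgan Stanley"] * 126 + ["City"])

counts = Counter(field)
rarest, count = min(counts.items(), key=lambda kv: kv[1])
print(rarest, count)   # "City" 1: the least frequent value is the anomaly candidate
```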
- Turning to the example factor vector calculator 210 shown in FIG. 2A, the six similarity calculators generate six factor vectors 230 corresponding to the six similarity measures. The six factor vectors are denoted by the variables "y", "p", "q", "s", "t", and "z" respectively. Each factor vector comprises as many elements as unique field values in a field of the fielded dataset. As shown in the example dataset 202, there are "n" unique field values x1 to xn in a field. Each field value is also referred to as a word, word phrase or token. Each of the six factor vectors 230 consists of "n" values or elements. For example, the first factor vector consists of the values y1 to yn, the second factor vector consists of the values p1 to pn, the third factor vector consists of the values q1 to qn, the fourth factor vector consists of the values s1 to sn, the fifth factor vector consists of the values t1 to tn and the sixth factor vector consists of the values z1 to zn. When arranged column-wise, the six example factor vectors 230 shown in FIG. 2A form an input matrix, which is shown in FIG. 2B referred to by a numeral 232. The input matrix 232 is given as input to the anomaly detection network 114 by the input generator 112.
- FIG. 2B shows further details of how factor vectors are calculated for the similarity measures listed above. In FIG. 2B, factor vector calculations of two similarity measures are shown as an example. A person skilled in the art will appreciate that other factor vectors may be calculated similarly, either using matrix multiplications or scalar interactions. The examples shown in FIG. 2B rely on matrix multiplications.
- The first similarity matrix 212b is for the semantic similarity measure. It is an n×n matrix where n is the number of unique field values in a field of the fielded dataset. An inner product (also referred to as a dot product) is calculated between the word embedding vector of each unique field value in the field and the word embedding vector of every other unique field value in the same field of the fielded dataset. These embedding vectors are generated by either the Word2Vec word embedding space, the GloVe word embedding space, WordNet or any other low dimensional embedding space. The inner product between the word embedding vectors of two unique field values produces scalar values which are represented by the variable Oab, where "a" is the row index and "b" is the column index of the semantic similarity matrix 212b. The elements of the factor vector are row averages of the corresponding rows of the similarity matrix. For example, the value of the element y1 of a factor vector FV1 is calculated by taking the average of all the values in the first row of the semantic similarity matrix 212b, i.e., O11 to O1n. The factor vector FV1 is composed of elements y1 to yn, each of which is calculated by performing similar row average operations on the corresponding rows of the similarity matrix.
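- A small numerical sketch of the row-averaging step, with a made-up 4×4 similarity matrix O:

```python
import numpy as np

# Hypothetical 4x4 semantic similarity matrix O; one row per unique field value.
O = np.array([[1.0, 0.8, 0.7, 0.1],
              [0.8, 1.0, 0.6, 0.2],
              [0.7, 0.6, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])

# Element y_a of factor vector FV1 is the average of row a, i.e. mean of O[a][0..n-1].
FV1 = O.mean(axis=1)
print(FV1)   # [0.65 0.65 0.6 0.35]: the last field value is least similar to the rest
```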
FIG. 2B shows calculation of factor vector FV6 usingformat similarity matrix 218 b. For some similarity measures such as semantic and syntactical similarity, dot products or inner products of vector representations of unique field values of each field in the fielded dataset are calculated with every other unique field value in the same field of the fielded dataset to generate scalars. Row averages or weighted row averages of scalars are used to calculate values of the elements of the factor vectors. For other similarity measures such as soundex and format similarity measures, scalar values are generated by comparing every unique field value in a field with every other unique field value in the same field of the fielded dataset using various underlying algorithms. The scalar values are arranged in n×n matrices. The factor vector values are calculated in the same manner as above by taking row averages or weighted row averages of the scalars in corresponding rows of the similarity matrix. For some other similarity measures, n×n matrices are not generated, rather Z-score for each unique field value in a particular field is calculated and used as elements in the respective factor vector. Examples of such similarity measures include length and frequency of occurrence similarity measures as explained above. - The factor vectors for all similarity measures are arranged column-wise to generate an
input matrix 232. Theinput matrix 232 inFIG. 2B has six factor vectors FV1 to FV6 corresponding to six similarity measures described above. Each row of theinput matrix 232 corresponds to a unique field value in a particular field of the fieldeddataset 202. Intuitively, an element of a factor vector for a given similarity measure specifies a likelihood that a corresponding unique field value in a field of the fielded dataset is anomalous in a context of the given similarity measure and conditioned on respective similarity measure values of other unique field values in the particular field of the fielded dataset for the given similarity measure. - A row of the
input matrix 232 represents a vector that encodes a likelihood that a corresponding unique field value in a particular field of the fielded dataset is anomalous in a context of the plurality of similarity measures and conditionable on respective similarity measure values of other unique field values of the particular of the fielded dataset for the plurality of linguistic similarity measures. -
- FIG. 3 illustrates processing of the input matrix 232 by the convolutional neural network (CNN) of the anomaly detection network 114. The CNN can be a one-layer network or a two-layer network depending on the implementation. In a one-layer CNN, one set of filters is applied to the input matrix 232. In a two-layer CNN, two sets of filters are applied to the input matrix 232. A person skilled in the art will appreciate that additional layers can be added to the CNN. In the first step, 64 filters are row-wise convolved over the input matrix 232. A row-wise convolution of a filter on the input matrix 232 results in an evaluation vector (also referred to as a feature map). Note that the size of the filter is 1×k where k is the number of similarity measures.
- For example, filter 1 312 in FIG. 3 is convolved over the first row (y1, p1, q1, s1, t1, z1) of the input matrix 232 to generate a scalar "a0" at index position "0" of the evaluation vector EV1 322. Convolving the filter 1 312 over the second row of the input matrix 232 generates a scalar "a1" at index position "1" of the evaluation vector EV1 322. The same process is followed to generate scalars up to "an" in the evaluation vector EV1 322 corresponding to the "n" rows in the input matrix 232. A second evaluation vector EV2 324 is generated by row-wise convolving a filter 2 314 over the input matrix 232. The results of this convolution are the scalars "b0" to "bn" in the evaluation vector EV2 324. Sixty-four (64) evaluation vectors (or feature maps) EV1 322 to EV64 326 are generated by convolving sixty-four filters, filter 1 312 to filter 64 316, over the input matrix 232. The evaluation vectors EV1 322 to EV64 326 are provided as input to a fully connected (FC) neural network 332 to accumulate element-wise weighted sums of the evaluation vectors (feature maps) in an output vector 342. For example, the first element FC[0] of the output vector 342 is calculated as the weighted sum of corresponding elements in all of the evaluation vectors (feature maps), i.e., W1·EV1[0]+W2·EV2[0]+ . . . +W64·EV64[0]. The output vector 342 has "n" elements corresponding to the "n" rows of the input matrix 232.
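- The arithmetic narrated above can be written out directly; the filter weights and FC weights below are random placeholders for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, num_filters = 5, 6, 64
input_matrix = rng.random((n, k))                 # rows: field values; columns: FV1..FV6
filters = rng.standard_normal((num_filters, k))   # filter 1 to filter 64, each 1 x k

# Row-wise convolution: each filter slides down the rows, emitting one scalar per row.
eval_vectors = np.array([[row @ filt for row in input_matrix] for filt in filters])

# FC accumulation: FC[i] = W1*EV1[i] + W2*EV2[i] + ... + W64*EV64[i].
W = rng.standard_normal(num_filters)
output_vector = np.array([W @ eval_vectors[:, i] for i in range(n)])
print(output_vector.shape)   # (5,): one entry per row of the input matrix
```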
- A nonlinearity function 352 is applied to the output vector 342 to produce a normalized output vector 362. Examples of nonlinearity functions include sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU) and leaky ReLU. A threshold 372 is applied to the normalized output vector 362 to determine anomalous data values. In the anomaly detection network 114, the elements of the output vector 382 represent the inverse similarity of corresponding field values in a given field of the fielded dataset. The higher the value of an element of the output vector 382, the higher the likelihood that the corresponding field value in the particular field is anomalous. For example, as shown in FIG. 3, the second element O[1] 392 in the output vector 382 is above the threshold 372. Therefore, the second field value (x2) of the particular field in the dataset 202 is identified as anomalous by the anomaly detection network 114.
- Having described the operation of the anomaly detection network 114 in FIG. 3, we now describe the training of the FC neural network 332. In FIG. 3, a forward pass of the anomaly detection network 114 is illustrated. During training, the results of the output vector 382 are compared with the ground truth. For example, the anomalous field value x2 (corresponding to O[1] 392) is compared with the correct field value to determine whether the anomaly detection network correctly identified the anomalous word token in the field. In the anomaly detection network, two cost functions are used to update the weights in the fully connected neural network FC 332. The reason for using two cost functions is to prevent the anomaly detection network 114 from moving towards an everything-is-non-anomaly solution, since anomalies are a small portion of a typical field of the fielded dataset. An additional benefit of using two cost functions is the ability to use different learning rates to achieve a balance between anomaly and non-anomaly detection. This results in more accurate detection of anomalous field values as anomalous as well as more accurate detection of non-anomalous field values as non-anomalous.
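- A sketch of the two-cost-function idea using PyTorch autograd; the binary cross-entropy split, the learning rates, and the random logits are assumptions, since the disclosure does not specify the exact cost functions:

```python
import torch

logits = torch.randn(10, requires_grad=True)  # stand-in for the FC output vector
labels = torch.tensor([1., 0., 0., 1., 0., 0., 0., 0., 1., 0.])  # 1 = anomaly
probs = torch.sigmoid(logits)

eps = 1e-7
anomaly = labels == 1
cost_anomaly = -torch.log(probs[anomaly] + eps).mean()       # anomalies flagged correctly
cost_normal = -torch.log(1 - probs[~anomaly] + eps).mean()   # non-anomalies kept clean

# Separate gradients for the two costs, backpropagated with different learning rates.
grad_a = torch.autograd.grad(cost_anomaly, logits, retain_graph=True)[0]
grad_n = torch.autograd.grad(cost_normal, logits)[0]
lr_anomaly, lr_normal = 0.05, 0.01
with torch.no_grad():
    logits -= lr_anomaly * grad_a + lr_normal * grad_n
```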
- The training data for the anomaly detection network 114 is automatically generated by constructing positive and negative examples for inclusion in the training database 142. For each linguistic similarity measure, a first set of field values is identified from a vocabulary which are similar to each other. A second set of field values is identified from the vocabulary that are dissimilar to each of the field values in the first set and to the field values selected so far in the second set. The training dataset for the given linguistic similarity measure is generated by randomly selecting some field values from the first and second sets as positive and negative examples respectively. This process is repeated for the five linguistic similarity measures described above: semantic similarity, syntactical similarity, soundex similarity, length similarity and format similarity. For frequency similarity, the system randomly multiplies each unique input field value to increase its frequency of occurrence in the particular field of the fielded dataset.
- FIG. 4A shows a high level view of an anomaly detection network 114 with one convolution layer while FIG. 4B shows the same with two convolution layers.
- FIG. 5A illustrates operation of the input generator 112 to generate input data for the suggestion network 116. The input generator 112 takes a dataset 502 as input. During training, the dataset 502 is generated from the training database 142 and during production, the dataset 502 is generated from the testing database 144. The dataset 502 is composed of field values. In the more common use case, these values are word tokens. In other implementations, these can be character tokens or phrase tokens. An example of a word token is "John" and an example of a phrase token is "J. P. Morgan". The dataset 502 is a fielded dataset which means it can have fields such as "First Name", "Last Name", "Company Name", "Country" etc., and data is organized as field values in the fields.
- As opposed to the input dataset 202 for the anomaly detection network 114, the input dataset 502 contains field values x1 to xn that are non-anomalous. In one implementation, the input dataset 502 contains the field values that have been processed by the anomaly detection network 114; the anomalous values have been identified and removed from the input dataset 502. In addition to the "n" non-anomalous field values x1 to xn in a field, the input dataset 502 also contains an input value, also referred to as a target label (TL). The suggestion network 116 suggests one or more unique field values from the "n" non-anomalous field values to replace the target label (TL).
- As described above for the anomaly detection network 114, the input generator 112 generates factor vectors 530 for a plurality of similarity measures (also referred to as linguistic similarity measures) for the suggestion network 116. The suggestion network 116 uses the same six similarity measures as the anomaly detection network 114: semantic similarity, syntactic similarity, soundex similarity, format similarity, length similarity, and frequency similarity. A factor vector calculator 210 contains similarity measure calculators for each of the similarity measures. FIG. 2A shows six similarity measure calculators corresponding to the six similarity measures listed above. The calculations of the similarity measures are the same as explained above for the anomaly detection network 114. The semantic similarity calculator 212 calculates the semantic similarity measure, the syntactical similarity calculator 214 calculates the syntactical similarity measure, the soundex similarity calculator 216 calculates the soundex similarity measure, the length similarity calculator 218 calculates the length similarity measure, the frequency of occurrence similarity calculator 220 calculates the frequency similarity measure and the format similarity measure calculator 222 calculates the format similarity measure.
anomaly detection network 114, similarity measures are calculated for every unique field value with every other unique field value in a particular field of the fielded dataset. Insuggestion network 116, similarity measures are calculated for each unique input field value with the target label (TL). Inanomaly detection network 114, the most dissimilar input field value is recommended as anomalous. Insuggestion network 116, the most similar input field value (to the target label) is recommended to replace the target label. In another implementation, more than one similar input field values are recommended as replacement values for the target label. Further evaluation of the recommended input field values is performed by an expert to select one unique field value to replace the target label. - Turning to the example
- Turning to the example factor vector calculator 210 shown in FIG. 5A, the six similarity calculators generate six factor vectors 530 corresponding to the six similarity measures. The six factor vectors are denoted by the variables "y", "p", "q", "s", "t", and "z". Each factor vector comprises as many elements as words in a field of the fielded dataset. As shown in the example dataset 502, there are "n" words x1 to xn in a field. Each word is also referred to as a field value. Each of the six factor vectors 530 consists of "n" values. For example, the first factor vector consists of the values y1 to yn, the second factor vector consists of the values p1 to pn, the third factor vector consists of the values q1 to qn, the fourth factor vector consists of the values s1 to sn, the fifth factor vector consists of the values t1 to tn and the sixth factor vector consists of the values z1 to zn. When arranged column-wise, the six example factor vectors 530 shown in FIG. 5A form an input matrix, which is shown in FIG. 5B referred to by a numeral 532. The input matrix 532 is given as input to the suggestion network 116 by the input generator 112.
- FIG. 5B shows further details of how factor vectors are calculated for the similarity measures listed above. In FIG. 5B, factor vector calculations of two similarity measures are shown as an example. As opposed to the anomaly detection network 114, for the suggestion network 116, the semantic similarity matrix 542 and the format similarity matrix 548 compare only the target label (TL) with the input field values x1 to xn. A person skilled in the art will appreciate that other factor vectors may be calculated similarly, either using matrix multiplications or scalar interactions. The examples shown in FIG. 5B rely on matrix multiplications.
- The first similarity matrix 542 is for the calculation of the semantic similarity measure. It is an n×1 matrix where n is the number of unique field values in a field in the fielded dataset. An inner product (also referred to as a dot product) is calculated between the word embedding vector of each unique field value in the field and the word embedding vector of the target label. These embedding vectors are provided by either the Word2Vec word embedding space, the GloVe word embedding space, WordNet or any other low dimensional embedding space. The inner product between the word embedding vectors of two unique field values produces scalar values which are represented by the variable Oa, where "a" is the row index of the semantic similarity matrix 542. The elements of the factor vector correspond to the rows of the similarity matrix. For example, the value of the element y1 of a factor vector FV1 is equal to O1. The factor vector FV1 is composed of elements y1 to yn, each of which is calculated by performing similar row operations on the corresponding rows of the similarity matrix.
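- A sketch of the n×1 comparison against the target label; the eight-dimensional random vectors are placeholders for real word embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
field_values = ["J.P. Morgan", "Goldman Sachs", "Citi", "Morgan Stanley"]
target_label = "City"

# Placeholder embeddings; in practice from a Word2Vec, GloVe or WordNet space.
embed = {v: rng.random(8) for v in field_values + [target_label]}

# n x 1 similarity matrix: each unique field value against the target label only.
FV1 = np.array([embed[v] @ embed[target_label] for v in field_values])
print(FV1.shape)   # (4,): one scalar O_a per field value, forming factor vector FV1
```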
- The factor vectors for all similarity measures are arranged column-wise to generate an
input matrix 532. Theinput matrix 532 inFIG. 5B has six factor vectors FV1 to FV6 corresponding to six similarity measures described above. Each row of theinput matrix 532 corresponds to a field value in a field of the fieldeddataset 202. Intuitively, an element of the factor vector for the given similarity measure specifies a likelihood that a corresponding unique field value in the dataset is similar to the target label in a context of the given similarity measure and conditioned on respective similarity measure values of other unique field values in the dataset for the given similarity measure. - A row of the input matrix represents a vector that encodes a likelihood that a corresponding unique field value in the dataset is similar to the target label in a context of the plurality of linguistic similarity measures and conditionable on respective similarity measure values of other unique field values in the dataset for the plurality of linguistic similarity measures.
-
- FIG. 6 illustrates processing of the input matrix 532 by the convolutional neural network (CNN) of the suggestion network 116. The CNN is a one-layer network. In a one-layer CNN, one set of filters is applied to the input matrix 532. A person skilled in the art will appreciate that additional layers can be added to the CNN. In the first step, 64 filters are row-wise convolved over the input matrix 532. A row-wise convolution of a filter on the input matrix 532 results in an evaluation vector (also referred to as a feature map).
- For example, filter 1 612 in FIG. 6 is convolved over the first row (y1, p1, q1, s1, t1, z1) of the input matrix 532 to generate a scalar "a0" at index position "0" of the evaluation vector EV1 622. Convolving the filter 1 612 over the second row of the input matrix 532 generates a scalar "a1" at index position "1" of the evaluation vector EV1. The same process is followed to generate scalars up to "an" in the evaluation vector EV1 corresponding to the "n" rows in the input matrix 532. A second evaluation vector EV2 624 is generated by row-wise convolving a filter 2 614 over the input matrix 532. The results of this convolution are the scalars "b0" to "bn" in the evaluation vector EV2 624. Sixty-four (64) evaluation vectors (or feature maps) EV1 622 to EV64 626 are generated by convolving sixty-four filters, filter 1 612 to filter 64 616, over the input matrix 532. The evaluation vectors EV1 622 to EV64 626 are provided as input to a fully connected (FC) neural network 632 to accumulate element-wise weighted sums of the evaluation vectors (feature maps) in an output vector 642. For example, the first element FC[0] of the output vector 642 is calculated as the weighted sum of corresponding elements in all of the evaluation vectors (feature maps), i.e., W1·EV1[0]+W2·EV2[0]+ . . . +W64·EV64[0]. The output vector 642 has "n" elements corresponding to the "n" rows of the input matrix 532.
- A nonlinearity function 652 is applied to the output vector 642 to produce a normalized output vector 662. Examples of nonlinearity functions include sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU) and leaky ReLU. A threshold 672 is applied to the normalized output vector 662 to determine similar data values. In the suggestion network 116, the elements of the output vector 682 represent the similarity of corresponding field values in a given input field of the fielded dataset to the target label. The higher the value of an element of the output vector 682, the higher the likelihood that the corresponding field value in the input field is similar to the target label. For example, as shown in FIG. 6, the first element in the output vector is above the threshold 672. Therefore, this element (x1) of the input field in the dataset 502 is recommended by the suggestion network as a replacement for the target label. In another implementation, multiple input field values can be recommended by the suggestion network 116. An expert can select one input field value from the suggested values to replace the target label.
- Having described the operation of the suggestion network 116 in FIG. 6, we now describe the training of the FC neural network 632. In FIG. 6, a forward pass of the suggestion network 116 is described. During training, the results of the output vector 682 are compared with the ground truth. For example, the suggested field value 682 is compared with the correct value to determine whether the suggestion network correctly identified the replacement field value in the field. In the suggestion network 116, one cost function is used to update the weights in the fully connected neural network FC 632. For example, as shown in FIG. 6, the first element O[0] 692 in the output vector 682 is above the threshold 672. Therefore, the first field value (x1) of the particular field in the dataset 502 is used by the suggestion network 116 to replace the anomalous input field value, or target label (TL).
- The training data for the suggestion network 116 is automatically generated by constructing positive and negative examples for inclusion in the training dataset. For each linguistic similarity measure, a first set of field values is identified from a vocabulary which are similar to each other. A second set of field values is also identified from the vocabulary that are dissimilar to each of the field values in the first set and to the field values selected so far in the second set. The training dataset for the given linguistic similarity measure is generated by randomly selecting some field values from the first and second sets as positive and negative examples respectively. This process is repeated for the five linguistic similarity measures described above: semantic similarity, syntactical similarity, soundex similarity, length similarity and format similarity. For frequency similarity, the system randomly multiplies each unique field value to increase its frequency of occurrence in the particular field of the fielded dataset. A target label is randomly selected from a set of anomalous data values.
- FIG. 7 shows a high-level view of a suggestion network 116 with one convolution layer.
- FIG. 8 is a simplified block diagram 800 of a computer system 810 that can be used to implement the machine learning system 110. Computer system 810 typically includes at least one processor 814 that communicates with a number of peripheral devices via bus subsystem 812. These peripheral devices can include a storage subsystem 824 including, for example, memory devices and a file storage subsystem, user interface input devices 822, user interface output devices 820, and a network interface subsystem 816. The input and output devices allow user interaction with computer system 810. Network interface subsystem 816 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems. - User interface input devices 822 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into
computer system 810. - User interface output devices 820 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from
computer system 810 to the user or to another machine or computer system. -
Storage subsystem 824 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processor 814 alone or in combination with other processors. -
Memory subsystem 826 used in the storage subsystem can include a number of memories including a main random access memory (RAM) 830 for storage of instructions and data during program execution and a read only memory (ROM) 832 in which fixed instructions are stored. A file storage subsystem 828 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 828 in the storage subsystem 824, or in other machines accessible by the processor. - Bus subsystem 812 provides a mechanism for letting the various components and subsystems of
computer system 810 communicate with each other as intended. Although bus subsystem 812 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses. -
Computer system 810 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computer system 810 depicted in FIG. 8 is intended only as one example. Many other configurations of computer system 810 are possible having more or fewer components than the computer system depicted in FIG. 8. - Anomaly Detection Network
- The technology disclosed relates to detection of anomalous field values for a particular field in a fielded dataset.
- The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections—these recitations are hereby incorporated forward by reference into each of the following implementations.
- A first system implementation of the technology disclosed includes one or more processors coupled to the memory. The memory is loaded with computer instructions to detect an anomalous field value. The system determines which field values for a particular field in a fielded dataset are anomalous. The system compares a particular unique field value to the other unique field values for the particular field by applying a plurality of similarity measures and generates a factor vector that has one scalar for each of the unique field values. The system then evaluates the factor vector using convolution filters in a convolutional neural network (abbreviated CNN) to generate evaluation vectors (also referred to as feature maps). The system further evaluates the evaluation vectors using a fully connected (abbreviated FC) neural network to produce an anomaly scalar for the particular unique field value. A threshold is applied to the anomaly scalar to determine whether the particular unique field value is anomalous.
- This system implementation and other systems disclosed optionally include one or more of the following features. System can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- The system uses a plurality of similarity measures including semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
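- To make two of these measures concrete, here are hedged stand-ins for field length similarity and character-by-character format similarity; the patent does not give formulas, so both functions are illustrative assumptions.

```python
def length_similarity(a: str, b: str) -> float:
    """Field length similarity: 1.0 for equal lengths, shrinking with the gap."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - abs(len(a) - len(b)) / longest

def format_similarity(a: str, b: str) -> float:
    """Character-by-character format similarity over character classes."""
    def char_class(ch: str) -> str:
        return "9" if ch.isdigit() else ("A" if ch.isalpha() else ch)
    pattern_a = [char_class(c) for c in a]
    pattern_b = [char_class(c) for c in b]
    matches = sum(x == y for x, y in zip(pattern_a, pattern_b))
    return matches / max(len(pattern_a), len(pattern_b), 1)

assert format_similarity("AB-12", "XY-34") == 1.0  # identical layout, different characters
```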
- The system further includes determining that one or more field values in the fielded dataset are similar to an input value (also referred to as a target label) for a particular field. The system compares a particular input value to the unique field values for the particular input field by applying a plurality of similarity measures. This results in the generation of a factor vector that has one scalar for each of the unique field values. The system evaluates the factor vector using the convolution filters in the CNN to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values. The system then evaluates the evaluation vectors using the FC neural network to produce suggestion scalars for similarity to the particular input value. The system uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- The system includes calculating factor vectors for some of the similarity measures by calculating an inner product between respective similarity measure values of the unique field values in the dataset to form a similarity matrix. A row-wise average of the inner-product results in the similarity matrix is then calculated. A factor vector is formulated for the given similarity measure by arranging the row-wise averages as elements of the factor vector.
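- A compact sketch of this factor-vector calculation for a single measure follows; the per-value measure vectors are invented for the example (for semantic similarity they might be word embeddings).

```python
import numpy as np

# One small vector of similarity-measure values per unique field value.
measure_values = np.array([
    [0.9, 0.1],   # x1
    [0.8, 0.2],   # x2
    [0.1, 0.9],   # x3, the odd one out
])

similarity_matrix = measure_values @ measure_values.T  # pairwise inner products
factor_vector = similarity_matrix.mean(axis=1)         # row-wise averages

# x3 receives the lowest average similarity, flagging it as the likely
# anomaly under this measure, in the context of the other unique values.
```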
- An element of the factor vector for the given linguistic similarity measure specifies a likelihood that a corresponding unique field value in the dataset is anomalous in the context of the given linguistic similarity measure. Additionally, the element of the factor vector is also conditioned on the respective similarity measure values of the other unique field values in the dataset for the given linguistic similarity measure.
- The system generates an input for the convolutional neural network (abbreviated CNN) by column-wise arranging the factor vectors in an input matrix. In such an implementation, the convolution filters apply row-wise on the input matrix. Further, in such an implementation, a row in the input matrix represents a vector that encodes a likelihood that a corresponding unique field value in the dataset is anomalous in the context of the plurality of linguistic similarity measures and conditioned on the respective similarity measure values of the other unique field values in the dataset for the plurality of linguistic similarity measures.
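- The column-wise arrangement can be shown in a couple of lines; the factor-vector values below are illustrative.

```python
import numpy as np

fv_semantic = np.array([0.58, 0.55, 0.42])  # one factor vector per measure
fv_length   = np.array([0.90, 0.88, 0.30])
fv_format   = np.array([0.95, 0.93, 0.20])

# Each factor vector becomes one column, so row i collects every measure's
# score for the i-th unique field value; filters then convolve row-wise.
input_matrix = np.column_stack([fv_semantic, fv_length, fv_format])  # shape (3, 3)
```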
- The system automatically constructs positive and negative examples for inclusion in a training dataset. For a given linguistic similarity measure, the system constructs the training dataset by determining a first set of similar field values from a vocabulary and a second set of dissimilar field values from the vocabulary. The system then randomly selects some field values from the first set as positive examples and some from the second set as negative examples. The system repeats the above process for each linguistic similarity measure to determine and select positive and negative training examples. The system stores the randomly selected field values for the plurality of similarity measures as the training dataset.
- The system trains the convolutional neural network (CNN) and the fully connected (FC) neural network using the positive and negative examples in the training dataset.
- The system uses at least two cost functions to evaluate performance of the CNN and the FC neural network during training. A first cost function evaluates classification of unique field values as anomalies and a second cost function evaluates classification of unique field values as non-anomalies. In such an implementation, the system calculates separate gradients for the two cost functions and backpropagates the gradients to the CNN and the FC neural network during training.
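- One way to realize the two-cost-function scheme is to split a per-element binary cross-entropy by class and backpropagate each term separately; the split, the labels, and the toy numbers below are assumptions, since the patent does not specify the cost functions.

```python
import numpy as np

def bce_per_element(p, t, eps=1e-9):
    p = np.clip(p, eps, 1.0 - eps)
    return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

predicted = np.array([0.9, 0.2, 0.7, 0.1])  # anomaly scores for four field values
labels = np.array([1.0, 0.0, 1.0, 0.0])     # 1 = anomaly, 0 = non-anomaly

losses = bce_per_element(predicted, labels)
anomaly_cost = losses[labels == 1].mean()       # first cost function
non_anomaly_cost = losses[labels == 0].mean()   # second cost function
# Gradients of anomaly_cost and non_anomaly_cost would be computed and
# backpropagated separately through the FC layer and the CNN.
```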
- In one implementation of the system, the convolutional neural network (CNN) is a one-layer CNN. In another implementation of the system, the convolutional neural network is a two-layer CNN.
- A second system implementation of the technology disclosed includes one or more processors coupled to the memory. The memory is loaded with computer instructions to detect linguistically anomalous field values in a dataset. The system calculates at least one factor vector for each of a plurality of linguistic similarity measures. For each linguistic similarity measure, the system calculates its factor vector by averaging product results and/or distribution values calculated from similarity measure values of unique field values in the dataset for the given linguistic similarity measure. The factor vectors are provided as input to a convolutional neural network (abbreviated CNN). The system applies convolution filters to the factor vectors to generate evaluation vectors (also referred to as feature maps). Following this, the system provides the evaluation vectors as input to a fully-connected (abbreviated FC) neural network to accumulate element-wise weighted sums of the evaluation vectors in an output vector. Following this, the system applies a nonlinearity function to the output vector to produce a normalized output vector. Finally, the system applies thresholding to the normalized output vector to identify anomalous and similar field values in the dataset.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to the second system implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.
- A first method implementation of the technology disclosed includes detecting anomalous field values. The method includes determining which field values for a particular field in a fielded dataset are anomalous. A particular unique field value is compared to other unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values. The method then evaluates the factor vector using convolution filters in a convolutional neural network (abbreviated CNN) to generate evaluation vectors (also referred to as feature maps). The method further evaluates the evaluation vectors using a fully connected (abbreviated FC) neural network to produce an anomaly scalar for the particular unique field value. A threshold is applied to the anomaly scalar to determine whether the particular unique field value is anomalous.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to this method implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the first method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the first method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to the CRM implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- A second method implementation of the technology disclosed includes detecting linguistically anomalous field values in a dataset. The method includes calculating at least one factor vector for each of a plurality of linguistic similarity measures. For a given linguistic similarity measure, the method calculates its factor vector by averaging product results and/or distribution values calculated from similarity measure values of unique field values in the dataset for the given linguistic similarity measure. The method includes providing the factor vectors as input to a convolutional neural network (abbreviated CNN) and applying convolution filters to the factor vectors to generate evaluation vectors (also referred to as feature maps). Following this, the method includes providing the evaluation vectors as input to a fully-connected (abbreviated FC) neural network to accumulate element-wise weighted sums of the evaluation vectors in an output vector. Following this, the method includes applying a nonlinearity function to the output vector to produce a normalized output vector. Finally, the method includes thresholding the normalized output vector to identify anomalous and similar field values in the dataset.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to this method implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the second method described above.
- Each of the features discussed in this particular implementation section for the first system implementation applies equally to the CRM implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- A third system implementation of the technology disclosed includes one or more processors coupled to the memory. The memory is loaded with computer instructions to suggest one or more candidates for a particular input value. The system determines that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset. The system performs this determination by comparing a particular input value to unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values. Following this, the system evaluates the factor vector using convolution filters in a convolutional neural network (CNN) to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values. The system further evaluates the evaluation vectors using a fully-connected (FC) neural network to produce suggestion scalars for similarity to the particular input value. Finally, the system uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- This system implementation and other systems disclosed optionally include one or more of the following features. System can also include features described in connection with methods disclosed. In the interest of conciseness, alternative combinations of system features are not individually enumerated. Features applicable to systems, methods, and articles of manufacture are not repeated for each statutory class set of base features. The reader will understand how features identified in this section can readily be combined with base features in other statutory classes.
- The system uses a plurality of similarity measures including semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
- The system constructs an input to the CNN by column-wise arranging one or more factor vectors in an input matrix. In such an implementation, the convolution filters apply row-wise on the input matrix.
- The system automatically constructs positive and negative examples for inclusion in a training dataset. For a given linguistic similarity measure, the system determines a first set of similar field values from a vocabulary and determines a second set of dissimilar field values from the vocabulary. Following this, the system randomly selects some field values from the first and second sets as positive and negative examples respectively. The system repeats the above process of determining the first set and the second set of field values for a plurality of similarity measures. Finally, the system stores the randomly selected field values for the plurality of similarity measures as the training dataset.
- In such an implementation, the system further includes training the CNN and the FC neural network using the positive and negative examples in the training dataset.
- The system uses at least one cost function to evaluate performance of the CNN and the FC neural network during training.
- In one implementation of the system, the convolutional neural network (CNN) is a one-layer CNN. In another implementation of the system, the convolutional neural network (CNN) is a two-layer CNN.
- Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform functions of the system described above. Yet another implementation may include a method performing the functions of the system described above.
- In one implementation, the system includes determining which field values for a particular field in the fielded dataset are anomalous. The system performs this determination by comparing a particular unique field value to other unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values. Following this, the system evaluates the factor vector using the convolution filters in the CNN to generate evaluation vectors (also referred to as feature maps). The system further evaluates the evaluation vectors using the FC neural network to produce an anomaly scalar for the particular unique field value. Finally, the system applies thresholding to the anomaly scalar to determine whether the particular unique field value is anomalous.
- A third method implementation of the technology disclosed includes suggesting one or more candidates for a particular input value. The method includes determining that one or more field values in a set of field values are similar to an input value for a particular field in a fielded dataset. The method performs this determination by comparing a particular input value to unique field values for the particular field by applying a plurality of similarity measures and generating a factor vector that has one scalar for each of the unique field values. Following this, the method evaluates the factor vector using convolution filters in a convolutional neural network (CNN) to generate evaluation vectors (also referred to as feature maps) for similarity to the unique field values. The method further evaluates the evaluation vectors using a fully-connected (FC) neural network to produce suggestion scalars for similarity to the particular input value. Finally, the method uses the suggestion scalars to determine one or more suggestion candidates for the particular input value.
- Each of the features discussed in this particular implementation section for the third system implementation applies equally to this method implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- Other implementations may include a non-transitory computer readable storage medium storing instructions executable by a processor to perform the third method described above. Yet another implementation may include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform the third method described above.
- Computer readable media (CRM) implementations of the technology disclosed include a non-transitory computer readable storage medium impressed with computer program instructions that, when executed on a processor, implement the method described above.
- Each of the features discussed in this particular implementation section for the third system implementation applies equally to the CRM implementation. As indicated above, all the system features are not repeated here and should be considered repeated by reference.
- The technology disclosed, and particularly the
anomaly detection network 114 and the suggestion network 116, can be implemented in the context of any computer-implemented system including a database system, a multi-tenant environment, or a relational database implementation like an Oracle™ compatible database implementation, an IBM DB2 Enterprise Server™ compatible relational database implementation, a MySQL™ or PostgreSQL™ compatible relational database implementation, a Microsoft SQL Server™ compatible relational database implementation, or a NoSQL™ non-relational database implementation such as a Vampire™ compatible non-relational database implementation, an Apache Cassandra™ compatible non-relational database implementation, a BigTable™ compatible non-relational database implementation, or an HBase™ or DynamoDB™ compatible non-relational database implementation. In addition, the technology disclosed can be implemented using different programming models like MapReduce™, bulk synchronous programming, MPI primitives, etc., or different scalable batch and stream management systems like Amazon Web Services (AWS)™, including Amazon Elasticsearch Service™ and Amazon Kinesis™, Apache Storm™, Apache Spark™, Apache Kafka™, Apache Flink™, Truviso™, IBM Info-Sphere™, Borealis™, and Yahoo! S4™. - Any data structures and code described or referenced above are stored according to many implementations on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. This includes, but is not limited to, volatile memory, non-volatile memory, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed. -
- The preceding description is presented to enable the making and use of the technology disclosed. Various modifications to the disclosed implementations will be apparent, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. The scope of the technology disclosed is defined by the appended claims.
Claims (20)
1. A system of an anomaly detection network, including:
a communication interface receiving a set of input values in a format of an input matrix;
a memory storing the anomaly detection network comprising one or more convolution layers and a fully-connected layer connected to the one or more convolution layers;
one or more processors coupled to the memory to perform operations based on the anomaly detection network, comprising:
generating, by the one or more convolution layers, one or more evaluation vectors from the input matrix,
generating, by the fully-connected layer, accumulated element-wise weighted sums of the one or more evaluation vectors to form an output vector, and
determining an indication that suggests an anomaly in the set of input values based on the output vector.
2. The system of claim 1, wherein the set of input values are for a particular field in a fielded dataset.
3. The system of claim 1, wherein the one or more convolution layers include at least one convolutional filter that convolves a first row of the input matrix to compute a first entry in a first evaluation vector in the one or more evaluation vectors.
4. The system of claim 1, wherein the memory further stores a factor vector calculator connected to the one or more convolution layers, and the factor vector calculator comprises a plurality of similarity measure calculators configured to apply a plurality of similarity measures to the set of input values, respectively, and
wherein the factor vector calculator is further configured to compute factor vectors based on the plurality of similarity measures to form the input matrix.
5. The system of claim 4, wherein the plurality of similarity measures include any combination of semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
6. The system of claim 1, wherein the memory further stores a non-linear module connected to the fully-connected layer,
wherein the non-linear module is configured to normalize the output vector by any of a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) and a leaky ReLU.
7. The system of claim 1, wherein the operations further comprise:
determining whether the set of input values contains the anomaly by comparing each entry in the output vector with a pre-defined threshold.
8. The system of claim 1, wherein the operations further comprise:
determining one or more suggestion candidates to replace a particular input value in the set of input values based on a corresponding entry that corresponds to the particular input value in the output vector.
9. A method for anomaly detection in a set of field values, the method comprising:
receiving, via a communication interface, a set of input values in a format of an input matrix;
generating, by one or more convolution layers, one or more evaluation vectors from the input matrix,
generating, by a fully-connected layer connected to the one or more convolution layers, accumulated element-wise weighted sums of the one or more evaluation vectors to form an output vector, and
determining an indication that suggests an anomaly in the set of input values based on the output vector.
10. The method of claim 9, wherein the one or more convolution layers include at least one convolutional filter that convolves a first row of the input matrix to compute a first entry in a first evaluation vector in the one or more evaluation vectors.
11. The method of claim 9, further comprising:
applying, by a factor vector calculator connected to the one or more convolution layers, a plurality of similarity measures to the set of input values, respectively; and
computing factor vectors based on the plurality of similarity measures to form the input matrix.
12. The method of claim 11, wherein the plurality of similarity measures include any combination of semantic similarity, syntactic similarity, soundex similarity, character-by-character format similarity, field length similarity, and dataset frequency similarity.
13. The method of claim 9, further comprising:
normalizing, by a non-linear function module, the output vector by any of a sigmoid function, a hyperbolic tangent (tanh) function, a rectified linear unit (ReLU) and a leaky ReLU.
14. The method of claim 9, further comprising:
determining whether the set of input values contains the anomaly by comparing each entry in the output vector with a pre-defined threshold.
15. The method of claim 9, further comprising:
determining one or more suggestion candidates to replace a particular input value in the set of input values based on a corresponding entry that corresponds to the particular input value in the output vector.
16. A non-transitory processor-executable storage medium storing a plurality of processor-executable instructions for anomaly detection in a set of field values, the instructions being executed by a processor to perform operations comprising:
receiving a set of input values in a format of an input matrix;
generating, by one or more convolution layers, one or more evaluation vectors from the input matrix,
generating, by a fully-connected layer connected to the one or more convolution layers, accumulated element-wise weighted sums of the one or more evaluation vectors to form an output vector, and
determining an indication that suggests an anomaly in the set of input values based on the output vector.
17. The non-transitory processor-executable storage medium of claim 16, wherein the one or more convolution layers include at least one convolutional filter that convolves a first row of the input matrix to compute a first entry in a first evaluation vector in the one or more evaluation vectors.
18. The non-transitory processor-executable storage medium of claim 16, wherein the operations further comprise:
applying, by a factor vector calculator connected to the one or more convolution layers, a plurality of similarity measures to the set of input values, respectively; and
computing factor vectors based on the plurality of similarity measures to form the input matrix.
19. The non-transitory processor-executable storage medium of claim 16, wherein the operations further comprise:
determining whether the set of input values contains the anomaly by comparing each entry in the output vector with a pre-defined threshold.
20. The non-transitory processor-executable storage medium of claim 16, wherein the operations further comprise:
determining one or more suggestion candidates to replace a particular input value in the set of input values based on a corresponding entry that corresponds to the particular input value in the output vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/373,600 US20220004843A1 (en) | 2017-10-05 | 2021-07-12 | Convolutional neural network (cnn)-based anomaly detection |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/726,267 US11093816B2 (en) | 2017-10-05 | 2017-10-05 | Convolutional neural network (CNN)-based anomaly detection |
US17/373,600 US20220004843A1 (en) | 2017-10-05 | 2021-07-12 | Convolutional neural network (cnn)-based anomaly detection |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/726,267 Continuation US11093816B2 (en) | 2017-10-05 | 2017-10-05 | Convolutional neural network (CNN)-based anomaly detection |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220004843A1 (en) | 2022-01-06
Family
ID=65992572
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/726,267 Active 2040-06-18 US11093816B2 (en) | 2017-10-05 | 2017-10-05 | Convolutional neural network (CNN)-based anomaly detection |
US17/373,600 Pending US20220004843A1 (en) | 2017-10-05 | 2021-07-12 | Convolutional neural network (cnn)-based anomaly detection |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/726,267 Active 2040-06-18 US11093816B2 (en) | 2017-10-05 | 2017-10-05 | Convolutional neural network (CNN)-based anomaly detection |
Country Status (1)
Country | Link |
---|---|
US (2) | US11093816B2 (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10565305B2 (en) | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
US10565318B2 (en) | 2017-04-14 | 2020-02-18 | Salesforce.Com, Inc. | Neural machine translation with latent tree attention |
US11087211B2 (en) * | 2017-10-05 | 2021-08-10 | Salesforce.Com, Inc. | Convolutional neural network (CNN)-based suggestions for anomaly input |
US11604956B2 (en) | 2017-10-27 | 2023-03-14 | Salesforce.Com, Inc. | Sequence-to-sequence prediction using a neural network model |
US11170287B2 (en) | 2017-10-27 | 2021-11-09 | Salesforce.Com, Inc. | Generating dual sequence inferences using a neural network model |
US10573295B2 (en) | 2017-10-27 | 2020-02-25 | Salesforce.Com, Inc. | End-to-end speech recognition with policy learning |
US10592767B2 (en) | 2017-10-27 | 2020-03-17 | Salesforce.Com, Inc. | Interpretable counting in visual question answering |
US11562287B2 (en) | 2017-10-27 | 2023-01-24 | Salesforce.Com, Inc. | Hierarchical and interpretable skill acquisition in multi-task reinforcement learning |
US11928600B2 (en) | 2017-10-27 | 2024-03-12 | Salesforce, Inc. | Sequence-to-sequence prediction using a neural network model |
US10542270B2 (en) | 2017-11-15 | 2020-01-21 | Salesforce.Com, Inc. | Dense video captioning |
US11276002B2 (en) | 2017-12-20 | 2022-03-15 | Salesforce.Com, Inc. | Hybrid training of deep networks |
US11227218B2 (en) | 2018-02-22 | 2022-01-18 | Salesforce.Com, Inc. | Question answering from minimal context over documents |
US10929607B2 (en) | 2018-02-22 | 2021-02-23 | Salesforce.Com, Inc. | Dialogue state tracking using a global-local encoder |
US11568306B2 (en) | 2019-02-25 | 2023-01-31 | Salesforce.Com, Inc. | Data privacy protected machine learning systems |
CN111047036B (en) * | 2019-12-09 | 2023-11-14 | Oppo广东移动通信有限公司 | Neural network processor, chip and electronic equipment |
US11741511B2 (en) * | 2020-02-03 | 2023-08-29 | Intuit Inc. | Systems and methods of business categorization and service recommendation |
US11762990B2 (en) * | 2020-04-07 | 2023-09-19 | Microsoft Technology Licensing, Llc | Unstructured text classification |
CN113821791B (en) * | 2020-06-18 | 2024-07-12 | 中国电信股份有限公司 | Method, system, storage medium and device for detecting SQL injection |
US11875294B2 (en) | 2020-09-23 | 2024-01-16 | Salesforce, Inc. | Multi-objective recommendations in a data analytics system |
US11792438B2 (en) * | 2020-10-02 | 2023-10-17 | Lemon Inc. | Using neural network filtering in video coding |
CN113672976B (en) * | 2021-08-04 | 2024-07-16 | 支付宝(杭州)信息技术有限公司 | Sensitive information detection method and device |
CN114692783B (en) * | 2022-04-22 | 2024-04-12 | 中国地质大学(北京) | Intelligent service abnormality detection method based on hierarchical graph deviation network |
CN118331831A (en) * | 2023-10-20 | 2024-07-12 | 天翼爱音乐文化科技有限公司 | Application system efficiency evaluation method and device, electronic equipment and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9275339B2 (en) * | 2012-04-24 | 2016-03-01 | Raytheon Company | System and method for probabilistic name matching |
US9715660B2 (en) * | 2013-11-04 | 2017-07-25 | Google Inc. | Transfer learning for deep neural network based hotword detection |
US20160180214A1 (en) * | 2014-12-19 | 2016-06-23 | Google Inc. | Sharp discrepancy learning |
- 2017-10-05: US15/726,267 filed in the US (published as US11093816B2, status: Active)
- 2021-07-12: US17/373,600 filed in the US (published as US20220004843A1, status: Pending)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030161513A1 (en) * | 2002-02-22 | 2003-08-28 | The University Of Chicago | Computerized schemes for detecting and/or diagnosing lesions on ultrasound images using analysis of lesion shadows |
US20150178944A1 (en) * | 2013-12-20 | 2015-06-25 | Alcatel-Lucent Usa Inc. | Methods and apparatuses for detecting anomalies in the compressed sensing domain |
US20150227591A1 (en) * | 2014-02-12 | 2015-08-13 | International Business Machines Corporation | System and Method for Automatically Validating Classified Data Objects |
US20160196479A1 (en) * | 2015-01-05 | 2016-07-07 | Superfish Ltd. | Image similarity as a function of weighted descriptor similarities derived from neural networks |
US11854308B1 (en) * | 2016-02-17 | 2023-12-26 | Ultrahaptics IP Two Limited | Hand initialization for machine learning based gesture recognition |
US20170372232A1 (en) * | 2016-06-27 | 2017-12-28 | Purepredictive, Inc. | Data quality detection and compensation for machine learning |
US20190197425A1 (en) * | 2016-09-16 | 2019-06-27 | Siemens Aktiengesellschaft | Deep convolutional factor analyzer |
US20180082443A1 (en) * | 2016-09-21 | 2018-03-22 | Realize, Inc. | Anomaly detection in volumetric images |
US20180096243A1 (en) * | 2016-09-30 | 2018-04-05 | General Electric Company | Deep learning for data driven feature representation and anomaly detection |
Non-Patent Citations (1)
Title |
---|
• NPL: Munawar, Asim, Phongtharin Vinayavekhin, and Giovanni De Magistris. "Spatio-temporal anomaly detection for industrial robots through prediction in unsupervised feature space." (May, 2017). (Year: 2017) * |
Also Published As
Publication number | Publication date |
---|---|
US11093816B2 (en) | 2021-08-17 |
US20190108432A1 (en) | 2019-04-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220004843A1 (en) | Convolutional neural network (cnn)-based anomaly detection | |
US11087211B2 (en) | Convolutional neural network (CNN)-based suggestions for anomaly input | |
US11126890B2 (en) | Robust training of large-scale object detectors with a noisy dataset | |
CN110084216B (en) | Face recognition model training and face recognition method, system, device and medium | |
US11797822B2 (en) | Neural network having input and hidden layers of equal units | |
CN110377740B (en) | Emotion polarity analysis method and device, electronic equipment and storage medium | |
US20200311519A1 (en) | Systems and methods for deep skip-gram network based text classification | |
US20230222285A1 (en) | Layout-Aware Multimodal Pretraining for Multimodal Document Understanding | |
US20180308003A1 (en) | Hybrid approach to approximate string matching using machine learning | |
CA3039551A1 (en) | Training a joint many-task neural network model using successive regularization | |
US11875233B2 (en) | Automatic recognition of entities related to cloud incidents | |
US20180203836A1 (en) | Predicting spreadsheet properties | |
US20200311542A1 (en) | Encoder Using Machine-Trained Term Frequency Weighting Factors that Produces a Dense Embedding Vector | |
US11257592B2 (en) | Architecture for machine learning model to leverage hierarchical semantics between medical concepts in dictionaries | |
US11423436B2 (en) | Interpretable click-through rate prediction through hierarchical attention | |
US20210357766A1 (en) | Classification of maintenance reports for modular industrial equipment from free-text descriptions | |
US11379685B2 (en) | Machine learning classification system | |
US20210056264A1 (en) | Neologism classification techniques | |
Chen et al. | Survey: Exploiting data redundancy for optimization of deep learning | |
US20230196804A1 (en) | Object annotation using sparse active learning and core set selection | |
Mishra | PyTorch Recipes: A Problem-Solution Approach | |
US11783609B1 (en) | Scalable weak-supervised learning with domain constraints | |
Oliveira et al. | OPTIC: A Deep Neural Network Approach for Entity Linking using Word and Knowledge Embeddings. | |
US20240290095A1 (en) | Method, electronic device, and computer program product for extracting target frame | |
US20230222778A1 (en) | Core set discovery using active learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: SALESFORCE.COM, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LU, CHANG;ZHANG, LINGTAO;SIGNING DATES FROM 20170930 TO 20171003;REEL/FRAME:057855/0030 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |