CN113609290A - Address recognition method and device and storage medium - Google Patents
- Publication number
- CN113609290A (application number CN202110856358.4A)
- Authority
- CN
- China
- Prior art keywords
- word
- group
- preset
- address information
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
Abstract
The embodiments of the application provide an address recognition method and device and a storage medium. The method comprises: extracting at least one group of word vectors and at least one word statistical quantity from address information to be recognized according to preset classification dimensions; using a preset neural network model to sequentially predict, within each group of word vectors, the association probability between each word vector and the next adjacent word vector, so as to obtain a group of association probabilities corresponding to each group of word vectors, the preset neural network model being a model obtained through unsupervised training; and inputting the at least one group of association probabilities corresponding to the at least one group of word vectors, together with the at least one word statistical quantity, into a preset classification model to obtain a classification result corresponding to the address information to be recognized, the classification result being used to determine the authenticity of the address information to be recognized, and the preset classification model being a model obtained through supervised training.
Description
Technical Field
The present application relates to the field of electronic applications, and in particular, to an address identification method and apparatus, and a storage medium.
Background
The online black market (underground fraud industry) poses a serious threat to e-commerce platforms. Black-market attackers register accounts in bulk on the major e-commerce platforms and use them for fraudulent activities such as helping merchants place fake orders and inflate credit ratings, which undermines the e-commerce evaluation system and harms the interests of the platforms, of merchants operating legitimately, and of buyers. In the logistics and delivery stage in particular, large numbers of black-market actors disrupt the normal ecosystem of the e-commerce, logistics distribution, and local life service industries through fake-order brushing, malicious order placement, and similar behavior. To uncover the address fraud risk involved, it is highly desirable to be able to identify false addresses in trade orders.
However, current rule-based risk scoring relies on rules derived from expert business experience, so the rules are updated slowly, cover few cases, and are prone to missed detections. Risk scoring methods based on supervised classification have therefore been proposed. Yet false addresses change continuously, and real address samples far outnumber false address samples, so these methods cannot accurately identify false addresses in complex false address detection scenarios.
Disclosure of Invention
The embodiment of the application provides an address identification method and device and a storage medium, which can improve the accuracy of identifying a false address.
The technical scheme of the application is realized as follows:
In a first aspect, an embodiment of the present application provides an address identification method, where the method includes:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
In a second aspect, an embodiment of the present application provides an address identification apparatus, including:
the extraction unit is used for extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension;
the prediction unit is used for sequentially predicting the association probability between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
the classification unit is used for inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, and the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
In a third aspect, an embodiment of the present application provides an address identification apparatus, where the apparatus includes: a processor, a memory, and a communication bus; the processor implements the method as described above when executing the running program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method as described above.
The embodiments of the application provide an address recognition method and device and a storage medium. The method comprises: extracting at least one group of word vectors and at least one word statistical quantity from address information to be recognized according to preset classification dimensions; using a preset neural network model to sequentially predict, within each group of word vectors, the association probability between each word vector and the next adjacent word vector, so as to obtain a group of association probabilities corresponding to each group of word vectors, the preset neural network model being a model obtained through unsupervised training; and inputting the at least one group of association probabilities corresponding to the at least one group of word vectors, together with the at least one word statistical quantity, into a preset classification model to obtain a classification result corresponding to the address information to be recognized, the classification result being used to determine the authenticity of the address information to be recognized, and the preset classification model being a model obtained through supervised training. This scheme combines a preset neural network model obtained through unsupervised training with a preset classification model obtained through supervised training. The address information to be recognized is converted into word segmentation information in the form of text sequences according to the preset classification dimensions, the unsupervised preset neural network model computes association probabilities over the feature information in the address text, and the preset classification model then performs category recognition. Key features that distinguish real addresses from false addresses can thus be extracted, which improves the accuracy of false address recognition.
Drawings
Fig. 1 is a flowchart of an address identification method according to an embodiment of the present application;
Fig. 2 is a block diagram illustrating an exemplary address identification process provided by an embodiment of the present application;
Fig. 3 is a first schematic structural diagram of an address identification apparatus according to an embodiment of the present application;
Fig. 4 is a second schematic structural diagram of an address identification apparatus according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
An embodiment of the present application provides an address identification method, as shown in fig. 1, the method may include:
S101, extracting at least one group of word vectors and at least one word statistical quantity from the address information to be recognized according to preset classification dimensions.
The address identification method provided by the embodiment of the application is suitable for a scene of carrying out false address identification on addresses in trade orders on an e-commerce platform, a logistics distribution platform and a local life platform.
In the embodiment of the present application, the device for performing address recognition may be any device having data processing and storing functions, for example: tablet computers, mobile phones, Personal Computers (PCs), notebook computers, wearable devices, and the like.
In the embodiment of the application, the address information to be recognized is obtained from a transaction order and first undergoes data preprocessing such as word segmentation and word frequency statistics. Specifically, word segmentation is performed on the address information to be recognized to obtain a word segmentation result, and word frequency statistics are performed on the word segmentation result to obtain first segmented words and second segmented words, where the word frequency of a first segmented word is higher than a preset word frequency threshold and the word frequency of a second segmented word is lower than the preset word frequency threshold. At least one group of word vectors and at least one word statistical quantity are then extracted from the preprocessed address information to be recognized according to the preset classification dimensions.
It should be noted that, in the embodiment of the present application, the address information to be recognized may be segmented with a word segmentation tool such as jieba; the specific word segmentation method may be selected according to the actual situation and is not specifically limited in the embodiment of the present application.
Specifically, the address recognition device divides the first segmented words into at least one group of segmented words according to the preset classification dimensions and performs word vector extraction on the at least one group of segmented words to obtain the at least one group of word vectors. The address recognition device also counts the number of second segmented words; this number is one of the at least one word statistical quantity.
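The frequency-based split described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_frequency`, the tiny pre-segmented corpus, the threshold value, and the handling of words whose count equals the threshold (treated as rare here) are all assumptions, since the text only specifies "higher than" and "lower than" the threshold.

```python
from collections import Counter


def split_by_frequency(tokens, corpus_counts, threshold):
    """Split tokens into frequent ("first") and rare ("second") segmented words."""
    first, second = [], []
    for tok in tokens:
        # Assumption: a word whose count equals the threshold is treated as rare.
        if corpus_counts.get(tok, 0) > threshold:
            first.append(tok)
        else:
            second.append(tok)
    return first, second


# Hypothetical, already-segmented address corpus (a real system might segment with jieba).
corpus = [
    ["北京市", "海淀区", "中关村", "大街", "1", "号"],
    ["北京市", "朝阳区", "建国路", "88", "号"],
    ["北京市", "海淀区", "学院路", "5", "号"],
]
corpus_counts = Counter(tok for addr in corpus for tok in addr)

first, second = split_by_frequency(
    ["北京市", "海淀区", "乱码村", "号"], corpus_counts, threshold=1
)
rare_word_count = len(second)  # one of the word statistical quantities
```

Only the frequent words go on to word vector extraction; the rare words contribute just their count.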
It should be noted that, in the embodiment of the present application, word vectors may be extracted for the first segmented words using Sogou pre-trained word vectors. The specific word vector extraction method may be selected according to the actual situation and is not specifically limited in the embodiment of the present application.
It can be understood that word vectors are extracted only for the first segmented words, whose word frequency is higher than the preset word frequency threshold, before being input into the preset neural network model for probability calculation. This reduces the vocabulary dimension of the preset neural network model, greatly reduces the amount of data it has to process, and improves its processing speed.
In the embodiment of the application, the preset classification dimensions comprise seven classification dimensions: target address, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words. The target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals and Arabic numerals; rare words are words in the address information whose word frequency is lower than the preset word frequency threshold; stop words are words that characterize administrative divisions.
Illustratively, stop words are administrative division words such as province, city, district, county, street, and residential community; English characters are characters such as a, b, c; Chinese numerals are numerals such as 一 (one), 二 (two), 三 (three); Arabic numerals are 1, 2, 3, and so on.
It should be noted that, for the classification dimensions of target address, stop words, English characters, Chinese numerals and Arabic numerals, the first segmented words are input into the preset neural network model for probability calculation, so the first segmented words can be divided into five groups of segmented words under these five classification dimensions.
Further, in the embodiment of the application, the number of characters between every two adjacent stop words (that is, between every two adjacent word vectors in the group of word vectors corresponding to stop words) is determined from the address information to be recognized, yielding a group of character counts. A preset number of character counts is selected from this group and processed to obtain a statistical value; this statistical value is one of the at least one word statistical quantity.
It follows that, for the seven classification dimensions, the at least one group of word vectors may include five groups of word vectors corresponding to the five classification dimensions of target address, stop words, English characters, Chinese numerals and Arabic numerals, and the at least one word statistical quantity may include two word statistical quantities, namely the number of second segmented words and the statistical value, corresponding to the remaining two classification dimensions.
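As an illustration of dividing the frequent words into the five sequence dimensions, the sketch below assigns each token to one group. The stop-word set, the Chinese numeral character set, the regular expressions, and the function names are assumptions chosen for the example; the source does not specify how membership in each dimension is decided.

```python
import re

# Assumed examples of administrative-division stop words and Chinese numeral characters.
STOP_WORDS = {"省", "市", "区", "县", "街道", "小区"}
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万")

DIMENSIONS = ("target_address", "stop_word", "english", "chinese_numeral", "arabic_numeral")


def dimension_of(token):
    """Return the name of the classification dimension a token falls into."""
    if token in STOP_WORDS:
        return "stop_word"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "english"
    if re.fullmatch(r"[0-9]+", token):
        return "arabic_numeral"
    if token and all(ch in CHINESE_NUMERALS for ch in token):
        return "chinese_numeral"
    # Everything else counts as core ("target") address content.
    return "target_address"


def group_tokens(tokens):
    """Divide frequent tokens into five ordered groups, one per dimension."""
    groups = {name: [] for name in DIMENSIONS}
    for tok in tokens:
        groups[dimension_of(tok)].append(tok)
    return groups


groups = group_tokens(["北京", "市", "海淀", "区", "A", "座", "三", "层", "12", "号"])
```

Each group preserves the order in which its tokens appear in the address, which matters because the next step predicts each token from the one before it. Mapping the grouped tokens to word vectors is left out of this sketch.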
In this embodiment of the application, the preset number may be three, and the process of selecting the preset number of character counts from the group of character counts may include: the address recognition device sorts the group of character counts in descending order, takes the three largest character counts from the sorted group, and then computes a weighted sum of these three character counts using given weight values to obtain the final statistical value.
For example, the three largest character counts may correspond to weight values of 0.2, 0.7 and 0.1, respectively.
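The stop-word gap statistic can be sketched as below. The weights 0.2, 0.7, 0.1 come from the example in the text; the function name, the way stop words are located by substring search, and the zero-padding used when an address has fewer than three gaps are assumptions made for illustration.

```python
def stopword_gap_statistic(address, stop_words, weights=(0.2, 0.7, 0.1)):
    """Weighted sum of the largest character gaps between adjacent stop words."""
    # Locate every stop-word occurrence as (start, end) character offsets.
    spans = []
    for word in stop_words:
        start = address.find(word)
        while start != -1:
            spans.append((start, start + len(word)))
            start = address.find(word, start + 1)
    spans.sort()

    # Number of characters between each pair of adjacent stop words.
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(spans, spans[1:])]

    # Keep the largest gaps in descending order.
    top = sorted(gaps, reverse=True)[: len(weights)]
    # Assumption: pad with zeros when there are fewer gaps than weights.
    top += [0] * (len(weights) - len(top))
    return sum(w * g for w, g in zip(weights, top))


# "省" at index 2, "市" at index 6, "区" at index 9 give gaps of 3 and 2 characters.
stat = stopword_gap_statistic("河北省石家庄市长安区中山路1号", ["省", "市", "区"])
```

For this address the sorted gaps are 3, 2 and a padded 0, so the statistic is 0.2 × 3 + 0.7 × 2 + 0.1 × 0, which is approximately 2.0 up to floating-point rounding.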
S102, sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training.
In the embodiment of the application, the address recognition device inputs at least one group of word vectors into the preset neural network model, and predicts the association relationship between every two adjacent word vectors by using the preset neural network model aiming at each group of word vectors to obtain a group of association probabilities corresponding to each group of word vectors.
It should be noted that a preset length is configured for the preset neural network model. Before inputting the at least one group of word vectors into the preset neural network model, the address recognition device first adjusts the length of each group of word vectors to the preset length.
Illustratively, the preset length is 20: a group of word vectors longer than 20 is truncated to 20, and a group of word vectors shorter than 20 is padded to 20 by appending placeholders at the end.
It should be noted that the truncation may either discard the part exceeding 20 or move the part exceeding 20 into another group of word vectors; the specific truncation operation may be selected according to the actual situation and is not specifically limited in the embodiment of the present application.
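A minimal sketch of the length adjustment, using the discard variant of truncation. The function name `fit_length` is hypothetical, and the placeholder index 500 is an assumption borrowed from the vocabulary-index example given for the embedding layer.

```python
def fit_length(token_ids, preset_length=20, pad_id=500):
    """Truncate or right-pad a sequence of vocabulary indices to a fixed length."""
    if len(token_ids) >= preset_length:
        # Discard everything beyond the preset length.
        return token_ids[:preset_length]
    # Append placeholder indices at the end until the preset length is reached.
    return token_ids + [pad_id] * (preset_length - len(token_ids))


padded = fit_length([34, 56, 199], preset_length=5)
truncated = fit_length(list(range(30)))
```

The alternative mentioned in the text, moving the overflow into another group, would return the remainder `token_ids[preset_length:]` as a second sequence instead of dropping it.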
In the embodiment of the application, the preset neural network model is a char-RNN network comprising an embedding layer, an RNN neural network layer and a linear mapping layer.
In the embodiment of the present application, the at least one group of word vectors is input into the embedding layer in the form of vocabulary indices; for example, [34, 56, 199, 500, 500] may represent [Beijing, Haidian, garden, placeholder, placeholder]. The embedding layer maps the [b, s] matrix to a [b, s, l] matrix, where b is the batch size, s is the sentence length, and l is the word vector length.
In the embodiment of the application, the RNN neural network layer is composed of two unidirectional LSTM layers, each with 128 neurons.
In the embodiment of the application, the input dimension of the linear mapping layer is 128. The linear mapping layer maps the 128 dimensions to the vocabulary dimension, and a softmax function then gives, for each position, the probability of each word in the vocabulary being the next word.
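To make the last step concrete, the sketch below shows how a group of association probabilities could be read off the per-position vocabulary scores produced by the linear mapping layer. The network itself is not implemented here: `step_logits` stands in for its output, and the function names and toy three-word vocabulary are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of vocabulary scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def association_probabilities(step_logits, token_ids):
    """Probability assigned to each actual next token, given the previous ones.

    step_logits[t] holds the vocabulary scores emitted after reading token t.
    """
    probs = []
    for t in range(len(token_ids) - 1):
        dist = softmax(step_logits[t])
        probs.append(dist[token_ids[t + 1]])
    return probs


# Toy example: vocabulary of 3 words, sequence of 3 tokens.
step_logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.0, 0.0, 0.0]]
token_ids = [0, 1, 2]
probs = association_probabilities(step_logits, token_ids)
```

A sequence of s tokens yields s - 1 association probabilities, one per adjacent pair; the scores at the final position are unused because no next token follows.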
In the embodiment of the application, the preset neural network model is obtained through unsupervised training; specifically, an initial neural network model is trained without supervision on real sample address information to obtain the preset neural network model.
It should be noted that the real sample address information is divided into a training set and a verification set. The initial neural network model is trained on the training set, and the performance of the trained neural network model is checked on the verification set. Because the preset neural network model is obtained through unsupervised training, the indicator of whether its performance has improved is the loss value on the verification set, where a loss function measures the loss of the predicted probability from a previous word to the next word; a decrease in the verification-set loss indicates that the performance of the preset neural network model has improved.
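The source does not name the loss function. One common choice for next-word prediction is cross-entropy, which reduces to the mean negative log of the probability assigned to each true next word; the sketch below assumes that choice, and the function name and sample values are illustrative only.

```python
import math


def mean_negative_log_likelihood(prob_groups, eps=1e-12):
    """Average -log(p) over all next-word probabilities in a verification set."""
    total, count = 0.0, 0
    for probs in prob_groups:
        for p in probs:
            # Clamp to eps so a zero probability does not raise a math domain error.
            total += -math.log(max(p, eps))
            count += 1
    return total / count if count else 0.0


# Hypothetical next-word probabilities on the verification set at two checkpoints.
loss_before = mean_negative_log_likelihood([[0.2, 0.3], [0.25]])
loss_after = mean_negative_log_likelihood([[0.6, 0.7], [0.65]])
improved = loss_after < loss_before
```

Under this assumption, a lower value means the model assigns higher probability to the word sequences found in real addresses.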
S103, inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is a model obtained through supervised training.
In the embodiment of the application, the preset classification model is a Random Forest model.
In the embodiment of the application, the classification result that the preset classification model outputs for the at least one group of association probabilities and the at least one word statistical quantity is either a label (false address or real address) or a score. In the latter case, the address recognition device judges the risk level of the address information to be recognized according to the score and then determines whether the address information to be recognized is a real address or a false address.
In the embodiment of the application, the preset classification model is trained in a supervised manner. Specifically, real sample address information and false sample address information are processed by the preset neural network model to obtain sample output results; the sample output results are then divided into a training set and a test set, and an initial classification model is trained with supervision on the sample output results to obtain the preset classification model.
In the embodiment of the application, the training set is input into the initial classification model for training, the trained classification model is then tested on the test set, and the preset classification model is finally obtained.
Exemplarily, as shown in Fig. 2, the original address first undergoes data preprocessing and is then divided into segmented words along seven dimensions: core address content, stop words such as province and city, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the length of characters between stop words. The segmented words of the first five dimensions (core address content, stop words, English characters, Chinese numerals and Arabic numerals) are input into the char-RNN, which outputs probability vector 1, probability vector 2, probability vector 3, probability vector 4 and probability vector 5. The five probability vectors, the number of rare words and the length of characters between stop words are then input into the Random Forest model to obtain the classification result of the original address, that is, whether the original address is a real address or a false address.
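The hand-off from the char-RNN to the classifier in Fig. 2 can be sketched as a simple feature concatenation. How the source actually lays out the classifier input is not stated, so the ordering, the dimension names and the function name below are assumptions.

```python
DIMENSIONS = ("target_address", "stop_word", "english", "chinese_numeral", "arabic_numeral")


def build_feature_vector(prob_vectors, rare_word_count, stopword_gap_stat):
    """Concatenate five probability vectors and two statistics into one flat row."""
    features = []
    for name in DIMENSIONS:
        # Fixed dimension order keeps feature positions stable across addresses.
        features.extend(prob_vectors[name])
    features.append(float(rare_word_count))
    features.append(float(stopword_gap_stat))
    return features


# Hypothetical probability vectors of length 3 per dimension.
prob_vectors = {name: [0.1, 0.2, 0.3] for name in DIMENSIONS}
row = build_feature_vector(prob_vectors, rare_word_count=1, stopword_gap_stat=2.0)
```

Because every sequence is adjusted to the same preset length beforehand, each address produces a row of the same width, which is what a tabular classifier such as a random forest expects; one such row per labeled sample would form the supervised training data.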
It can be understood that this scheme combines a preset neural network model obtained through unsupervised training with a preset classification model obtained through supervised training. The address information to be recognized is converted into word segmentation information in the form of text sequences according to the preset classification dimensions, the unsupervised preset neural network model computes association probabilities over the feature information in the address text, and the preset classification model then performs category recognition. Key features that distinguish real addresses from false addresses can thus be extracted, which improves the accuracy of false address recognition.
The embodiment of the application provides an address recognition device 1. As shown in fig. 3, the address recognition apparatus 1 includes:
the extraction unit 10 is configured to extract at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
the prediction unit 11 is configured to sequentially predict association probabilities between one word vector and an adjacent next word vector in each group of word vectors by using a preset neural network model, so as to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
the classification unit 12 is configured to input at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, where the classification result is used to recognize authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
Optionally, the apparatus further comprises: a training unit;
the training unit is used for carrying out unsupervised training on the initial neural network model by utilizing real sample address information to obtain the preset neural network model; processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result; and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
Optionally, the apparatus further comprises: the system comprises a word segmentation unit, a statistic unit and a division unit;
the word segmentation unit is used for segmenting words of the address information to be identified to obtain word segmentation results;
the statistic unit is used for carrying out word frequency statistics on the word segmentation result to obtain a first word segmentation and a second word segmentation; the word frequency of the first participle is higher than a preset word frequency threshold value, and the word frequency of the second participle is lower than the preset word frequency threshold value;
the dividing unit is used for dividing the first word segmentation into at least one group of word segmentation according to the preset classification dimension, and performing word vector extraction on the at least one group of word segmentation to obtain at least one group of word vectors;
the counting unit is further configured to count the number of the second participles, where the number of the second participles is one of the at least one word counting number.
Optionally, the preset classification dimensions include target addresses, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; the target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals, and Arabic numerals; the rare words are address information whose word frequency is lower than the preset word frequency threshold; the stop words are words characterizing administrative divisions.
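As a rough illustration of how frequent segmented words might be assigned to these dimensions, the sketch below groups tokens with simple rules. The stop-word set, the numeral set, and the function names are assumptions made for this example; the source does not list the actual stop words or the matching rules.

```python
import re

# Hypothetical examples of words characterizing administrative divisions.
STOP_WORDS = {"省", "市", "区", "县", "镇", "村"}
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万")

def classify_token(token):
    """Map one segmented word to a classification dimension."""
    if token in STOP_WORDS:
        return "stop_word"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "english"
    if re.fullmatch(r"[0-9]+", token):
        return "arabic_numeral"
    if token and all(ch in CHINESE_NUMERALS for ch in token):
        return "chinese_numeral"
    return "target_address"  # everything else that passed the frequency filter

def group_tokens(tokens):
    """Group tokens by dimension, keeping their original order inside each group."""
    groups = {}
    for token in tokens:
        groups.setdefault(classify_token(token), []).append(token)
    return groups

groups = group_tokens(["北京", "市", "海淀", "区", "A", "12", "三"])
```

Keeping the original order inside each group matters because the later prediction step works on adjacent elements within a group.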
Optionally, the apparatus further comprises: a searching unit and a screening unit;
the searching unit is used for searching, from the address information to be recognized, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts;
the screening unit is used for screening a preset number of character counts from the group of character counts and processing the preset number of character counts to obtain one statistical value; the one statistical value is one of the at least one word statistical quantity.
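The searching and screening steps can be sketched as below. The source does not state which counts are screened or how they are processed, so this example assumes single-character stop words, keeps the `k` largest gaps, and averages them; those choices and the function names are hypothetical.

```python
def stop_word_gaps(address, stop_words):
    """Number of characters between each pair of adjacent stop words."""
    positions = [i for i, ch in enumerate(address) if ch in stop_words]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def gap_statistic(gaps, k=3):
    """Average of the k largest gaps (0.0 when there are no gaps)."""
    if not gaps:
        return 0.0
    top = sorted(gaps, reverse=True)[:k]
    return sum(top) / len(top)

gaps = stop_word_gaps("浙江省杭州市西湖区文三路", {"省", "市", "区"})
stat = gap_statistic(gaps)
```

An unusually long run of characters between two administrative markers would push this value up, which is one plausible reason to feed it to the classifier.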
Optionally, the preset neural network model is a character-level recurrent neural network Char-RNN model.
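For orientation, a character-level recurrent neural network is usually formulated as below. This is the textbook form, offered as an assumption about what the "association probability" corresponds to; the source does not give the network equations, and the symbols are introduced only for this illustration.

```latex
h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \qquad
p\left(x_{t+1} \mid x_1, \dots, x_t\right) = \operatorname{softmax}\left(W_{hy}\, h_t + b_y\right)
```

Under this reading, the association probability between an element and its next adjacent element is the entry of the softmax output indexed by the element actually observed at position t+1.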
Optionally, the preset classification model is a random forest model.
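To make the ensemble idea concrete without external libraries, the sketch below trains a bagged ensemble of depth-one decision trees (stumps) and predicts by majority vote. It is a simplified stand-in: a full random forest also subsamples features and grows deeper trees. The feature layout (mean association probability, rare word count), the data, and all names are hypothetical.

```python
import random

def fit_stump(X, y):
    """Find the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (True, False):
                # The stump predicts True when (row[j] >= t) equals polarity.
                errors = sum(((row[j] >= t) == polarity) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def predict(forest, row):
    """Majority vote over all stumps; True means 'real address'."""
    votes = sum((row[j] >= t) == polarity for j, t, polarity in forest)
    return votes * 2 > len(forest)

# Hypothetical features: [mean association probability, rare word count].
X = [[0.40, 0], [0.35, 1], [0.38, 0], [0.05, 4], [0.08, 3], [0.03, 5]]
y = [True, True, True, False, False, False]
forest = fit_forest(X, y)
```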
According to the address recognition apparatus provided by the embodiment of the application, at least one group of word vectors and at least one word statistical quantity are extracted from the address information to be recognized according to the preset classification dimension; the association probability between each word vector and the next adjacent word vector in each group of word vectors is predicted in sequence by using the preset neural network model, to obtain a group of association probabilities corresponding to each group of word vectors, the preset neural network model being a model obtained through unsupervised training; the at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity are input into the preset classification model to obtain a classification result corresponding to the address information to be recognized, the classification result being used for recognizing the authenticity of the address information to be recognized, and the preset classification model being a model obtained through supervised training. The address recognition apparatus provided by this embodiment therefore combines a preset neural network model obtained by unsupervised training with a preset classification model obtained by supervised training: it converts the address information to be recognized into word segmentation information in text-sequence form according to the preset classification dimension, calculates the association probabilities of the feature information in the address text with the unsupervised preset neural network model, and performs category recognition with the preset classification model. In this way, key features that distinguish real addresses from false addresses can be extracted, which improves the accuracy of false address recognition.
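One practical detail the description leaves open is how variable-length groups of association probabilities become a fixed-length classifier input. The sketch below shows one possible approach, summarizing each group by its mean and minimum and appending the word statistical quantities; the summary choice, group names, and function names are assumptions for illustration only.

```python
GROUPS = ("target_address", "stop_word", "english",
          "chinese_numeral", "arabic_numeral")

def summarize(probs):
    """Collapse a variable-length list of probabilities into (mean, min)."""
    if not probs:
        return (0.0, 0.0)
    return (sum(probs) / len(probs), min(probs))

def build_features(group_probs, rare_count, gap_stat):
    """Build a fixed-length feature list: two values per group plus two statistics."""
    features = []
    for name in GROUPS:
        features.extend(summarize(group_probs.get(name, [])))
    features.append(float(rare_count))
    features.append(float(gap_stat))
    return features

features = build_features(
    {"target_address": [0.5, 0.25], "stop_word": [0.75]},
    rare_count=1,
    gap_stat=2.0,
)
```

The minimum is included because a single implausible transition can be more telling than the average over the whole group.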
Fig. 4 is a schematic diagram of the composition structure of an address recognition apparatus 1 according to an embodiment of the present application. In practical applications, based on the same inventive concept as the foregoing embodiments, as shown in Fig. 4, the address recognition apparatus 1 of this embodiment includes: a processor 13, a memory 14, and a communication bus 15.
In a specific embodiment, the extraction unit 10, the prediction unit 11, the classification unit 12, the training unit, the word segmentation unit, the statistical unit, the dividing unit, the searching unit, and the screening unit may be implemented by the processor 13 located on the address recognition apparatus 1. The processor 13 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a CPU, a controller, a microcontroller, and a microprocessor. It is understood that other electronic devices may also be used to implement the above processor functions, and this embodiment is not specifically limited in this respect.
In the embodiment of the present application, the communication bus 15 is used for realizing connection communication between the processor 13 and the memory 14; the processor 13 implements the following address recognition method when executing the execution program stored in the memory 14:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
Further, the processor 13 is further configured to perform unsupervised training on the initial neural network model by using the real sample address information to obtain the preset neural network model; processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result; and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
Further, the processor 13 is further configured to perform word segmentation on the address information to be recognized to obtain a word segmentation result; perform word frequency statistics on the word segmentation result to obtain first segmented words and second segmented words, where the word frequency of the first segmented words is higher than a preset word frequency threshold and the word frequency of the second segmented words is lower than the preset word frequency threshold; divide the first segmented words into at least one group of segmented words according to the preset classification dimension, and perform word vector extraction on the at least one group of segmented words to obtain the at least one group of word vectors; and count the number of the second segmented words, where the number of the second segmented words is one of the at least one word statistical quantity.
Further, the preset classification dimensions include target addresses, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; the target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals, and Arabic numerals; the rare words are address information whose word frequency is lower than the preset word frequency threshold; the stop words are words characterizing administrative divisions.
Further, the processor 13 is further configured to search, from the address information to be recognized, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts; screen a preset number of character counts from the group of character counts, and process the preset number of character counts to obtain one statistical value; the one statistical value is one of the at least one word statistical quantity.
Further, the preset neural network model is a character-level recurrent neural network Char-RNN model.
Further, the preset classification model is a random forest model.
The embodiment of the application provides a storage medium on which a computer program is stored. The computer-readable storage medium stores one or more programs that are executable by one or more processors and applied to an address recognition apparatus, and the computer program, when executed, implements the address recognition method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling an image display device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present disclosure.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.
Claims (10)
1. An address identification method, the method comprising:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
2. The method according to claim 1, wherein before extracting the at least one group of word vectors and the at least one word statistical quantity from the address information to be recognized according to the preset classification dimension, the method further comprises:
carrying out unsupervised training on the initial neural network model by utilizing real sample address information to obtain the preset neural network model;
processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result;
and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
3. The method according to claim 1, wherein before extracting the at least one group of word vectors and the at least one word statistical quantity from the address information to be recognized according to the preset classification dimension, the method further comprises:
performing word segmentation on the address information to be recognized to obtain a word segmentation result;
performing word frequency statistics on the word segmentation result to obtain first segmented words and second segmented words; the word frequency of the first segmented words is higher than a preset word frequency threshold, and the word frequency of the second segmented words is lower than the preset word frequency threshold;
correspondingly, the extracting at least one group of word vectors and at least one word statistical quantity from the address information to be recognized according to the preset classification dimension includes:
dividing the first segmented words into at least one group of segmented words according to the preset classification dimension, and performing word vector extraction on the at least one group of segmented words to obtain the at least one group of word vectors;
and counting the number of the second segmented words, wherein the number of the second segmented words is one of the at least one word statistical quantity.
4. The method of claim 1 or 3, wherein the preset classification dimensions include target addresses, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; the target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals, and Arabic numerals; the rare words are address information whose word frequency is lower than the preset word frequency threshold; the stop words are words characterizing administrative divisions.
5. The method of claim 4, further comprising:
searching, from the address information to be recognized, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts;
screening a preset number of character counts from the group of character counts, and processing the preset number of character counts to obtain one statistical value; the one statistical value is one of the at least one word statistical quantity.
6. The method of claim 1, wherein the preset neural network model is a character-level recurrent neural network (Char-RNN) model.
7. The method of claim 1, wherein the preset classification model is a random forest model.
8. An address identification apparatus, the apparatus comprising:
the extraction unit is used for extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension;
the prediction unit is used for sequentially predicting the association probability between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
the classification unit is used for inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, and the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
9. An address identification apparatus, the apparatus comprising: a processor, a memory, and a communication bus; the processor, when executing the execution program stored in the memory, implements the method of any of claims 1-7.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110856358.4A CN113609290A (en) | 2021-07-28 | 2021-07-28 | Address recognition method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113609290A true CN113609290A (en) | 2021-11-05 |
Family
ID=78338502
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110856358.4A Pending CN113609290A (en) | 2021-07-28 | 2021-07-28 | Address recognition method and device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113609290A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114154501A (en) * | 2022-02-09 | 2022-03-08 | 南京擎天科技有限公司 | Chinese address word segmentation method and system based on unsupervised learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0747334A (en) * | 1993-08-06 | 1995-02-21 | Toshiba Corp | Address recognizing device and address reading and classifying machine |
KR20000001316A (en) * | 1998-06-10 | 2000-01-15 | 윤종용 | Device for processing receipt message of radio paging receiver |
CN107066478A (en) * | 2016-12-14 | 2017-08-18 | 阿里巴巴集团控股有限公司 | A kind of method and device of address dummy information identification |
CN107992501A (en) * | 2016-10-27 | 2018-05-04 | 腾讯科技(深圳)有限公司 | Social network information recognition methods, processing method and processing device |
CN110197284A (en) * | 2019-04-30 | 2019-09-03 | 腾讯科技(深圳)有限公司 | A kind of address dummy recognition methods, device and equipment |
CN110442856A (en) * | 2019-06-14 | 2019-11-12 | 平安科技(深圳)有限公司 | A kind of address information standardized method, device, computer equipment and storage medium |
KR102144044B1 (en) * | 2020-01-21 | 2020-08-12 | 엘아이지넥스원 주식회사 | Apparatus and method for classification of true and false positivies of weapon system software static testing based on machine learning |
Non-Patent Citations (1)
Title |
---|
皮琪; 王文杰; 杨飞; 赵耀: "基于深度学习的虚假评论识别" [Fake review identification based on deep learning], 网络新媒体技术, vol. 5, no. 06, 15 November 2016 (2016-11-15), pages 30 - 33 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114154501A (en) * | 2022-02-09 | 2022-03-08 | 南京擎天科技有限公司 | Chinese address word segmentation method and system based on unsupervised learning |
CN114154501B (en) * | 2022-02-09 | 2022-04-26 | 南京擎天科技有限公司 | Chinese address word segmentation method and system based on unsupervised learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109302410B (en) | Method and system for detecting abnormal behavior of internal user and computer storage medium | |
Yuan et al. | Malicious URL detection based on a parallel neural joint model | |
CN109872162B (en) | Wind control classification and identification method and system for processing user complaint information | |
CN109359439A (en) | Software detecting method, device, equipment and storage medium | |
CN111881983A (en) | Data processing method and device based on classification model, electronic equipment and medium | |
CN111460820A (en) | Network space security domain named entity recognition method and device based on pre-training model BERT | |
CN111915437A (en) | RNN-based anti-money laundering model training method, device, equipment and medium | |
CN111177367B (en) | Case classification method, classification model training method and related products | |
CN112347367A (en) | Information service providing method, information service providing device, electronic equipment and storage medium | |
CN108875727B (en) | The detection method and device of graph-text identification, storage medium, processor | |
CN109165529B (en) | Dark chain tampering detection method and device and computer readable storage medium | |
CN113742733B (en) | Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type | |
CN115473726B (en) | Domain name identification method and device | |
CN112528894A (en) | Method and device for distinguishing difference items | |
CN112733140B (en) | Detection method and system for model inclination attack | |
CN111753290A (en) | Software type detection method and related equipment | |
CN111967503A (en) | Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method | |
CN111488574B (en) | Malicious software classification method, system, computer equipment and storage medium | |
CN115510500A (en) | Sensitive analysis method and system for text content | |
CN115358340A (en) | Credit credit collection short message distinguishing method, system, equipment and storage medium | |
CN113783852B (en) | Intelligent contract Pompe fraudster detection algorithm based on neural network | |
CN113609290A (en) | Address recognition method and device and storage medium | |
Rahman et al. | An efficient deep learning technique for bangla fake news detection | |
CN105808602A (en) | Detection method and device of junk information | |
Paik et al. | Malware family prediction with an awareness of label uncertainty |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||