CN113609290A - Address recognition method and device and storage medium

Info

Publication number
CN113609290A
Authority
CN
China
Prior art keywords
word
group
preset
address information
neural network
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110856358.4A
Other languages
Chinese (zh)
Inventor
林元晟
王仲琪
崔文谦
李若昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN202110856358.4A
Publication of CN113609290A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Finance (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Accounting & Taxation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)

Abstract

An embodiment of the application provides an address identification method and device and a storage medium. The method comprises: extracting at least one group of word vectors and at least one word statistical quantity from address information to be recognized according to preset classification dimensions; using a preset neural network model to predict, in sequence, the association probability between each word vector and the next adjacent word vector in each group of word vectors, obtaining a group of association probabilities corresponding to each group of word vectors, where the preset neural network model is obtained through unsupervised training; and inputting the at least one group of association probabilities and the at least one word statistical quantity into a preset classification model to obtain a classification result for the address information to be recognized, where the classification result is used to determine whether the address information is genuine and the preset classification model is obtained through supervised training.

Description

Address recognition method and device and storage medium
Technical Field
The present application relates to the field of electronic applications, and in particular, to an address identification method and apparatus, and a storage medium.
Background
Online black-market operations (the so-called "black industry") pose a serious threat to e-commerce platforms. Attackers from these operations register accounts on the major e-commerce platforms and use them for fraud, such as helping merchants place fake orders ("order brushing") and inflate their reputation scores. This undermines the rating systems of e-commerce and harms the interests of the platforms, of merchants operating honestly, and of buyers. In the logistics and delivery stage in particular, large volumes of order brushing and malicious ordering disrupt the normal ecosystem of the e-commerce, logistics, and local-services industries. To uncover the address fraud involved, there is a pressing need to identify false addresses in transaction orders.
However, current rule-based risk scoring relies on rules drawn from expert business experience, so it is slow to update, has narrow coverage, and is prone to missed detections. Risk scoring methods based on supervised classification have therefore been proposed. But false addresses change continuously, and real address samples far outnumber false address samples, so these methods cannot accurately identify false addresses in complex detection scenarios.
Disclosure of Invention
The embodiment of the application provides an address identification method and device and a storage medium, which can improve the accuracy of identifying a false address.
The technical solution of the application is implemented as follows:
In a first aspect, an embodiment of the present application provides an address identification method, where the method includes:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
In a second aspect, an embodiment of the present application provides an address identification apparatus, including:
the extraction unit is used for extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension;
the prediction unit is used for sequentially predicting the association probability between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
the classification unit is used for inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, and the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
In a third aspect, an embodiment of the present application provides an address identification apparatus, where the apparatus includes: a processor, a memory, and a communication bus; the processor implements the method as described above when executing the running program stored in the memory.
In a fourth aspect, an embodiment of the present application provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method as described above.
The embodiment of the application provides an address identification method and device and a storage medium, wherein the method comprises the following steps: extracting at least one group of word vectors and at least one word statistical quantity from the address information to be recognized according to preset classification dimensions; sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors, the preset neural network model being a model obtained through unsupervised training; and inputting the at least one group of association probabilities and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized and the preset classification model is a model obtained through supervised training. This scheme combines a preset neural network model obtained by unsupervised training with a preset classification model obtained by supervised training. The address information to be recognized is converted, according to the preset classification dimensions, into segmented words in the form of text sequences; the unsupervised neural network model computes association probabilities over the feature information in the address text; and the preset classification model then performs category recognition. In this way, key features that distinguish real addresses from false addresses can be extracted, which improves the accuracy of false address recognition.
Drawings
Fig. 1 is a flowchart of an address identification method according to an embodiment of the present application;
Fig. 2 is a block diagram illustrating an exemplary address identification process provided by an embodiment of the present application;
Fig. 3 is a first schematic structural diagram of an address identification apparatus according to an embodiment of the present application;
Fig. 4 is a second schematic structural diagram of an address identification apparatus according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
An embodiment of the present application provides an address identification method, as shown in fig. 1, the method may include:
S101, extracting at least one group of word vectors and at least one word statistical quantity from the address information to be recognized according to preset classification dimensions.
The address identification method provided by the embodiment of the application is suitable for a scene of carrying out false address identification on addresses in trade orders on an e-commerce platform, a logistics distribution platform and a local life platform.
In the embodiment of the present application, the device for performing address recognition may be any device having data processing and storing functions, for example: tablet computers, mobile phones, Personal Computers (PCs), notebook computers, wearable devices, and the like.
In the embodiment of the application, the address information to be recognized is obtained from a transaction order and first undergoes data preprocessing such as word segmentation and word frequency statistics. Specifically, the address information to be recognized is segmented into words to obtain a word segmentation result, and word frequency statistics are performed on the word segmentation result to obtain first segmented words and second segmented words. The word frequency of a first segmented word is higher than a preset word frequency threshold, and the word frequency of a second segmented word is lower than the preset word frequency threshold. Then, at least one group of word vectors and at least one word statistical quantity are extracted from the preprocessed address information according to the preset classification dimensions.
It should be noted that, in the embodiment of the present application, the address information to be recognized may be segmented with a tool such as jieba. The specific word segmentation method may be chosen according to the actual situation and is not specifically limited in the embodiment of the present application.
Specifically, the address recognition device divides the first segmented words into at least one group of segmented words according to the preset classification dimensions and extracts word vectors from each group to obtain at least one group of word vectors. The address recognition device also counts the second segmented words; this count is one of the at least one word statistical quantity.
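The frequency-based split described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_by_frequency`, the toy corpus, the choice to count frequencies over a corpus of already-segmented addresses, and the treatment of a frequency exactly equal to the threshold as "low" are all assumptions.

```python
from collections import Counter

def split_by_frequency(tokens, corpus_counts, threshold):
    """Split tokens into high-frequency ("first") and low-frequency ("second") words.

    Assumption: a token whose count equals the threshold is treated as low-frequency.
    """
    first, second = [], []
    for tok in tokens:
        if corpus_counts.get(tok, 0) > threshold:
            first.append(tok)
        else:
            second.append(tok)
    return first, second

# Hypothetical corpus of addresses that have already been segmented into words.
corpus = [
    ["北京", "市", "海淀", "区", "花园", "路", "5", "号"],
    ["北京", "市", "朝阳", "区", "花园", "路", "8", "号"],
    ["北京", "市", "海淀", "区", "xqz", "路", "3", "号"],
]
corpus_counts = Counter(tok for addr in corpus for tok in addr)

first_tokens, second_tokens = split_by_frequency(corpus[2], corpus_counts, threshold=1)
rare_word_count = len(second_tokens)  # one of the word statistical quantities
```

Here only the high-frequency words would go on to word vector extraction, while the low-frequency words contribute just their count.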
It should be noted that, in the embodiment of the present application, word vectors may be extracted for the first segmented words using Sogou pre-trained word vectors. The specific word vector extraction method may be chosen according to the actual situation and is not specifically limited in the embodiment of the present application.
It can be understood that word vectors are extracted only for the first segmented words, whose word frequency is higher than the preset word frequency threshold, before they are input into the preset neural network model for probability calculation. This reduces the vocabulary dimension of the preset neural network model, greatly reduces the amount of data it processes, and increases its processing speed.
In the embodiment of the application, the preset classification dimensions comprise seven dimensions: target address, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words. The target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals, and Arabic numerals. Rare words are address information whose word frequency is lower than the preset word frequency threshold. Stop words are words that denote administrative divisions.
Illustratively, stop words are administrative-division words such as province, city, district, county, street, and residential community; English characters are characters such as a, b, and c; Chinese numerals are numerals such as 一 (one), 二 (two), and 三 (three); Arabic numerals are 1, 2, 3, and so on.
It should be noted that, for the five classification dimensions of target address, stop words, English characters, Chinese numerals, and Arabic numerals, it is the first segmented words that are input into the preset neural network model for probability calculation. The first segmented words can therefore be divided into five groups of segmented words, one for each of these five dimensions.
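As a rough illustration of dividing high-frequency words into the five groups, the sketch below assigns each word to one dimension with simple rules. The stop-word set, the Chinese numeral set, the rule order, and the names `classify_token` and `group_tokens` are hypothetical; the source does not specify how the assignment is carried out.

```python
import re

# Illustrative subsets only; a real system would use fuller lists.
STOP_WORDS = {"省", "市", "区", "县", "街道", "小区"}
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万")

def classify_token(token):
    """Return the classification dimension for one high-frequency word."""
    if token in STOP_WORDS:
        return "stop_word"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "english"
    if re.fullmatch(r"[0-9]+", token):
        return "arabic_numeral"
    if token and all(ch in CHINESE_NUMERALS for ch in token):
        return "chinese_numeral"
    return "target_address"  # everything else counts as core address content

def group_tokens(tokens):
    """Divide words into five ordered groups, one per dimension."""
    names = ("target_address", "stop_word", "english",
             "chinese_numeral", "arabic_numeral")
    groups = {name: [] for name in names}
    for tok in tokens:
        groups[classify_token(tok)].append(tok)
    return groups

groups = group_tokens(["北京", "市", "海淀", "区", "花园", "路", "三", "号", "B", "12"])
```

Each group keeps the original word order, so it can be treated as its own text sequence in the later probability step.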
Further, in the embodiment of the application, the number of characters between every two adjacent stop words in the group corresponding to stop words is looked up in the address information to be recognized, giving a group of character counts. A preset number of character counts is selected from this group and processed to obtain one statistical value, which is one of the at least one word statistical quantity.
It follows that, for the seven classification dimensions, the at least one group of word vectors may include five groups of word vectors, corresponding to the five dimensions of target address, stop words, English characters, Chinese numerals, and Arabic numerals, and the at least one word statistical quantity may include two quantities, namely the number of second segmented words and the statistical value, corresponding to the remaining two dimensions.
In this embodiment of the application, the preset number may be three, and the selection may proceed as follows: the address recognition device sorts the group of character counts in descending order, takes the top three character counts, and weights them with given weight values to obtain the final value.
For example, the weight values corresponding to the top three character counts may be 0.2, 0.7, and 0.1.
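The statistic over characters between stop words might be computed along the following lines. This sketch assumes the weighting is a weighted sum and that fewer than three gaps simply use the leading weights; both are assumptions, as are the function names and the sample address.

```python
def stop_word_gaps(tokens, stop_words):
    """Count the characters lying between each pair of consecutive stop words."""
    gaps, current, seen_stop = [], 0, False
    for tok in tokens:
        if tok in stop_words:
            if seen_stop:
                gaps.append(current)
            seen_stop = True
            current = 0
        else:
            current += len(tok)
    return gaps

def weighted_top_k(values, weights=(0.2, 0.7, 0.1)):
    """Sort descending, keep the top len(weights) values, return their weighted sum."""
    top = sorted(values, reverse=True)[:len(weights)]
    return sum(w * v for w, v in zip(weights, top))

# Hypothetical segmented address with four stop words.
tokens = ["北京", "市", "海淀", "区", "花园小", "街道", "幸福家园", "小区"]
gaps = stop_word_gaps(tokens, {"市", "区", "街道", "小区"})
stat = weighted_top_k(gaps)  # 0.2 * 4 + 0.7 * 3 + 0.1 * 2
```

The resulting single number would serve as the second word statistical quantity alongside the rare-word count.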
S102, sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training.
In the embodiment of the application, the address recognition device inputs at least one group of word vectors into the preset neural network model, and predicts the association relationship between every two adjacent word vectors by using the preset neural network model aiming at each group of word vectors to obtain a group of association probabilities corresponding to each group of word vectors.
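To make the idea of one association probability per adjacent pair concrete, the sketch below uses a smoothed bigram count model as a deliberately simple stand-in for the neural network. It is not the model described in this application; the names `train_bigram` and `association_probabilities`, the add-one smoothing, and the sample addresses are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count next-word occurrences from real address sequences (no labels needed)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def association_probabilities(seq, counts, vocab_size):
    """Return P(next | previous) for every adjacent pair, with add-one smoothing."""
    probs = []
    for prev, nxt in zip(seq, seq[1:]):
        followers = counts.get(prev, Counter())
        total = sum(followers.values())
        probs.append((followers[nxt] + 1) / (total + vocab_size))
    return probs

real_addresses = [
    ["北京", "海淀", "花园"],
    ["北京", "海淀", "中关村"],
    ["北京", "朝阳", "花园"],
]
vocab = {tok for seq in real_addresses for tok in seq}
counts = train_bigram(real_addresses)

plausible = association_probabilities(["北京", "海淀", "花园"], counts, len(vocab))
unusual = association_probabilities(["花园", "北京", "中关村"], counts, len(vocab))
```

A sequence of n words yields n - 1 probabilities, and word orders rarely seen in real addresses receive lower values, which is the signal the downstream classifier consumes.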
It should be noted that the preset neural network model has a corresponding preset input length. Before inputting the at least one group of word vectors into the preset neural network model, the address recognition device first adjusts the length of each group of word vectors to the preset length.
Illustratively, the preset length is 20: a group of word vectors longer than 20 is truncated to 20, and a group shorter than 20 is padded to 20 by appending placeholders at the end.
It should be noted that the truncation may either discard the part beyond 20 or move it into another group of word vectors. The specific truncation operation may be chosen according to the actual situation and is not specifically limited in the embodiment of the present application.
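The length adjustment can be sketched as below, using the "discard the excess" variant of truncation. The placeholder id `PAD_ID = 0` and the function name `fit_length` are hypothetical choices for illustration.

```python
PAD_ID = 0  # hypothetical placeholder id

def fit_length(ids, length=20, pad_id=PAD_ID):
    """Truncate to `length`, or pad at the end with `pad_id` up to `length`."""
    if len(ids) >= length:
        return ids[:length]  # discard the part beyond the preset length
    return ids + [pad_id] * (length - len(ids))

short = fit_length([34, 56, 199])
long = fit_length(list(range(1, 26)))
```

Moving the excess into another group, the other option mentioned above, would instead return both the first 20 ids and the remainder.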
In the embodiment of the application, the preset neural network model is a char-RNN network, wherein the char-RNN network comprises an embedding layer, an RNN neural network layer and a linear mapping layer.
In the embodiment of the present application, the at least one group of word vectors is input into the embedding layer. The input may take the form of vocabulary indices; for example, [34, 56, 199, 500, 500] may represent [Beijing, Haidian, garden, placeholder, placeholder]. The embedding layer maps a [b, s] matrix to a [b, s, l] matrix, where b is the batch size, s is the sentence length, and l is the word vector length.
In the embodiment of the application, the RNN neural network layer consists of two unidirectional LSTM layers, each with 128 neurons.
In the embodiment of the application, the input dimension of the linear mapping layer is 128. The linear mapping layer maps the 128 dimensions to the vocabulary dimension, and a softmax function then determines which word in the vocabulary the output is mapped to.
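The final step, a linear mapping followed by softmax, can be illustrated in plain Python on toy sizes. The hidden size of 4 (standing in for 128), the vocabulary size of 3, and all weight values are made up for the example; a real model would learn the weights and would normally use a deep learning framework.

```python
import math

def linear(hidden, weights, bias):
    """Map a hidden vector to one logit per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits into a probability distribution (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, 0.0, -1.0, 0.5]       # toy LSTM output for one position
weights = [[0.5, 0.0, 0.0, 0.0],     # one row per vocabulary entry
           [0.0, 0.0, 1.0, 0.0],
           [1.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = softmax(linear(hidden, weights, bias))
predicted_id = probs.index(max(probs))
```

The probability assigned to the word that actually comes next is what serves as the association probability for that position.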
In the embodiment of the application, the preset neural network model is obtained through unsupervised training, and specifically, the initial neural network model is subjected to unsupervised training by using real sample address information to obtain the preset neural network model.
It should be noted that the real sample address information is divided into a training set and a validation set. The initial neural network model is trained on the training set, and the performance of the trained neural network model is checked on the validation set. Because the preset neural network model is obtained by unsupervised training, the indicator of whether its performance has improved is the loss value on the validation set: a loss function measures the loss on the probability of going from a previous word to the next word, and a decrease in the validation loss indicates that the performance of the preset neural network model has improved.
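The source does not name the loss function. One common choice for next-word prediction is the average negative log-likelihood (cross-entropy) of the true next word, and the sketch below assumes that choice; the function name and the sample probabilities are illustrative only.

```python
import math

def mean_negative_log_likelihood(prob_groups):
    """Average -log(p) over all next-word probabilities in a validation set."""
    probs = [p for group in prob_groups for p in group]
    return -sum(math.log(p) for p in probs) / len(probs)

# Hypothetical next-word probabilities on the same validation addresses
# at an early and a later training checkpoint.
early_checkpoint = [[0.20, 0.125], [0.10, 0.30]]
later_checkpoint = [[0.375, 0.285], [0.40, 0.50]]

loss_early = mean_negative_log_likelihood(early_checkpoint)
loss_later = mean_negative_log_likelihood(later_checkpoint)
improved = loss_later < loss_early
```

Under this assumption, higher probabilities on real addresses translate directly into a lower validation loss, so no false-address labels are needed to track progress.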
S103, inputting at least one group of association probabilities corresponding to at least one group of word vectors and at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, wherein the classification result is used for recognizing the authenticity of the address information to be recognized; the preset classification model is a model obtained through supervised training.
In the embodiment of the application, the preset classification model is a Random Forest model.
In the embodiment of the application, the classification result output by the preset classification model for the at least one group of association probabilities and the at least one word statistical quantity is either a label (false address or real address) or a score. In the latter case, the address recognition device judges the risk level of the address information to be recognized from the score and then determines whether the address information to be recognized is a real address or a false address.
In the embodiment of the application, the preset classification model is trained in a supervised manner. Specifically, real sample address information and false sample address information are processed by the preset neural network model to obtain sample output results; the sample output results are then divided into a training set and a test set, and supervised training is performed on the initial classification model based on the sample output results to obtain the preset classification model.
In the embodiment of the application, the training set is input into the initial classification model for training, the trained classification model is then tested on the test set, and the preset classification model is finally obtained.
Illustratively, as shown in Fig. 2, the original address first undergoes data preprocessing and is then split along seven dimensions: core address content, stop words such as province, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the length of characters between stop words. The segmented words of the first five dimensions (core address content, stop words such as province, English characters, Chinese numerals, and Arabic numerals) are input into the char-RNN, which outputs probability vector 1, probability vector 2, probability vector 3, probability vector 4, and probability vector 5. These five probability vectors, together with the number of rare words and the length of characters between stop words, are input into the Random Forest model to obtain the classification result of the original address, that is, whether the original address is a real address or a false address.
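One plausible way to feed these outputs to a classifier is to concatenate the five probability vectors and the two statistics into a single flat feature vector. The source does not describe the exact layout, so the ordering, the function name `build_features`, and the toy vector length of 3 are assumptions.

```python
def build_features(prob_vectors, rare_word_count, stop_gap_stat):
    """Concatenate per-dimension probability vectors and the two statistics."""
    features = []
    for vec in prob_vectors:  # one vector per sequence dimension
        features.extend(vec)
    features.append(float(rare_word_count))
    features.append(float(stop_gap_stat))
    return features

# Toy probability vectors of equal length for the five sequence dimensions.
prob_vectors = [
    [0.375, 0.285, 0.9],  # core address content
    [0.8, 0.7, 0.9],      # stop words
    [0.5, 0.9, 0.9],      # English characters
    [0.6, 0.9, 0.9],      # Chinese numerals
    [0.4, 0.3, 0.9],      # Arabic numerals
]
features = build_features(prob_vectors, rare_word_count=2, stop_gap_stat=3.1)
```

Because every group is padded or truncated to the same preset length, the feature vector has a fixed size and could be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`, trained on labelled real and false addresses.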
It can be understood that this approach combines a preset neural network model obtained by unsupervised training with a preset classification model obtained by supervised training. The address information to be recognized is converted, according to the preset classification dimensions, into segmented words in the form of text sequences; the unsupervised neural network model computes association probabilities over the feature information in the address text; and the preset classification model then performs category recognition. In this way, key features that distinguish real addresses from false addresses can be extracted, which improves the accuracy of false address recognition.
The embodiment of the application provides an address recognition device 1. As shown in fig. 3, the address recognition apparatus 1 includes:
the extraction unit 10 is configured to extract at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
the prediction unit 11 is configured to sequentially predict association probabilities between one word vector and an adjacent next word vector in each group of word vectors by using a preset neural network model, so as to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
the classification unit 12 is configured to input at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be recognized, where the classification result is used to recognize authenticity of the address information to be recognized; the preset classification model is obtained through supervised training.
Optionally, the apparatus further comprises: a training unit;
the training unit is used for carrying out unsupervised training on the initial neural network model by utilizing real sample address information to obtain the preset neural network model; processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result; and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
Optionally, the apparatus further comprises: a word segmentation unit, a statistics unit, and a dividing unit;
the word segmentation unit is used for segmenting the address information to be recognized into words to obtain a word segmentation result;
the statistics unit is used for performing word frequency statistics on the word segmentation result to obtain first segmented words and second segmented words; the word frequency of a first segmented word is higher than a preset word frequency threshold, and the word frequency of a second segmented word is lower than the preset word frequency threshold;
the dividing unit is used for dividing the first segmented words into at least one group of segmented words according to the preset classification dimensions, and performing word vector extraction on the at least one group of segmented words to obtain the at least one group of word vectors;
the statistics unit is further configured to count the number of the second segmented words, where the number of the second segmented words is one of the at least one word statistical quantity.
Optionally, the preset classification dimensions include target address, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; the target address is address information whose word frequency is higher than the preset word frequency threshold, excluding stop words, English characters, Chinese numerals, and Arabic numerals; the rare words are address information whose word frequency is lower than the preset word frequency threshold; the stop words are words that denote administrative divisions.
Optionally, the apparatus further comprises: a searching unit and a screening unit;
the searching unit is configured to find, in the address information to be identified, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts;
the screening unit is configured to select a preset number of character counts from the group of character counts and to process the selected character counts to obtain one statistical value; the statistical value is one of the at least one word statistical quantity.
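The searching and screening steps above may be sketched as follows. This is only an illustration: it assumes single-character stop words, and both the selection rule (keep the largest counts) and the processing rule (arithmetic mean) are assumptions, since the passage above does not fix either. The function names and the `keep` parameter are hypothetical.

```python
from statistics import mean

def gaps_between_stop_words(address, stop_words):
    """Number of characters between each pair of adjacent stop words."""
    positions = [i for i, ch in enumerate(address) if ch in stop_words]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def gap_statistic(gaps, keep=2):
    """Select `keep` character counts (here: the largest) and average them."""
    selected = sorted(gaps, reverse=True)[:keep]
    return mean(selected) if selected else 0.0

stop_words = {"省", "市", "区"}

# A regular address: two characters between each pair of adjacent stop words.
regular_gaps = gaps_between_stop_words("浙江省杭州市西湖区文三路", stop_words)

# An irregular string: stop words bunched together, then a long filler run.
odd_gaps = gaps_between_stop_words("省省省啊啊啊啊啊啊啊啊市", stop_words)
odd_value = gap_statistic(odd_gaps, keep=2)
```

Unusually long or short spacing between administrative-division words is the kind of signal such a statistic is meant to expose to the classifier.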
Optionally, the preset neural network model is a character-level recurrent neural network Char-RNN model.
Optionally, the preset classification model is a random forest model.
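To make the idea of a random forest concrete, the sketch below builds a deliberately reduced variant using only the standard library: each tree is a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, and the forest predicts by majority vote. A full random forest grows deeper trees and samples features at every split, and in practice a library implementation would normally be used. The feature layout (mean association probability, rare-word count), the training rows, and all names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= t]
        right = [label for row, label in zip(X, y) if row[feature] > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(v != left_label for v in left)
                  + sum(v != right_label for v in right))
        if best is None or errors < best[0]:
            best = (errors, feature, t, left_label, right_label)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Bagging: every stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical rows: [mean association probability, rare-word count].
X = [[0.82, 0], [0.75, 1], [0.90, 0], [0.10, 4], [0.15, 3], [0.05, 5]]
y = [1, 1, 1, 0, 0, 0]  # 1 = real address, 0 = false address

forest = fit_forest(X, y)
real_label = predict(forest, [0.90, 0])
false_label = predict(forest, [0.05, 5])
```

Voting across many weak trees is what makes the ensemble tolerant of an individual tree that happened to see an unrepresentative bootstrap sample.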
According to the address recognition apparatus provided by the embodiment of the present application, at least one group of word vectors and at least one word statistical quantity are extracted from address information to be identified according to a preset classification dimension; association probabilities between each word vector and the next adjacent word vector in each group of word vectors are predicted in sequence using a preset neural network model, to obtain a group of association probabilities corresponding to each group of word vectors, the preset neural network model being a model obtained through unsupervised training; and the at least one group of association probabilities corresponding to the at least one group of word vectors, together with the at least one word statistical quantity, are input into a preset classification model to obtain a classification result corresponding to the address information to be identified, the classification result being used to identify the authenticity of the address information to be identified, and the preset classification model being a model obtained through supervised training. The address recognition apparatus provided by this embodiment therefore combines a preset neural network model obtained through unsupervised training with a preset classification model obtained through supervised training: it converts the address information to be identified into word segmentation information in the form of a text sequence according to the preset classification dimension, calculates the association probabilities of the feature information in the address text using the preset neural network model, and performs category recognition using the preset classification model. In this way, key features that distinguish true addresses from false addresses can be extracted, which improves the accuracy of false address recognition.
Fig. 4 is a schematic diagram of the composition structure of an address recognition apparatus 1 according to an embodiment of the present application. In practical applications, based on the same inventive concept as the foregoing embodiments, as shown in Fig. 4, the address recognition apparatus 1 of this embodiment includes: a processor 13, a memory 14, and a communication bus 15.
In a specific embodiment, the extraction unit 10, the prediction unit 11, the classification unit 12, the training unit, the word segmentation unit, the statistics unit, the dividing unit, the searching unit, and the screening unit may be implemented by the processor 13 located on the address recognition apparatus 1, and the processor 13 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a CPU, a controller, a microcontroller, and a microprocessor. It is understood that other electronic devices may also be used to implement the above processor functions, and this embodiment is not specifically limited in this respect.
In the embodiment of the present application, the communication bus 15 is used for realizing connection communication between the processor 13 and the memory 14; the processor 13 implements the following address recognition method when executing the execution program stored in the memory 14:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be identified, wherein the classification result is used for identifying the authenticity of the address information to be identified; the preset classification model is obtained through supervised training.
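The prediction and input-assembly steps of the method above may be illustrated as follows. This is a sketch under stated assumptions: plain tokens stand in for word vectors, a lookup table stands in for the trained neural network model, and the padding scheme (fixed width per group, padded with a constant) is an assumption introduced here because tabular classifiers generally expect inputs of fixed length. All names and values are hypothetical.

```python
def association_probs(group, next_prob):
    """One probability per adjacent pair, predicted in sequence."""
    return [next_prob(a, b) for a, b in zip(group, group[1:])]

def build_features(groups, stats, next_prob, dims, width=3, pad=1.0):
    """Concatenate per-group association probabilities and word statistics.

    Each group contributes exactly `width` values: longer groups are
    truncated and shorter ones are padded with `pad` (an assumption).
    """
    features = []
    for dim in dims:
        probs = association_probs(groups.get(dim, []), next_prob)[:width]
        features.extend(probs + [pad] * (width - len(probs)))
    features.extend(stats)
    return features

# Toy transition table standing in for the trained model's predictions.
TOY_TABLE = {("北京", "朝阳"): 0.8, ("朝阳", "建国路"): 0.6}

def toy_next_prob(a, b):
    return TOY_TABLE.get((a, b), 0.01)

groups = {"target_address": ["北京", "朝阳", "建国路"], "stop_word": ["市", "区"]}
stats = [1, 2.0]  # e.g. a rare-word count and a stop-word gap statistic

features = build_features(groups, stats, toy_next_prob,
                          dims=["target_address", "stop_word"])
```

The resulting `features` list is what would be passed to the classification model to obtain the classification result.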
Further, the processor 13 is further configured to perform unsupervised training on the initial neural network model by using the real sample address information to obtain the preset neural network model; processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result; and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
Further, the processor 13 is further configured to perform word segmentation on the address information to be identified to obtain a word segmentation result; perform word frequency statistics on the word segmentation result to obtain first segmented words and second segmented words, the word frequency of the first segmented words being higher than a preset word frequency threshold and the word frequency of the second segmented words being lower than the preset word frequency threshold; divide the first segmented words into at least one group of segmented words according to the preset classification dimension, and extract word vectors from the at least one group of segmented words to obtain the at least one group of word vectors; and count the number of the second segmented words, wherein the number of the second segmented words is one of the at least one word statistical quantity.
Further, the preset classification dimensions include target addresses, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; a target address is address information whose word frequency is higher than the preset word frequency threshold and which is not a stop word, an English character, a Chinese numeral, or an Arabic numeral; a rare word is address information whose word frequency is lower than the preset word frequency threshold; and a stop word is a word characterizing an administrative division.
Further, the processor 13 is further configured to find, in the address information to be identified, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts; and to select a preset number of character counts from the group of character counts and process the selected character counts to obtain one statistical value; the statistical value is one of the at least one word statistical quantity.
Further, the preset neural network model is a character-level recurrent neural network Char-RNN model.
Further, the preset classification model is a random forest model.
The embodiment of the present application provides a storage medium on which a computer program is stored. The computer-readable storage medium stores one or more programs that are executable by one or more processors and applied to an address recognition apparatus, and the computer program, when executed, implements the address recognition method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling an image display device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present disclosure.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. An address identification method, the method comprising:
extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
sequentially predicting association probabilities between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
inputting at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be identified, wherein the classification result is used for identifying the authenticity of the address information to be identified; the preset classification model is obtained through supervised training.
2. The method according to claim 1, wherein before extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension, the method further comprises:
carrying out unsupervised training on the initial neural network model by utilizing real sample address information to obtain the preset neural network model;
processing the real sample address information and the false sample address information based on the preset neural network model to obtain a sample output result;
and carrying out supervised training on the initial classification model based on the sample output result to obtain the preset classification model.
3. The method according to claim 1, wherein before extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension, the method further comprises:
performing word segmentation on the address information to be identified to obtain a word segmentation result;
performing word frequency statistics on the word segmentation result to obtain first segmented words and second segmented words; the word frequency of the first segmented words is higher than a preset word frequency threshold, and the word frequency of the second segmented words is lower than the preset word frequency threshold;
correspondingly, the extracting at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to the preset classification dimension comprises:
dividing the first segmented words into at least one group of segmented words according to the preset classification dimension, and extracting word vectors from the at least one group of segmented words to obtain the at least one group of word vectors;
and counting the number of the second segmented words, wherein the number of the second segmented words is one of the at least one word statistical quantity.
4. The method of claim 1 or 3, wherein the preset classification dimensions include target addresses, stop words, English characters, Chinese numerals, Arabic numerals, the number of rare words, and the number of characters between stop words; a target address is address information whose word frequency is higher than the preset word frequency threshold and which is not a stop word, an English character, a Chinese numeral, or an Arabic numeral; a rare word is address information whose word frequency is lower than the preset word frequency threshold; and a stop word is a word characterizing an administrative division.
5. The method of claim 4, further comprising:
finding, in the address information to be identified, the number of characters between every two adjacent word vectors in the group of word vectors corresponding to the stop words, to obtain a group of character counts;
selecting a preset number of character counts from the group of character counts, and processing the selected character counts to obtain one statistical value; the statistical value is one of the at least one word statistical quantity.
6. The method of claim 1, wherein the preset neural network model is a character-level recurrent neural network (Char-RNN) model.
7. The method of claim 1, wherein the preset classification model is a random forest model.
8. An address identification apparatus, the apparatus comprising:
an extraction unit, configured to extract at least one group of word vectors and at least one word statistical quantity from the address information to be identified according to a preset classification dimension;
a prediction unit, configured to sequentially predict the association probability between one word vector and the next adjacent word vector in each group of word vectors by using a preset neural network model, to obtain a group of association probabilities corresponding to each group of word vectors; the preset neural network model is obtained through unsupervised training;
a classification unit, configured to input at least one group of association probabilities corresponding to the at least one group of word vectors and the at least one word statistical quantity into a preset classification model to obtain a classification result corresponding to the address information to be identified, the classification result being used for identifying the authenticity of the address information to be identified; the preset classification model is obtained through supervised training.
9. An address identification apparatus, the apparatus comprising: a processor, a memory, and a communication bus; the processor, when executing the execution program stored in the memory, implements the method of any of claims 1-7.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110856358.4A 2021-07-28 2021-07-28 Address recognition method and device and storage medium Pending CN113609290A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110856358.4A CN113609290A (en) 2021-07-28 2021-07-28 Address recognition method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110856358.4A CN113609290A (en) 2021-07-28 2021-07-28 Address recognition method and device and storage medium

Publications (1)

Publication Number Publication Date
CN113609290A true CN113609290A (en) 2021-11-05

Family

ID=78338502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110856358.4A Pending CN113609290A (en) 2021-07-28 2021-07-28 Address recognition method and device and storage medium

Country Status (1)

Country Link
CN (1) CN113609290A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154501A (en) * 2022-02-09 2022-03-08 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0747334A (en) * 1993-08-06 1995-02-21 Toshiba Corp Address recognizing device and address reading and classifying machine
KR20000001316A (en) * 1998-06-10 2000-01-15 윤종용 Device for processing receipt message of radio paging receiver
CN107066478A (en) * 2016-12-14 2017-08-18 阿里巴巴集团控股有限公司 A kind of method and device of address dummy information identification
CN107992501A (en) * 2016-10-27 2018-05-04 腾讯科技(深圳)有限公司 Social network information recognition methods, processing method and processing device
CN110197284A (en) * 2019-04-30 2019-09-03 腾讯科技(深圳)有限公司 A kind of address dummy recognition methods, device and equipment
CN110442856A (en) * 2019-06-14 2019-11-12 平安科技(深圳)有限公司 A kind of address information standardized method, device, computer equipment and storage medium
KR102144044B1 (en) * 2020-01-21 2020-08-12 엘아이지넥스원 주식회사 Apparatus and method for classification of true and false positivies of weapon system software static testing based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
皮琪;王文杰;杨飞;赵耀;: "基于深度学习的虚假评论识别", 网络新媒体技术, vol. 5, no. 06, 15 November 2016 (2016-11-15), pages 30 - 33 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154501A (en) * 2022-02-09 2022-03-08 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning
CN114154501B (en) * 2022-02-09 2022-04-26 南京擎天科技有限公司 Chinese address word segmentation method and system based on unsupervised learning

Similar Documents

Publication Publication Date Title
CN109302410B (en) Method and system for detecting abnormal behavior of internal user and computer storage medium
Yuan et al. Malicious URL detection based on a parallel neural joint model
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN109359439A (en) Software detecting method, device, equipment and storage medium
CN111881983A (en) Data processing method and device based on classification model, electronic equipment and medium
CN111460820A (en) Network space security domain named entity recognition method and device based on pre-training model BERT
CN111915437A (en) RNN-based anti-money laundering model training method, device, equipment and medium
CN111177367B (en) Case classification method, classification model training method and related products
CN112347367A (en) Information service providing method, information service providing device, electronic equipment and storage medium
CN108875727B (en) The detection method and device of graph-text identification, storage medium, processor
CN109165529B (en) Dark chain tampering detection method and device and computer readable storage medium
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN115473726B (en) Domain name identification method and device
CN112528894A (en) Method and device for distinguishing difference items
CN112733140B (en) Detection method and system for model inclination attack
CN111753290A (en) Software type detection method and related equipment
CN111967503A (en) Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method
CN111488574B (en) Malicious software classification method, system, computer equipment and storage medium
CN115510500A (en) Sensitive analysis method and system for text content
CN115358340A (en) Credit credit collection short message distinguishing method, system, equipment and storage medium
CN113783852B (en) Intelligent contract Ponzi scheme detection algorithm based on neural network
CN113609290A (en) Address recognition method and device and storage medium
Rahman et al. An efficient deep learning technique for bangla fake news detection
CN105808602A (en) Detection method and device of junk information
Paik et al. Malware family prediction with an awareness of label uncertainty

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination