CN107038452A - Telephone number recognition methods and device - Google Patents

Publication number
CN107038452A
CN107038452A CN201710001599.4A CN201710001599A CN107038452A CN 107038452 A CN107038452 A CN 107038452A CN 201710001599 A CN201710001599 A CN 201710001599A CN 107038452 A CN107038452 A CN 107038452A
Authority
CN
China
Prior art keywords
phone number
number section
telephone number
section
detected
Prior art date
Application number
CN201710001599.4A
Other languages
Chinese (zh)
Inventor
张杨
Original Assignee
阿里巴巴集团控股有限公司
Priority to CN201610019509X
Application filed by 阿里巴巴集团控股有限公司
Publication of CN107038452A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/62 Methods or arrangements for recognition using electronic means
    • G06K9/6201 Matching; Proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

This application discloses a telephone number recognition method and device. The method includes: searching for phone number sections in target data to be detected according to a deterministic finite state automaton, where the automaton is constructed from predetermined phone number sections and phone number section variants; and, for each phone number section found, identifying a telephone number by telephone number normal form matching. The application can detect both ordinary telephone numbers and their many variants in the target data quickly and accurately.

Description

Telephone number recognition method and device

Technical field

This application relates to the field of computer network technology, and in particular to a telephone number recognition method and device.

Background

Major Internet companies run keyword filtering systems, and some of them need to detect whether a text contains phone numbers. Number detection is usually combined with other keywords so that a complete transaction can be assessed for risk. For example, many social forums and personal websites often carry postings that sell contraband or offer sexual services and leave phone numbers in normal or variant form (such as 18810450382 and 1. 8. 8 1.=o 8. 4. (4) (5) 0=3). In such cases the phone numbers in the web page need to be detected and extracted.

At present, phone numbers are extracted from web pages mainly with regular expressions: the specific extraction rules for phone numbers are converted into a regular expression. This approach has the following deficiencies:

1. High memory usage and low runtime efficiency

In essence, regular expression engines fall into two classes: deterministic finite state automaton (Deterministic Finite Automaton, DFA) engines and non-deterministic finite state automaton (NFA) engines. Constructing a deterministic finite state automaton for a large number of phone numbers takes a lot of memory, although matching is fast. A non-deterministic finite state automaton is a backtracking engine that can handle more complex regular expressions, but it matches more slowly than a deterministic finite state automaton.

2. Poor matching precision; difficult to cope with large numbers of deformed phone numbers

It is difficult to write a regular expression that is very precise. For mobile phone number sections alone, more than 100 already exist, and their digits are not regular enough to be enumerated accurately in a regular expression. A regular expression can therefore handle only phone numbers that follow relatively simple rules and has insufficient fault tolerance: deformed phone numbers found in large numbers on illegal web pages (such as 188=1O45= 384) cannot be identified correctly.

3. Security vulnerability

If the regular expression is leaked or probed from outside, an outsider can construct a phone number that evades the current expression.

Summary of the invention

An embodiment of this application provides a telephone number recognition method for detecting telephone numbers and their various variants quickly and accurately. The method includes:

obtaining phone number sections and phone number section variants;

constructing a deterministic finite state automaton from the phone number sections and phone number section variants;

searching for phone number sections in target data to be detected according to the deterministic finite state automaton; and

for each phone number section found, identifying a telephone number by telephone number normal form matching.

An embodiment of this application also provides a telephone number recognition device for detecting telephone numbers and their various variants quickly and accurately. The device includes:

a number section acquisition module for obtaining phone number sections and phone number section variants;

an automaton construction module for constructing a deterministic finite state automaton from the phone number sections and phone number section variants;

a number section searching module for searching for phone number sections in target data to be detected according to the deterministic finite state automaton; and

a number identification module for identifying, for each phone number section found, a telephone number by telephone number normal form matching.

An embodiment of this application provides a further telephone number recognition method for detecting telephone numbers and their various variants quickly and accurately. The method includes:

searching for phone number sections in target data to be detected according to a deterministic finite state automaton, where the automaton is constructed from predetermined phone number sections and phone number section variants; and

for each phone number section found, identifying a telephone number by telephone number normal form matching.

An embodiment of this application also provides a further telephone number recognition device for detecting telephone numbers and their various variants quickly and accurately. The device includes:

a number section searching module for searching for phone number sections in target data to be detected according to a deterministic finite state automaton, where the automaton is constructed from predetermined phone number sections and phone number section variants; and

a number identification module for identifying, for each phone number section found, a telephone number by telephone number normal form matching.

In one embodiment, the phone number section variants are generated from the phone number sections, and the predetermined phone number sections and phone number section variants are stored in a database.

In one embodiment, when a new phone number section appears, the new phone number section and the new phone number section variants generated from it are added to the database.

In one embodiment, the deterministic finite state automaton includes a double-array prefix tree, and the double-array prefix tree includes a state array and a predecessor state array;

the number section searching module is specifically configured to input the target data to be detected into the double-array prefix tree and search for phone number sections in the target data.

In one embodiment, the device also includes:

a preprocessing module for preprocessing the target data to be detected, where the preprocessing includes one or any combination of tag removal, character conversion and character filtering;

the number section searching module is specifically configured to search for phone number sections in the preprocessed target data according to the deterministic finite state automaton.

In one embodiment, the device also includes:

a rule checking module for applying rule checks to the identified telephone numbers, where the rule checks include one or any combination of a digit group check, a digit frequency check and a number width check.

In the embodiments of this application, phone number sections in the target data to be detected are searched for according to a deterministic finite state automaton constructed from predetermined phone number sections and phone number section variants, and for each phone number section found a telephone number is identified by telephone number normal form matching. The phone number section matching part uses a deterministic finite state automaton; it only needs to match the known number sections and the deformed number sections produced by character substitution, so it occupies very little memory and matches quickly. The telephone number matching part uses telephone number normal form matching; it performs character matching within an array and is therefore efficient. The implementation can match ordinary telephone numbers and also copes easily with large numbers of deformed telephone numbers, with high matching precision. Moreover, because the whole detection is completed without regular expressions, the detection logic is not easily leaked and then evaded, which greatly improves security.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort. In the drawings:

Fig. 1 is a schematic flowchart of the telephone number recognition method in an embodiment of this application;

Fig. 2 is a schematic diagram of a specific example of the telephone number recognition method in an embodiment of this application;

Fig. 3 is an example diagram of database processing and deterministic finite state automaton construction in an embodiment of this application;

Fig. 4 is an example diagram of implementing a web page mobile phone number recognition method in JAVA in an embodiment of this application;

Fig. 5 is a schematic flowchart of another telephone number recognition method in an embodiment of this application;

Fig. 6 is a schematic structural diagram of the telephone number recognition device in an embodiment of this application;

Fig. 7 is a schematic structural diagram of one specific example of the telephone number recognition device shown in Fig. 6;

Fig. 8 is a schematic structural diagram of another specific example of the telephone number recognition device shown in Fig. 6;

Fig. 9 is a schematic structural diagram of yet another specific example of the telephone number recognition device shown in Fig. 6;

Fig. 10 is a schematic structural diagram of another telephone number recognition device in an embodiment of this application;

Fig. 11 is a schematic structural diagram of one specific example of the telephone number recognition device shown in Fig. 10;

Fig. 12 is a schematic structural diagram of another specific example of the telephone number recognition device shown in Fig. 10.

Detailed description

To make the purpose, technical solutions and advantages of the embodiments of this application clearer, the embodiments are described in further detail below with reference to the accompanying drawings. The illustrative embodiments and their descriptions are used to explain this application and do not limit it.

To detect quickly and accurately whether target data to be detected (such as a web page or text) contains telephone numbers (such as mobile numbers and landline numbers), and to adapt well to telephone numbers in various variant forms, an embodiment of this application provides a telephone number recognition method.

The deterministic finite state automaton referred to in the embodiments of this application is an automaton that implements state transitions. Given a state belonging to the automaton and a character belonging to the automaton's alphabet Σ, it moves to the next state according to a predefined transition function (the next state may be the same as the previous one). In the embodiments of this application, the deterministic finite state automaton is constructed from phone number sections and phone number section variants.

According to one embodiment of this application, when the deterministic finite state automaton is constructed, a double-array prefix tree may for example be built from preset phone number sections and phone number section variants. The preset phone number sections and variants may be all phone number sections and variants or a preset subset of them. A Trie (prefix tree, or dictionary tree) is an ordered tree used to store an associative array. A double-array Trie (Double-Array Trie) includes a state array (the base array) and a predecessor state array (the check array). Each element of the base array represents one Trie node, that is, one state; the check array records the predecessor state of each state.

The process of constructing a deterministic finite state automaton from phone number sections and phone number section variants is illustrated below with an example that builds a double-array Trie. The specific steps may include:

1. Initialize the array base[], which represents states, and the array check[], which is used to check predecessor states. Both are of type int[]. The initial values may for example be set to base[0] = 1 and check[0] = 0.

2. For each group of sibling nodes, such as [a1, a2, a3, ... an], find a value begin such that check[begin + a1 ... an] = 0, that is, find n free slots in which to store these nodes.

3. Set the check value of every node in this group of siblings to begin: check[begin + a1 ... an] = begin.

4. If a sibling node has no children, set its base value to a negative value; otherwise insert its children under that node (begin = the base value of the current node; repeat step 2).

5. When all number sections have been inserted, construction of the deterministic finite state automaton is complete.
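
The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the function name `build_double_array`, the character-to-code mapping, the reserved code 0 that marks the end of a number section, and the `used_begins` set are all choices made for the example, and the input list is assumed to be non-empty.

```python
def build_double_array(sections):
    """Lay out a trie of the given strings in base/check arrays."""
    # Character codes start at 1; code 0 is reserved as an end-of-section marker.
    chars = sorted({c for s in sections for c in s})
    code = {c: i + 1 for i, c in enumerate(chars)}

    # Build an ordinary nested-dict trie first: {code: subtree}, end marker -> None.
    root = {}
    for s in sections:
        node = root
        for c in s:
            node = node.setdefault(code[c], {})
        node[0] = None

    base, check = [1], [0]  # step 1: base[0] = 1, check[0] = 0
    used_begins = set()     # each begin value may identify only one parent

    def place(children):
        codes = sorted(children)
        # Step 2: find a begin whose slots begin + code are all free (check == 0).
        begin = 1
        while begin in used_begins or any(
            begin + c < len(check) and check[begin + c] != 0 for c in codes
        ):
            begin += 1
        used_begins.add(begin)
        grow = begin + codes[-1] + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([0] * grow)
        # Step 3: mark every sibling slot with the begin value of its parent.
        for c in codes:
            check[begin + c] = begin
        # Step 4: leaves get a negative base; other nodes recurse into children.
        for c in codes:
            sub = children[c]
            base[begin + c] = -1 if sub is None else place(sub)
        return begin

    base[0] = place(root)  # step 5: done once every section has been placed
    return base, check, code
```

For the toy dictionary `["130", "188"]` the root group is placed at begin = 1, so `base[0]` keeps the initial value 1 from step 1.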

After the deterministic finite state automaton has been constructed, phone number sections in the target data to be detected are searched for according to it. In an embodiment, the target data to be detected may be input into the double-array prefix tree constructed above to search for phone number sections. For example, the process of checking in the constructed double-array Trie whether the target data contains a phone number section may include:

1. Define the current state p as base[0] = 1, and query each character of the string char to be looked up in turn;

2. Let n be the index of the character currently being looked up, so that the newly input character is char[n] and the new state to jump to is base[char[n-1]] + char[n]. Check the check array: if check[base[char[n-1]] + char[n]] = base[char[n-1]], the match succeeds and the next match starts from the current state. Otherwise the match fails and the matching process ends.
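
The lookup can be sketched as a scan over the target text. This is a minimal illustration, not the implementation of this application: the arrays `BASE` and `CHECK` below are a hand-laid-out toy example for the dictionary {130, 188}, and the `CODE` table, the reserved end-of-section code 0 and the function name `find_sections` are assumptions made for the example.

```python
# Hand-built toy double array for the dictionary {"130", "188"} (an assumption
# for this example). Code 0 marks the end of a section; a negative base marks a leaf.
BASE = [1, 0, 0, 2, 7, 3, 4, -1, 9, -1]
CHECK = [0, 0, 0, 1, 3, 2, 2, 7, 4, 9]
CODE = {"0": 1, "1": 2, "3": 3, "8": 4}


def find_sections(text):
    """Return (start index, section) for every dictionary section in text."""
    hits = []
    for start in range(len(text)):
        state = BASE[0]  # step 1: start from base[0]
        for pos in range(start, len(text)):
            c = CODE.get(text[pos])
            if c is None:
                break
            nxt = state + c  # step 2: candidate next slot
            if nxt >= len(CHECK) or CHECK[nxt] != state:
                break        # the check array disagrees: match fails
            state = BASE[nxt]
            # A complete section ends here if the end marker (code 0) is a leaf child.
            if state < len(CHECK) and CHECK[state] == state and BASE[state] < 0:
                hits.append((start, text[start:pos + 1]))
    return hits
```

Each character costs one addition and one comparison against the check array, which is why the section matching stage stays fast even when many variants are in the dictionary.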

As shown in Fig. 1, the telephone number recognition method in an embodiment of this application may include:

Step 101: obtain phone number sections and phone number section variants. In some embodiments all phone number sections and variants may be obtained, or a preset subset of them may be obtained;

Step 102: construct a deterministic finite state automaton from the phone number sections and phone number section variants;

Step 103: search for phone number sections in the target data to be detected according to the deterministic finite state automaton;

Step 104: for each phone number section found, identify a telephone number by telephone number normal form matching.

As the flow shown in Fig. 1 makes clear, compared with prior-art schemes that construct a deterministic or non-deterministic finite state automaton for a large number of phone numbers, the phone number section matching part in the embodiments of this application uses a deterministic finite state automaton that only needs to match the known number sections (taking mobile phone number sections as an example, only a little over 200 are known) and the deformed number sections produced by character substitution, so it occupies very little memory and matches quickly. The telephone number matching part uses telephone number normal form matching, which performs character matching within an array and is therefore efficient. In addition, the matching precision of the embodiments is high, and large numbers of deformed telephone numbers are handled easily. Because the whole detection is completed without regular expressions, the detection logic is not easily leaked and then evaded.

In a specific implementation, the phone number section is used as the entry point for telephone number matching, and a deterministic finite state automaton is constructed from the phone number sections and phone number section variants to improve matching efficiency. Phone number sections and variants are obtained first, providing the dictionary for the subsequent construction of the automaton. In an embodiment, they may be stored in a database and obtained from it. Before that, the phone number sections must first be obtained and the variants generated from them, which strengthens recognition of variant telephone numbers; both are then stored in the database. To cope with the various variant telephone numbers encountered in practice, generating variants automatically from a phone number section mainly involves character substitution: for the mobile phone number section 130, for example, the sections i30, 13o and i3o are generated, and each is stored in the database automatically.
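
The character substitution just described can be sketched as follows. This is a minimal illustration: the substitution table `LOOKALIKES` and the function name `section_variants` are assumptions for the example, and the table contains only the two substitutions that appear in the 130 example above.

```python
from itertools import product

# Hypothetical look-alike table: each digit maps to the characters that may
# stand in for it (the digit itself comes first).
LOOKALIKES = {"0": "0o", "1": "1i"}


def section_variants(section):
    """Return every character-substituted form of section, excluding section itself."""
    choices = [LOOKALIKES.get(ch, ch) for ch in section]
    return {"".join(combo) for combo in product(*choices)} - {section}
```

For the section 130 this yields i30, 13o and i3o, matching the example in the text. A real deployment would presumably maintain a larger substitution table.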

Matching with regular expressions, as in the prior art, is inflexible: when new features need to be matched, the whole regular expression often has to be changed. In the embodiments of this application, by contrast, the currently known phone number sections can be added to the database dynamically and the deterministic finite state automaton constructed automatically. In a specific implementation, when a new phone number section appears, new phone number section variants can be generated from it, and both the new section and its variants are added to the database. Phone number sections usually change very slowly, so the known number sections can be added to the database in one batch at the start and further sections added promptly whenever an operator releases a new one.

Once the phone number sections and phone number section variants have been stored in the database, the deterministic finite state automaton can be built from the dictionary in the database. After the automaton has been constructed, phone number sections in the target data to be detected are searched for according to it.

In an embodiment, to further improve the accuracy of telephone number detection, the target data to be detected may also be preprocessed before phone number sections are searched for, and the search is then carried out on the preprocessed target data according to the deterministic finite state automaton. The preprocessing may for example include one or any combination of tag removal, character conversion and character filtering. For example, web page text may contain a large number of html tags; the htmlparser open source project can be used to remove them and obtain plain text. As another example, the phone numbers on illegal web pages are often variants with assorted special characters mixed in, such as 1=881O= 4. 450=38. Character conversion and character filtering, such as case conversion, digit conversion and special character filtering, can be applied to the plain text obtained; in a specific implementation, custom libraries of conversion and filtering characters can be defined and applied by comparison. The number above, for instance, becomes 18810450384 after conversion. In an embodiment, the mapping between the converted text and the original text may be kept in an array, so that a position in the converted text can later be mapped back to its position in the original text by querying the array.
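
The preprocessing and the position-mapping array can be sketched together. This is a minimal illustration, not the implementation of this application: it strips tags with a naive angle-bracket scan rather than htmlparser, and the `FILTERED` character set and the function name `preprocess` are assumptions for the example.

```python
import unicodedata

# Hypothetical set of special characters to filter out.
FILTERED = set("=.()-*# \t")


def preprocess(raw):
    """Return (converted text, origin), where origin[i] is the index in raw
    of the character that produced converted text[i]."""
    text, origin = [], []
    in_tag = False
    for idx, ch in enumerate(raw):
        if ch == "<":                 # tag removal (naive: no comments or scripts)
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag and ch not in FILTERED:
            # Digit conversion: full-width and similar digits become ASCII digits.
            digit = unicodedata.digit(ch, None)
            # Case conversion for everything else.
            text.append(str(digit) if digit is not None else ch.lower())
            origin.append(idx)
    return "".join(text), origin
```

A match found at positions i to j of the converted text then corresponds to the span `origin[i]` to `origin[j]` of the original text.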

After phone number sections have been found in the target data to be detected, telephone numbers must be identified from them by telephone number normal form matching. A normal form is a generally accepted data structure with a certain format, that is, a well-specified, generally accepted form of data. For a Chinese mobile number, for example, the normal form may be 11 digits: operator number section (3 digits) + area number section (4 digits) + subscriber number (4 digits). The definition of a telephone number is not strictly regulated. Taking mobile phone numbers as an example, matching may use only the 11-digit length and the 3-digit special number section at the front as identification conditions. For example, taking China's country code 86 into account, the normal form of a number is 11 consecutive digits beginning with a 3-digit special number section, or 13 consecutive digits consisting of the prefix 86 followed by digits that satisfy the above condition. Such rules determine whether a string is a mobile phone number. In an embodiment, considering the many variant numbers on illegal web pages, fault-tolerant handling may be applied to special characters such as o and i, which are also treated as digits when encountered.
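
The mobile number normal form described above can be sketched as a check that starts at the position where a number section was found. This is a minimal illustration: the `TOLERATED` table and the function names `as_digit` and `match_normal_form` are assumptions for the example, and the sketch does not verify what follows the eleventh digit.

```python
# Hypothetical fault-tolerance table: characters treated as digits.
TOLERATED = {"o": "0", "O": "0", "i": "1", "I": "1"}


def as_digit(ch):
    """Return the ASCII digit that ch stands for, or None."""
    return ch if ch in "0123456789" else TOLERATED.get(ch)


def match_normal_form(text, start, length=11):
    """Return the normalized number starting at text[start], or None.

    start is the index at which a 3-digit number section was found.
    """
    digits = []
    for ch in text[start:start + length]:
        d = as_digit(ch)
        if d is None:
            return None
        digits.append(d)
    if len(digits) < length:  # text ended before 11 digits were read
        return None
    number = "".join(digits)
    # Optional country code 86 directly in front gives the 13-digit form.
    if text[max(0, start - 2):start] == "86":
        number = "86" + number
    return number
```

Because the check walks a fixed-length slice character by character, its cost per candidate is bounded by the length of the normal form.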

In an embodiment, the matching of phone number sections and normal forms described above is quite aggressive and may over-match in some cases. For the over-matching problems met in practice, a series of special rules can be set to re-check and judge the results. In addition, matching with regular expressions, as in the prior art, is inflexible: when new features need to be matched, the whole regular expression often has to be changed. In the embodiments of this application, by contrast, feature matching beyond the telephone number normal form, such as the special rule checks, can use a multi-rule matching approach: when a new feature needs to be matched, only one of the rules has to be added or modified, which gives greater flexibility and adaptability. In a specific implementation, the telephone numbers identified by normal form matching are subjected to further rule checks. These may for example include one or any combination of a digit group check, a digit frequency check and a number width check, all of which are practical and convenient.

The digit group check may be applied in the following scenario: some data-heavy web pages contain many digit strings that resemble telephone numbers. For this case a digit group rule can be set: check whether the characters at both ends of the number are digits or digit connectors such as "-", to decide whether the number lies within a digit group. For cases with three or more consecutive numbers, a markpoint flag can be added to record the position of the previous number, so that a number immediately following a digit group can be let through.
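
The first half of the digit group rule, inspecting the characters at both ends of the matched number, can be sketched as follows. This is a minimal illustration: the function name `in_digit_group` and the `connectors` parameter are assumptions for the example, and the markpoint handling for runs of numbers is not modeled.

```python
def in_digit_group(text, start, end, connectors="-"):
    """Return True if the number text[start:end] is embedded in a digit group,
    that is, if a neighbouring character is a digit or a digit connector."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""

    def groupish(ch):
        # Guard against the empty string, which is "in" every Python string.
        return ch != "" and (ch in "0123456789" or ch in connectors)

    return groupish(before) or groupish(after)
```

A number flagged by this check would be treated as part of a longer digit string rather than as a telephone number.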

The digit frequency check may be applied in the following scenario: the many irregular html tags in web pages prevent htmlparser from removing all tags cleanly, so large chunks of css strings are often left behind, and the digit strings in them easily cause interference. In the actual application scenario the pages are mainly Chinese whereas css consists mainly of English characters, so the frequency of English characters and digits within a certain distance of the identified number is counted and a threshold is set; a number whose surroundings exceed the threshold can be considered css interference.
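
The frequency rule can be sketched as a ratio computed over a window around the number. This is a minimal illustration: the function name `looks_like_css` and the default `window` and `threshold` values are assumptions for the example, not values given in this application.

```python
def looks_like_css(text, start, end, window=30, threshold=0.6):
    """Return True if the text around the number text[start:end] is dominated
    by English letters and digits, which suggests leftover css."""
    lo, hi = max(0, start - window), min(len(text), end + window)
    context = text[lo:start] + text[end:hi]
    if not context:
        return False
    # Count ASCII letters and digits only; Chinese characters are not ASCII.
    noisy = sum(1 for ch in context if ch.isascii() and ch.isalnum())
    return noisy / len(context) > threshold
```

Both parameters would need tuning against real pages.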

The number width check may be applied in the following scenario: telephone numbers assembled from digits that span tags are another source of interference. Their characteristic is that the digits are far apart across the tags, so that when the match is mapped back to the original web page text the width of the whole number is abnormally large. Setting a reasonable width threshold effectively prevents this interference.
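
The width rule can be sketched with a position-mapping array of the kind described for preprocessing. This is a minimal illustration: the function name `width_ok`, the `origin` array layout and the default `max_width` are assumptions for the example.

```python
def width_ok(origin, start, end, max_width=30):
    """Return True if the number at converted positions start..end-1 spans
    at most max_width characters of the original text.

    origin[i] is the original-text index of converted character i.
    """
    return origin[end - 1] - origin[start] + 1 <= max_width
```

A number whose digits sit in different tags far apart in the original page fails this check even though its digits are adjacent in the converted text.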

Fig. 2 is a schematic diagram of a specific example of the telephone number recognition method in an embodiment of this application. As shown in Fig. 2, in this example the preset phone number sections are obtained first, variant number sections are generated, and both are stored in a database; the preset phone number sections may be all phone number sections or a preset subset. A deterministic finite state automaton is then constructed from the phone number sections and phone number section variants. After the target data to be detected, for example web page text, has been preprocessed, phone number sections in it are searched for according to the automaton; the preprocessing includes special character conversion, special character filtering and the like. After a phone number section is matched, telephone number normal form matching is performed. Finally, special rule checks such as the digit group check, digit frequency check and number width check are applied to the identified telephone numbers. Fig. 3 is an example diagram of database processing and deterministic finite state automaton construction in an embodiment of this application. As shown in Fig. 3, phone number sections are added to the database, and the phone number section variants generated from them are also stored in the database; the deterministic finite state automaton is constructed from the phone number sections and variants in the database.

The telephone number recognition method of the embodiments of this application can be implemented in a mainstream programming language such as JAVA or C++. Fig. 4 is an example diagram of implementing a web page mobile phone number recognition method in JAVA. As shown in Fig. 4, in the JAVA core system, mobile phone number sections are inserted first, mobile phone number section variants are generated automatically, and both are stored in a MYSQL database; a deterministic finite state automaton in the form of a double-array prefix tree is then constructed from the mobile phone number sections and variants in the MYSQL database. After a web page to be matched is input, the web page text is preprocessed, mobile phone number sections are matched with the automaton, mobile phone number normal form matching is performed once a section matches, and finally the special rule checks are applied and the matching result is output.

In another embodiment, the construction of the deterministic finite state automaton described above may be carried out by a device that implements this function, and that device may be different from the device that subsequently performs telephone number recognition. As shown in Fig. 5, this example provides another telephone number recognition method, including:

Step 501: search for phone number sections in the target data to be detected according to a deterministic finite state automaton, where the automaton is constructed from predetermined phone number sections and phone number section variants;

Step 502: for each phone number section found, identify a telephone number by telephone number normal form matching.

The telephone number recognition method shown in Fig. 5 is carried out by the device that performs telephone number recognition, which is different from the device that constructs the deterministic finite state automaton.

In one embodiment, phone number section variant is generated according to phone number section;Predetermined telephone number section and phone number section become Formula is stored in database.When there is new phone number section, the database is added into new phone number section and according to new electricity Talk about the new phone number section variant of number section generation.

In one embodiment, when deterministic finite state automata includes two array prefix trees, the two arrays prefix Tree includes state array and forerunner's state array, and target data to be detected can be inputted in the two arrays prefix trees, searches Phone number section in target data to be detected.

In embodiment can equally foregoing preprocessing process be performed to target data to be detected, can also be to the electricity that identifies Talk about number and perform aforementioned rule checking process.

Based on the same inventive concept, an embodiment of the present application further provides a telephone number recognition device, as described in the following embodiments. Because the principle by which the device solves the problem is similar to that of the telephone number recognition method, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.

Fig. 6 is a structural schematic diagram of the telephone number recognition device in an embodiment of the present application. As shown in Fig. 6, the telephone number recognition device in the embodiment of the present application can include:

A number section acquisition module 601, configured to obtain phone number sections and phone number section variants. The number section acquisition module 601 is the part of the telephone number recognition device shown in Fig. 6 responsible for obtaining phone number sections and phone number section variants; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function;

An automaton construction module 602, configured to construct a deterministic finite state automaton according to the phone number sections and phone number section variants. The automaton construction module 602 is the part of the telephone number recognition device shown in Fig. 6 responsible for constructing the deterministic finite state automaton; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function;

A number section search module 603, configured to search for phone number sections in the target data to be detected according to the deterministic finite state automaton. The number section search module 603 is the part of the telephone number recognition device shown in Fig. 6 responsible for searching for phone number sections; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function;

A number recognition module 604, configured to recognize, for each phone number section found, the telephone number by telephone number normal form matching. The number recognition module 604 is the part of the telephone number recognition device shown in Fig. 6 responsible for telephone number normal form matching; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.

In one embodiment, the number section acquisition module 601 can specifically be used to obtain the phone number sections and phone number section variants from a database. As shown in Fig. 7, in this example the telephone number recognition device shown in Fig. 6 can also include:

A database processing module 701, configured to, before the number section acquisition module obtains the phone number sections and phone number section variants from the database, obtain phone number sections, generate phone number section variants according to the phone number sections, and store the phone number sections and phone number section variants in the database. The database processing module 701 is the part of the telephone number recognition device shown in Fig. 7 responsible for database processing; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.

In a specific implementation, the database processing module 701 can also be used to:

when a new phone number section appears, generate new phone number section variants according to the new phone number section, and add the new phone number section and the new phone number section variants to the database.

In a specific implementation, the automaton construction module 602 can specifically be used to construct a double-array prefix tree according to the phone number sections and phone number section variants, the double-array prefix tree including a state array and a predecessor state array;

The number section search module 603 can specifically be used to input the target data to be detected into the double-array prefix tree and search for the phone number sections in the target data to be detected.

Fig. 8 is a specific example diagram of the telephone number recognition device shown in Fig. 6 in an embodiment of the present application. As shown in Fig. 8, the telephone number recognition device shown in Fig. 6 can also include:

A preprocessing module 801, configured to preprocess the target data to be detected, where the preprocessing includes one or any combination of tag removal, character conversion, and character filtering. The preprocessing module 801 is the part of the telephone number recognition device shown in Fig. 8 responsible for preprocessing the target data to be detected; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.
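The three preprocessing operations can be illustrated with a short Java sketch. The concrete rules are assumptions, since this section names the operations without specifying them: tag removal is approximated with a simple regular expression, character conversion maps full-width forms to their ASCII counterparts, and character filtering drops whitespace, hyphens, and periods. The class name `PreprocessSketch` is hypothetical.

```java
final class PreprocessSketch {
    /** Applies tag removal, character conversion, and character filtering, in that order. */
    static String preprocess(String html) {
        // 1. Tag removal: strip anything that looks like a markup tag (a rough approximation).
        String text = html.replaceAll("<[^>]*>", "");

        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // 2. Character conversion: full-width forms U+FF01..U+FF5E map to ASCII 0x21..0x7E.
            if (c >= '\uFF01' && c <= '\uFF5E') {
                c = (char) (c - 0xFEE0);
            }
            // 3. Character filtering: drop separators often inserted inside numbers (assumed set).
            if (Character.isWhitespace(c) || c == '-' || c == '.') {
                continue;
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

For example, `preprocess("<p>Tel: 138-1234 5678</p>")` yields `Tel:13812345678`, so a number split by markup or separators becomes one contiguous digit run before section lookup.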

The number section search module 603 can specifically be used to search, according to the deterministic finite state automaton, for the phone number sections in the preprocessed target data to be detected. In embodiments, the preprocessing module 801 can also be included in the telephone number recognition device shown in Fig. 7.

Fig. 9 is a specific example diagram of the telephone number recognition device shown in Fig. 6 in an embodiment of the present application. As shown in Fig. 9, the telephone number recognition device shown in Fig. 6 can also include:

A rule checking module 901, configured to perform a rule check on the recognized telephone numbers, where the rule check includes one or any combination of a digit group check, a digit frequency check, and a digit width check. In embodiments, the rule checking module 901 can also be included in the telephone number recognition device shown in Fig. 7 or Fig. 8. The rule checking module 901 is the part of the telephone number recognition device shown in Fig. 9 responsible for the telephone number rule check; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.
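A rule check of this kind could look like the Java sketch below. This section names the three checks without defining them, so every interpretation here is an assumption made purely for illustration: the digit frequency check rejects numbers in which one digit occurs too often, the digit group check rejects long runs of one repeated character, and the digit width check rejects numbers that mix half-width and full-width digits. The class name `RuleCheckSketch` and both thresholds are hypothetical.

```java
final class RuleCheckSketch {
    // Hypothetical thresholds, chosen only to make the example concrete.
    static final int MAX_SAME_DIGIT = 8;
    static final int MAX_RUN = 6;

    static boolean passes(String number) {
        return frequencyOk(number) && groupOk(number) && widthOk(number);
    }

    /** Assumed digit frequency check: no single digit may occur more than MAX_SAME_DIGIT times. */
    static boolean frequencyOk(String number) {
        int[] counts = new int[10];
        for (char c : number.toCharArray()) {
            int d = Character.digit(c, 10); // also understands full-width digits
            if (d >= 0 && ++counts[d] > MAX_SAME_DIGIT) {
                return false;
            }
        }
        return true;
    }

    /** Assumed digit group check: no run of one repeated character longer than MAX_RUN. */
    static boolean groupOk(String number) {
        int run = 0;
        for (int i = 0; i < number.length(); i++) {
            run = (i > 0 && number.charAt(i) == number.charAt(i - 1)) ? run + 1 : 1;
            if (run > MAX_RUN) {
                return false;
            }
        }
        return true;
    }

    /** Assumed digit width check: digits must be all half-width or all full-width. */
    static boolean widthOk(String number) {
        boolean sawAscii = false;
        boolean sawFullWidth = false;
        for (char c : number.toCharArray()) {
            if (c >= '0' && c <= '9') {
                sawAscii = true;
            } else if (c >= '\uFF10' && c <= '\uFF19') {
                sawFullWidth = true;
            }
        }
        return !(sawAscii && sawFullWidth);
    }
}
```

Because the checks are independent predicates, any one of them or any combination can be applied, which matches the "one or any combination" wording above.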

Fig. 10 is a structural schematic diagram of another telephone number recognition device in an embodiment of the present application. As shown in Fig. 10, the telephone number recognition device in the embodiment of the present application can include:

A number section search module 1001, configured to search for phone number sections in the target data to be detected according to a deterministic finite state automaton, the deterministic finite state automaton being constructed according to predetermined phone number sections and phone number section variants. The number section search module 1001 is the part of the telephone number recognition device shown in Fig. 10 responsible for searching for phone number sections; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function;

A number recognition module 1002, configured to recognize, for each phone number section found, the telephone number by telephone number normal form matching. The number recognition module 1002 is the part of the telephone number recognition device shown in Fig. 10 responsible for telephone number normal form matching; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.

In one embodiment, the phone number section variants are generated according to the phone number sections, and the predetermined phone number sections and phone number section variants are stored in a database.

In a specific implementation, when a new phone number section appears, the new phone number section and the new phone number section variants generated from it are added to the database.

In a specific implementation, the deterministic finite state automaton includes a double-array prefix tree, and the double-array prefix tree includes a state array and a predecessor state array;

The number section search module 1001 can specifically be used to input the target data to be detected into the double-array prefix tree and search for the phone number sections in the target data to be detected.

Fig. 11 is a specific example diagram of the telephone number recognition device shown in Fig. 10 in an embodiment of the present application. As shown in Fig. 11, the telephone number recognition device shown in Fig. 10 can also include:

A preprocessing module 1101, configured to preprocess the target data to be detected, where the preprocessing includes one or any combination of tag removal, character conversion, and character filtering. The preprocessing module 1101 is the part of the telephone number recognition device shown in Fig. 11 responsible for preprocessing the target data to be detected; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.

The number section search module 1001 can specifically be used to search, according to the deterministic finite state automaton, for the phone number sections in the preprocessed target data to be detected.

Fig. 12 is a specific example diagram of the telephone number recognition device shown in Fig. 10 in an embodiment of the present application. As shown in Fig. 12, the telephone number recognition device shown in Fig. 10 can also include:

A rule checking module 1201, configured to perform a rule check on the recognized telephone numbers, where the rule check includes one or any combination of a digit group check, a digit frequency check, and a digit width check. In embodiments, the rule checking module 1201 can also be included in the telephone number recognition device shown in Fig. 11. The rule checking module 1201 is the part of the telephone number recognition device shown in Fig. 12 responsible for the telephone number rule check; it can be software, hardware, or a combination of the two, for example a component such as a processing chip that performs this function.

In summary, in the embodiments of the present application, phone number sections in the target data to be detected are searched for according to a deterministic finite state automaton that is constructed from predetermined phone number sections and phone number section variants, and, for each phone number section found, the telephone number is recognized by telephone number normal form matching. The phone number section matching part uses the constructed deterministic finite state automaton; this part only needs to match known number sections and the variant number sections obtained by character substitution, so it occupies very little memory and matches quickly. The telephone number matching part uses telephone number normal form matching; this part performs character matching in arrays and is therefore efficient. The implementation can match not only ordinary telephone numbers but also large numbers of variant telephone numbers, with high matching precision. In addition, because the approach is not an expression-based detection method and the whole detection is completed by one complete telephone number recognition method or device, it is not easily cracked in a way that would let numbers evade detection, which greatly improves security. The character conversion and rule judgment parts also use character matching in arrays and are therefore efficient.

Those skilled in the art should understand that embodiments of the present application can be provided as a method, a system, or a computer program product. Therefore, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The specific embodiments described above further explain in detail the purpose, technical solutions, and beneficial effects of the present application. It should be understood that the foregoing are merely specific embodiments of the present application and are not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (7)

1. A telephone number recognition method, characterized by comprising:
searching for phone number sections in target data to be detected according to a deterministic finite state automaton, the deterministic finite state automaton being constructed according to predetermined phone number sections and phone number section variants; and
for each phone number section found, recognizing a telephone number by telephone number normal form matching.
2. The method of claim 1, characterized in that the phone number section variants are generated according to phone number sections, and the predetermined phone number sections and phone number section variants are stored in a database.
3. The method of claim 2, characterized in that, when a new phone number section appears, the new phone number section and new phone number section variants generated according to the new phone number section are added to the database.
4. The method of claim 1, characterized in that the deterministic finite state automaton includes a double-array prefix tree, the double-array prefix tree including a state array and a predecessor state array;
searching for phone number sections in the target data to be detected according to the deterministic finite state automaton comprises: inputting the target data to be detected into the double-array prefix tree and searching for the phone number sections in the target data to be detected.
5. The method of claim 1, characterized in that, before searching for phone number sections in the target data to be detected according to the deterministic finite state automaton, the method further comprises: preprocessing the target data to be detected, the preprocessing including one or any combination of tag removal, character conversion, and character filtering;
searching for phone number sections in the target data to be detected according to the deterministic finite state automaton comprises: searching, according to the deterministic finite state automaton, for the phone number sections in the preprocessed target data to be detected.
6. The method of any one of claims 1 to 5, characterized by further comprising:
performing a rule check on the recognized telephone number, the rule check including one or any combination of a digit group check, a digit frequency check, and a digit width check.
7. A telephone number recognition device, characterized by comprising:
a number section search module, configured to search for phone number sections in target data to be detected according to a deterministic finite state automaton, the deterministic finite state automaton being constructed according to predetermined phone number sections and phone number section variants; and
a number recognition module, configured to recognize, for each phone number section found, a telephone number by telephone number normal form matching.
Application CN201710001599.4A, published as CN107038452A (en): Telephone number recognition methods and device. Priority date 2016-01-13; filing date 2017-01-03.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610019509X 2016-01-13
CN201610019509 2016-01-13

Publications (1)

Publication Number Publication Date
CN107038452A (en) 2017-08-11

Family

ID=59530442


Country Status (1)

Country Link
CN (1) CN107038452A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination