CN109359274A - Method, apparatus, and device for identifying batch-generated character strings - Google Patents

Info

Publication number
CN109359274A
Authority
CN
China
Prior art keywords
character string
identified
probability
substring
occurs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811074092.2A
Other languages
Chinese (zh)
Other versions
CN109359274B (en)
Inventor
江大鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANT Financial Hang Zhou Network Technology Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811074092.2A
Publication of CN109359274A
Application granted
Publication of CN109359274B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

This specification discloses a method, apparatus, and device for identifying batch-generated character strings. The method comprises: receiving a batch-generated character string to be identified; splitting the character string to be identified to obtain at least one substring of the character string to be identified; determining the probability of occurrence of the at least one substring, and determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring; and judging, according to the degree of randomness of the character string to be identified, whether it is a randomly generated character string.

Description

Method, apparatus, and device for identifying batch-generated character strings
Technical field
This specification relates to the field of computer technology, and in particular to a method, apparatus, and device for identifying batch-generated character strings.
Background
With the development and widespread adoption of Internet technology, more and more character strings on network platforms are generated automatically and in batches by machines. Take batch-registered accounts as an example: such accounts can use the various functions of a platform. Because they are not used by ordinary users, they bring a large amount of junk content to the platform and can even cause financial loss. For example, paid commenters (the "water army") on news applications use many accounts to post highly similar opinions within a short time, steering public opinion and degrading the experience of normal users. As another example, e-commerce websites attract bonus hunters (the "wool party") who use batch-registered accounts to collect the site's subsidies, so that marketing funds are seriously wasted and marketing effectiveness is greatly reduced.
In the prior art, such accounts are identified with supervised classification algorithms such as LR and SVM. These algorithms require a large number of accounts to be manually labeled as ordinary or random in order to obtain training data and train a classification model, which then classifies the input accounts; this consumes a great deal of manual effort. Moreover, because a short character string carries very little information, a classification model performs poorly on short character strings and cannot identify them well.
Summary of the invention
The embodiments of this specification provide a method, apparatus, and device for identifying batch-generated character strings. They address two problems: manually labeling a large number of accounts consumes considerable manual effort, and classification models perform poorly on short character strings.
To solve the above technical problems, the embodiments of this specification are implemented as follows.
A method for identifying batch-generated character strings, provided by the embodiments of this specification, comprises:
receiving a batch-generated character string to be identified;
splitting the character string to be identified to obtain at least one substring of the character string to be identified;
determining the probability of occurrence of at least one substring of the character string to be identified, and determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring;
judging, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
An apparatus for identifying batch-generated character strings, provided by the embodiments of this specification, comprises a receiving module, a splitting module, a determining module, and a judging module;
the receiving module is configured to receive a batch-generated character string to be identified;
the splitting module is configured to split the character string to be identified to obtain at least one substring of the character string to be identified;
the determining module is configured to determine the probability of occurrence of at least one substring of the character string to be identified, and to determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring;
the judging module is configured to judge, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
A device for identifying batch-generated character strings, provided by the embodiments of this specification, comprises a memory and a processor; the memory stores a program that is configured to be executed by the processor to perform the above method for identifying batch-generated character strings.
At least one of the above technical solutions adopted by the embodiments of this specification can achieve the following beneficial effects. The degree of randomness of a character string is determined from the probabilities of occurrence of its substrings, and that degree of randomness is then used to judge whether the character string was randomly generated; the whole process requires no manual labeling of large amounts of training data, which saves labor cost. Sample character string data can be selected specifically for the type of character string to be identified. Identification of short character strings is improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for identifying batch-generated character strings provided by the embodiments of this specification;
Fig. 2 is another schematic flowchart of a method for identifying batch-generated character strings provided by the embodiments of this specification;
Fig. 3 is a schematic structural diagram of an apparatus for identifying batch-generated character strings provided by the embodiments of this specification.
Detailed description
The embodiments of this specification provide a method, apparatus, and device for identifying batch-generated character strings.
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this specification without creative effort shall fall within the scope of protection of the present application.
Fig. 1 is a schematic flowchart of a method for identifying batch-generated character strings provided by the embodiments of this specification. The flow includes:
Step 105: receive a batch-generated character string to be identified.
In the embodiments of this specification, accounts on major network platforms are taken as an example; each account is a character string formed by concatenating characters. A machine-generated account is very likely to be a random string of characters, such as "iehfdjksyneyg", whereas most accounts registered by ordinary users use a character string that carries some meaning, such as "ilovekobe". The degree of randomness of a machine-generated account string is therefore much greater than that of an account string registered by an ordinary user.
As shown in step 220 of Fig. 2, a character string (the batch-generated character string to be identified) is input. In the embodiments of this specification, the character string input in step 220 is received; the description below takes the received character string "ak, ti odoe dgza" as an example.
Step 110: split the character string to be identified to obtain at least one substring of the character string to be identified.
Preferably, the character string to be identified "ak, ti odoe dgza" received in step 105 is first preprocessed to remove characters that cannot be used in an account, such as spaces and punctuation marks, which gives the preprocessed character string "aktiodoedgza". The preprocessed character string is then split to obtain at least one substring, as shown in step 225 of Fig. 2.
It should be noted that, in the embodiments of this specification, the preprocessed character string is split according to a preset character length, for example once every two characters and/or once every three characters, to obtain at least one substring.
In the embodiments of this specification, if N = 2 is taken for the N-gram model, splitting the preprocessed character string "aktiodoedgza" yields the substrings "ak", "ti", "od", "oe", "dg", and "za"; if N = 3 is taken, splitting it yields the substrings "akt", "iod", "oed", and "gza".
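The preprocessing and splitting described in step 110 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names `preprocess` and `split_fixed` are hypothetical, treating every non-alphanumeric character as unusable in an account is an assumption, and so is dropping a trailing remainder shorter than N (the text does not say how a remainder is handled).

```python
from typing import List


def preprocess(raw: str) -> str:
    """Remove spaces, punctuation, and other non-alphanumeric characters.

    Assumption: only letters and digits are valid in an account name.
    """
    return "".join(ch for ch in raw if ch.isalnum())


def split_fixed(text: str, n: int) -> List[str]:
    """Cut text into consecutive, non-overlapping pieces of n characters.

    Assumption: a trailing remainder shorter than n is discarded.
    """
    return [text[i:i + n] for i in range(0, len(text) - n + 1, n)]


cleaned = preprocess("ak, ti odoe dgza")
bigrams = split_fixed(cleaned, 2)
trigrams = split_fixed(cleaned, 3)
```

With the example string from the text, `cleaned` is "aktiodoedgza", and the two splits reproduce the substrings listed for N = 2 and N = 3.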
Step 115: determine the probability of occurrence of at least one substring of the character string to be identified, and determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring.
In the embodiments of this specification, a probability dictionary is first used to look up the probability of occurrence of each of the substrings "ak", "ti", "od", "oe", "dg", and "za" of the character string to be identified "ak, ti odoe dgza". The probability that the character string "ak, ti odoe dgza" occurs is then calculated from these substring probabilities, and the degree of randomness R of the character string is determined from it, as shown in step 230 of Fig. 2. The probability dictionary contains the correspondence between sample substrings and the probabilities of those sample substrings.
Specifically, when the probabilities that the substrings "ak", "ti", "od", "oe", "dg", and "za" each occur on their own are obtained, for example 0.79, 0.59, 0.63, 0.71, 0.56, and 0.68 respectively, the geometric mean of these six values, 0.66, is taken as the probability P that the character string "ak, ti odoe dgza" occurs. The degree of randomness of the character string is R = 1 - P, so R is 0.34.
Alternatively, when the probabilities that at least two adjacent substrings among "ak", "ti", "od", "oe", "dg", and "za" occur together are obtained, the geometric mean of those probabilities is taken as the probability P that the character string occurs. For example, if the probabilities that the adjacent pairs "ak" and "ti", "ti" and "od", "od" and "oe", "oe" and "dg", and "dg" and "za" occur together are 0.69, 0.69, 0.63, 0.71, and 0.66 respectively, their geometric mean, 0.68, is taken as the probability P that the character string "ak, ti odoe dgza" occurs. With R = 1 - P, the degree of randomness R is 0.32.
Alternatively, when both the probabilities that the substrings occur on their own and the probabilities that two adjacent substrings occur together are obtained, the arithmetic mean of the two geometric means is taken as the probability P that the character string "ak, ti odoe dgza" occurs; here P is 0.67. From this probability of 0.67, the degree of randomness R of the character string is determined to be 0.33.
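The arithmetic in step 115 can be checked with a short Python sketch. The probability values are the example figures from the text; the helper names `geometric_mean` and `randomness` are hypothetical, and the sketch assumes every probability is strictly positive, since the logarithm is undefined at zero.

```python
import math
from typing import Sequence


def geometric_mean(probs: Sequence[float]) -> float:
    """Geometric mean via logarithms; assumes all values are > 0."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def randomness(p: float) -> float:
    """Degree of randomness R = 1 - P, as defined in the text."""
    return 1.0 - p


# Example probabilities quoted in the description.
single = [0.79, 0.59, 0.63, 0.71, 0.56, 0.68]   # substrings on their own
pairs = [0.69, 0.69, 0.63, 0.71, 0.66]          # adjacent pairs together

p_single = geometric_mean(single)       # about 0.66, so R is about 0.34
p_pairs = geometric_mean(pairs)         # about 0.68, so R is about 0.32
p_combined = (p_single + p_pairs) / 2   # about 0.67, so R is about 0.33

r_single = randomness(p_single)
r_pairs = randomness(p_pairs)
r_combined = randomness(p_combined)
```

Working in log space rather than multiplying the probabilities directly is a design choice of this sketch; it avoids underflow when a string has many substrings.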
It should be noted that the probability dictionary is obtained before it is used to look up the probabilities of the substrings "ak", "ti", "od", "oe", "dg", and "za" of the character string to be identified "ak, ti odoe dgza". In the embodiments of this specification, the type of the sample character string data is the same as the type of the batch-generated character strings to be identified. Taking English as an example, English magazines, English web pages, or other normally available English articles are obtained as the sample character string data, as shown in step 205 of Fig. 2. The sample character string data is then split to obtain several sample substrings. As shown in step 210 of Fig. 2, the number of times each sample substring occurs on its own and/or the number of times at least two adjacent sample substrings occur together is counted. The probability that each sample substring occurs on its own and/or the probability that the at least two adjacent sample substrings occur together is then calculated to obtain the probability dictionary, as shown in step 215 of Fig. 2. The probability dictionary contains the sample substrings and the probabilities that they occur on their own, and/or the groups of at least two adjacent sample substrings and the probabilities that they occur together.
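One way to build such a probability dictionary from sample text is sketched below. The text does not state how counts are converted into probabilities, so this sketch assumes plain relative frequency (count divided by total); values produced this way will not be on the same scale as the example figures such as 0.79. The function name `build_probability_dictionary`, the lower-casing, and the removal of non-alphanumeric characters before splitting are likewise assumptions.

```python
from collections import Counter
from typing import Dict, Tuple


def build_probability_dictionary(
    sample_text: str, n: int = 2
) -> Tuple[Dict[str, float], Dict[Tuple[str, str], float]]:
    """Return (single_probs, pair_probs) estimated from sample_text.

    Assumption: probability = relative frequency among all pieces (or
    among all adjacent pairs); the source does not fix the normalization.
    """
    cleaned = "".join(ch.lower() for ch in sample_text if ch.isalnum())
    # Non-overlapping pieces of n characters, as in the splitting step.
    pieces = [cleaned[i:i + n] for i in range(0, len(cleaned) - n + 1, n)]

    single_counts = Counter(pieces)
    pair_counts = Counter(zip(pieces, pieces[1:]))  # adjacent pieces

    total_single = sum(single_counts.values())
    total_pairs = sum(pair_counts.values())

    single_probs = {s: c / total_single for s, c in single_counts.items()}
    pair_probs = {p: c / total_pairs for p, c in pair_counts.items()}
    return single_probs, pair_probs


single_probs, pair_probs = build_probability_dictionary("abcd ab", n=2)
```

For the toy input "abcd ab", the pieces are "ab", "cd", "ab", so "ab" gets probability 2/3 and each of the two adjacent pairs gets 1/2. A real dictionary would be built from a large corpus of the same type as the strings to be identified.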
Step 120: judge, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
In the embodiments of this specification, as shown in step 235 of Fig. 2, the degree of randomness R is compared with a preset randomness threshold. As shown in step 240 of Fig. 2, when the degree of randomness R of the character string to be identified "ak, ti odoe dgza" is greater than the preset randomness threshold, the character string "ak, ti odoe dgza" is determined to be a randomly generated character string. The preset randomness threshold equals 1 minus a preset probability threshold. The preset probability threshold is one of the following: the median of the probabilities that the sample substrings in the probability dictionary occur on their own; the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of those two medians. Taking a preset probability threshold of 0.7 as an example, the preset randomness threshold is 0.3. The degree of randomness R of the character string "ak, ti odoe dgza" obtained in step 115 is greater than the preset randomness threshold of 0.3, so the character string "ak, ti odoe dgza" is a randomly generated character string. As shown in step 245 of Fig. 2, when the degree of randomness R of a character string is not greater than the preset randomness threshold, the character string is an ordinary character string.
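The threshold rule in step 120 can be sketched as follows. The function names and the toy dictionaries are hypothetical; the sketch covers two of the three options named in the text (median of single-substring probabilities, or the arithmetic mean of that median and the median of adjacent-pair probabilities).

```python
import statistics
from typing import Dict, Optional, Tuple


def randomness_threshold(
    single_probs: Dict[str, float],
    pair_probs: Optional[Dict[Tuple[str, str], float]] = None,
) -> float:
    """Preset randomness threshold = 1 - preset probability threshold."""
    medians = [statistics.median(single_probs.values())]
    if pair_probs:
        medians.append(statistics.median(pair_probs.values()))
    probability_threshold = sum(medians) / len(medians)
    return 1.0 - probability_threshold


def is_randomly_generated(r: float, threshold: float) -> bool:
    """A string is judged random only when R is strictly greater."""
    return r > threshold


# Toy dictionary whose median is 0.7, matching the example threshold.
toy_single = {"ab": 0.6, "cd": 0.7, "ef": 0.8}
threshold = randomness_threshold(toy_single)      # approximately 0.3
verdict = is_randomly_generated(0.34, threshold)  # R from the example
```

With the example value R = 0.34 and a threshold of about 0.3, the string is judged to be randomly generated, which matches the conclusion in the text.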
Further, in the embodiments of this specification, focused prevention and control is applied to the randomly generated character string "ak, ti odoe dgza". Specifically, the permissions of the character string "ak, ti odoe dgza" are restricted, or verification of the character string "ak, ti odoe dgza" is strengthened, or the character string "ak, ti odoe dgza" is forbidden from logging in to the network platform.
Compared with the prior art, the above technical solution adopted by the embodiments of this specification can achieve the following beneficial effects. The degree of randomness of a character string is determined from the probabilities of occurrence of its substrings, and that degree of randomness is then used to judge whether the character string was randomly generated; the whole process requires no manual labeling of large amounts of training data, which saves labor cost. Sample character string data can be selected specifically for the type of character string to be identified. Identification of short character strings is improved.
Fig. 3 is a schematic structural diagram of an apparatus for identifying batch-generated character strings provided by the embodiments of this specification. The apparatus comprises a receiving module 305, a splitting module 310, a determining module 315, and a judging module 320.
The receiving module 305 is configured to receive a batch-generated character string to be identified.
The splitting module 310 is configured to split the character string to be identified to obtain at least one substring of the character string to be identified.
The determining module 315 is configured to determine the probability of occurrence of at least one substring of the character string to be identified, and to determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring.
The judging module 320 is configured to judge, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
Preferably, the determining module 315 is specifically configured to use a probability dictionary to look up the probability of occurrence of the substrings of the character string to be identified, the probability dictionary containing the correspondence between sample substrings and the probabilities of the sample substrings; and to determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substrings.
Preferably, the apparatus further comprises a probability dictionary obtaining module, configured to: split sample character string data to obtain several sample substrings; count the number of times each sample substring occurs on its own and/or the number of times at least two adjacent sample substrings occur together; and calculate the probability that each sample substring occurs on its own and/or the probability that the at least two adjacent sample substrings occur together, to obtain the probability dictionary. The probability dictionary contains the sample substrings and the probabilities that they occur on their own, and/or the groups of at least two adjacent sample substrings and the probabilities that they occur together.
Preferably, the type of the sample character string data is the same as the type of the batch-generated character strings to be identified.
Preferably, the determining module 315 is further specifically configured to determine the probability that the character string to be identified occurs according to the probability of occurrence of the substrings, and to determine the degree of randomness of the character string to be identified according to the probability that the character string to be identified occurs.
More preferably, the determining module 315 is further specifically configured to: when the probabilities that the substrings of the character string to be identified occur on their own are obtained, take the geometric mean of those probabilities as the probability P that the character string to be identified occurs; or, when the probabilities that at least two adjacent substrings of the character string to be identified occur together are obtained, take the geometric mean of those probabilities as the probability P that the character string to be identified occurs; or, when both kinds of probability are obtained, take the arithmetic mean of the two geometric means as the probability P that the character string to be identified occurs.
Further, the determining module 315 is also specifically configured to determine the degree of randomness of the character string to be identified as R = 1 - P, where P is the probability that the character string to be identified occurs.
Preferably, the judging module 320 is specifically configured to determine that the character string to be identified is a randomly generated character string when its degree of randomness R is greater than a preset randomness threshold.
Preferably, the preset randomness threshold equals 1 minus a preset probability threshold. The preset probability threshold is one of the following: the median of the probabilities that the sample substrings in the probability dictionary occur on their own; the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of those two medians.
Preferably, the apparatus further comprises a focused prevention and control module, configured to apply focused prevention and control to a randomly generated character string once the character string to be identified has been determined to be randomly generated. The focused prevention and control includes at least one of restricting permissions, strengthening verification, and forbidding login.
The embodiments of this specification also provide a device for identifying batch-generated character strings, comprising a memory and a processor. The memory stores a program that is configured to be executed by the processor to: receive a batch-generated character string to be identified; split the character string to be identified to obtain at least one substring of the character string to be identified; determine the probability of occurrence of at least one substring of the character string to be identified, and determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring; and judge, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus; the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any of their variants are intended to be non-exclusive, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, commodity, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that includes the element.
The above are only embodiments of this specification and are not intended to limit it. Those skilled in the art may make various modifications and variations to this specification. Any modification, equivalent replacement, or improvement made within the spirit and principle of this specification shall fall within the scope of its claims.

Claims (21)

1. A method for identifying batch-generated character strings, characterized in that the method comprises:
receiving a batch-generated character string to be identified;
segmenting the character string to be identified to obtain at least one substring of the character string to be identified;
determining the probability of occurrence of the at least one substring of the character string to be identified, and determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring;
judging, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
2. The method for identifying batch-generated character strings according to claim 1, characterized in that determining the probability of occurrence of the at least one substring of the character string to be identified, and determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring, comprises:
matching, using a probability dictionary, the probability of occurrence of the substring of the character string to be identified, wherein the probability dictionary comprises a correspondence between sample substrings and the probabilities of the sample substrings;
determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring.
3. The method for identifying batch-generated character strings according to claim 2, characterized in that, before matching, using the probability dictionary, the probability of occurrence of the substring of the character string to be identified, the method further comprises:
segmenting sample character string data to obtain a plurality of sample substrings;
counting the number of times each of the plurality of sample substrings occurs individually and/or the number of times at least two adjacent sample substrings occur together;
calculating the probability that each of the plurality of sample substrings occurs individually and/or the probability that the at least two adjacent sample substrings occur together, to obtain the probability dictionary;
wherein the probability dictionary comprises the plurality of sample substrings and the probabilities that they occur individually, and/or the at least two adjacent sample substrings and the probabilities that they occur together.
4. The method for identifying batch-generated character strings according to claim 3, characterized in that the type of the sample character string data is the same as the type of the batch-generated character string to be identified.
5. The method for identifying batch-generated character strings according to claim 2, characterized in that determining the degree of randomness of the character string to be identified according to the probability of occurrence of the substring comprises:
determining the probability of occurrence of the character string to be identified according to the probability of occurrence of the substring;
determining the degree of randomness of the character string to be identified according to the probability of occurrence of the character string to be identified.
6. The method for identifying batch-generated character strings according to claim 5, characterized in that determining the probability of occurrence of the character string to be identified according to the probability of occurrence of the substring comprises:
where the probabilities that the substrings of the character string to be identified occur individually are obtained, taking the geometric mean of the probabilities that the substrings occur individually as the probability P of occurrence of the character string to be identified; or
where the probabilities that at least two adjacent substrings of the character string to be identified occur together are obtained, taking the geometric mean of the probabilities that the at least two adjacent substrings occur together as the probability P of occurrence of the character string to be identified; or
where both the probabilities that the substrings of the character string to be identified occur individually and the probabilities that at least two adjacent substrings of the character string to be identified occur together are obtained, taking the arithmetic mean of the geometric mean of the probabilities that the substrings occur individually and the geometric mean of the probabilities that the at least two adjacent substrings occur together as the probability P of occurrence of the character string to be identified.
7. The method for identifying batch-generated character strings according to claim 6, characterized in that determining the degree of randomness of the character string to be identified according to the probability of occurrence of the character string to be identified comprises: determining the degree of randomness R of the character string to be identified as R = 1 - P, where P is the probability of occurrence of the character string to be identified.
8. The method for identifying batch-generated character strings according to claim 7, characterized in that judging, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string comprises:
where the degree of randomness R of the character string to be identified is greater than a preset random threshold, determining that the character string to be identified is a randomly generated character string.
9. The method for identifying batch-generated character strings according to claim 8, characterized in that
the preset random threshold = 1 - a preset probability threshold;
wherein the preset probability threshold is the median of the probabilities that the plurality of sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of the median of the probabilities that the plurality of sample substrings in the probability dictionary occur individually and the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together.
10. The method for identifying batch-generated character strings according to claim 8, characterized in that the method further comprises:
where the character string to be identified is determined to be a randomly generated character string, applying focused prevention and control to the randomly generated character string;
wherein the focused prevention and control comprises at least one of restricting permissions, strengthening verification, and/or prohibiting login.
11. A device for identifying batch-generated character strings, characterized in that the device comprises: a receiving module, a segmentation module, a determining module, and a judgment module;
the receiving module is configured to receive a batch-generated character string to be identified;
the segmentation module is configured to segment the character string to be identified to obtain at least one substring of the character string to be identified;
the determining module is configured to determine the probability of occurrence of the at least one substring of the character string to be identified, and to determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring;
the judgment module is configured to judge, according to the degree of randomness of the character string to be identified, whether the character string to be identified is a randomly generated character string.
12. The device for identifying batch-generated character strings according to claim 11, characterized in that the determining module is specifically configured to: match, using a probability dictionary, the probability of occurrence of the substring of the character string to be identified, wherein the probability dictionary comprises a correspondence between sample substrings and the probabilities of the sample substrings; and determine the degree of randomness of the character string to be identified according to the probability of occurrence of the substring.
13. The device for identifying batch-generated character strings according to claim 12, characterized in that the device further comprises a probability dictionary obtaining module configured to: segment sample character string data to obtain a plurality of sample substrings; count the number of times each of the plurality of sample substrings occurs individually and/or the number of times at least two adjacent sample substrings occur together; and calculate the probability that each of the plurality of sample substrings occurs individually and/or the probability that the at least two adjacent sample substrings occur together, to obtain the probability dictionary; wherein the probability dictionary comprises the plurality of sample substrings and the probabilities that they occur individually, and/or the at least two adjacent sample substrings and the probabilities that they occur together.
14. The device for identifying batch-generated character strings according to claim 13, characterized in that the type of the sample character string data is the same as the type of the batch-generated character string to be identified.
15. The device for identifying batch-generated character strings according to claim 12, characterized in that the determining module is further specifically configured to: determine the probability of occurrence of the character string to be identified according to the probability of occurrence of the substring; and determine the degree of randomness of the character string to be identified according to the probability of occurrence of the character string to be identified.
16. The device for identifying batch-generated character strings according to claim 15, characterized in that the determining module is further specifically configured to: where the probabilities that the substrings of the character string to be identified occur individually are obtained, take the geometric mean of the probabilities that the substrings occur individually as the probability P of occurrence of the character string to be identified; or, where the probabilities that at least two adjacent substrings of the character string to be identified occur together are obtained, take the geometric mean of the probabilities that the at least two adjacent substrings occur together as the probability P of occurrence of the character string to be identified; or, where both the probabilities that the substrings of the character string to be identified occur individually and the probabilities that at least two adjacent substrings of the character string to be identified occur together are obtained, take the arithmetic mean of the geometric mean of the probabilities that the substrings occur individually and the geometric mean of the probabilities that the at least two adjacent substrings occur together as the probability P of occurrence of the character string to be identified.
17. The device for identifying batch-generated character strings according to claim 16, characterized in that the determining module is further specifically configured to determine the degree of randomness R of the character string to be identified as R = 1 - P, where P is the probability of occurrence of the character string to be identified.
18. The device for identifying batch-generated character strings according to claim 17, characterized in that the judgment module is specifically configured to determine, where the degree of randomness R of the character string to be identified is greater than a preset random threshold, that the character string to be identified is a randomly generated character string.
19. The device for identifying batch-generated character strings according to claim 18, characterized in that the preset random threshold = 1 - a preset probability threshold; wherein the preset probability threshold is the median of the probabilities that the plurality of sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of the median of the probabilities that the plurality of sample substrings in the probability dictionary occur individually and the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together.
20. The device for identifying batch-generated character strings according to claim 18, characterized in that the device further comprises a focused prevention and control module configured to apply, where the character string to be identified is determined to be a randomly generated character string, focused prevention and control to the randomly generated character string; wherein the focused prevention and control comprises at least one of restricting permissions, strengthening verification, and/or prohibiting login.
21. Equipment for identifying batch-generated character strings, comprising a memory and a processor, wherein the memory stores a program configured to be executed by the processor to perform the method for identifying batch-generated character strings according to any one of claims 1-10.
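
For illustration only, the flow recited in claims 1 to 9 (probability dictionary, probability P, degree of randomness R = 1 - P, and median-based threshold) might be sketched in Python as follows. This is a rough sketch and not the claimed implementation. The claims do not fix the segmentation, the probability estimate, or the handling of unseen substrings, so single-character segmentation, relative-frequency probabilities, the `floor` value, and every function name below are assumptions.

```python
import math
from collections import Counter
from statistics import median


def split_chars(s):
    # Assumption: each character is treated as one substring.
    return list(s)


def build_probability_dictionary(samples):
    """Count single substrings and adjacent pairs, then normalize to relative frequencies."""
    single_counts, pair_counts = Counter(), Counter()
    for s in samples:
        parts = split_chars(s)
        single_counts.update(parts)
        pair_counts.update(zip(parts, parts[1:]))
    single_total = sum(single_counts.values())
    pair_total = sum(pair_counts.values())
    single = {k: v / single_total for k, v in single_counts.items()}
    pair = {k: v / pair_total for k, v in pair_counts.items()}
    return single, pair


def geometric_mean(probs):
    # Computed in log space to avoid underflow on long strings.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def string_probability(s, single, pair, floor=1e-6):
    """P: arithmetic mean of the two geometric means (third option of claim 6).

    Assumption: substrings missing from the dictionary get probability `floor`.
    """
    parts = split_chars(s)
    if not parts:
        raise ValueError("string to be identified must not be empty")
    single_gm = geometric_mean([single.get(p, floor) for p in parts])
    pairs = list(zip(parts, parts[1:]))
    if not pairs:  # one-character string: only the single-substring mean exists
        return single_gm
    pair_gm = geometric_mean([pair.get(p, floor) for p in pairs])
    return (single_gm + pair_gm) / 2


def randomness(s, single, pair):
    # Degree of randomness R = 1 - P (claim 7).
    return 1.0 - string_probability(s, single, pair)


def random_threshold(single, pair):
    # Preset random threshold = 1 - preset probability threshold (claim 9),
    # here the arithmetic mean of the two medians.
    return 1.0 - (median(single.values()) + median(pair.values())) / 2


def is_random(s, single, pair):
    # Claim 8: random if R exceeds the preset random threshold.
    return randomness(s, single, pair) > random_threshold(single, pair)


if __name__ == "__main__":
    # Toy sample data, chosen only for demonstration.
    single, pair = build_probability_dictionary(["alice", "alina", "alicia", "malik"])
    for candidate in ("ali", "xqz"):
        print(candidate, round(randomness(candidate, single, pair), 4),
              is_random(candidate, single, pair))
```

With relative-frequency probabilities, P is small for every string, so R is close to 1 and only the comparison against the threshold is meaningful. A real deployment would presumably build the dictionary from sample data of the same type as the strings to be identified, as claim 4 requires.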
CN201811074092.2A 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch Active CN109359274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811074092.2A CN109359274B (en) 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch

Publications (2)

Publication Number Publication Date
CN109359274A (en) 2019-02-19
CN109359274B (en) 2023-05-02

Family

ID=65350758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811074092.2A Active CN109359274B (en) 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch

Country Status (1)

Country Link
CN (1) CN109359274B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765973A (en) * 2019-10-31 2020-02-07 上海掌门科技有限公司 Account type identification method and device
US20230004645A1 (en) * 2019-11-28 2023-01-05 Nippon Telegraph And Telephone Corporation Labeling device and labeling program

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10207987A (en) * 1997-01-28 1998-08-07 Nec Telecom Syst Ltd Hand-written character recognition device
JPH11175664A (en) * 1997-12-08 1999-07-02 Fujitsu Ltd Device and method for recognizing character and program-storing medium
JP2010170252A (en) * 2009-01-21 2010-08-05 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for creating language model
CN103077389A (en) * 2013-01-07 2013-05-01 华中科技大学 Text detection and recognition method combining character level classification and character string level classification
US20130202208A1 (en) * 2012-02-06 2013-08-08 Casio Computer Co., Ltd. Information processing device and information processing method
CN104462058A (en) * 2014-10-24 2015-03-25 腾讯科技(深圳)有限公司 Character string identification method and device
CN104750666A (en) * 2015-03-12 2015-07-01 明博教育科技有限公司 Text character encoding mode identification method and system
CN106033416A (en) * 2015-03-09 2016-10-19 阿里巴巴集团控股有限公司 A string processing method and device
CN106899411A (en) * 2016-12-08 2017-06-27 阿里巴巴集团控股有限公司 A kind of method of calibration and device based on identifying code
CN107679401A (en) * 2017-09-04 2018-02-09 北京知道未来信息技术有限公司 A kind of malicious web pages recognition methods and device
CN108288078A (en) * 2017-12-07 2018-07-17 腾讯科技(深圳)有限公司 Character identifying method, device and medium in a kind of image
CN108470126A (en) * 2018-03-19 2018-08-31 腾讯科技(深圳)有限公司 Data processing method, device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765973A (en) * 2019-10-31 2020-02-07 上海掌门科技有限公司 Account type identification method and device
CN110765973B (en) * 2019-10-31 2023-07-04 上海掌门科技有限公司 Account type identification method and device
US20230004645A1 (en) * 2019-11-28 2023-01-05 Nippon Telegraph And Telephone Corporation Labeling device and labeling program

Also Published As

Publication number Publication date
CN109359274B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN109597983B (en) Spelling error correction method and device
CN103336766A (en) Short text garbage identification and modeling method and device
CN105740667A (en) User behavior based information identification method and apparatus
CN104951542A (en) Method and device for recognizing class of social contact short texts and method and device for training classification models
CN106610931B (en) Topic name extraction method and device
CN111309910A (en) Text information mining method and device
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN110533018A (en) A kind of classification method and device of image
CN111753290A (en) Software type detection method and related equipment
CN110138794A (en) A kind of counterfeit website identification method, device, equipment and readable storage medium storing program for executing
CN104778283A (en) User occupation classification method and system based on microblog
CN110913354A (en) Short message classification method and device and electronic equipment
CN110705250A (en) Method and system for identifying target content in chat records
WO2014171925A1 (en) Event summarization
CN110263817B (en) Risk grade classification method and device based on user account
CN109359274A (en) The method, device and equipment that the character string of a kind of pair of Mass production is identified
CN107861945A (en) Finance data analysis method, application server and computer-readable recording medium
CN110532773B (en) Malicious access behavior identification method, data processing method, device and equipment
CN114024761A (en) Network threat data detection method and device, storage medium and electronic equipment
CN105808602B (en) Method and device for detecting junk information
Alneyadi et al. A semantics-aware classification approach for data leakage prevention
CN105677677A (en) Information classification and device
CN112016317A (en) Sensitive word recognition method and device based on artificial intelligence and computer equipment
CN111488452A (en) Webpage tampering detection method, detection system and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20230307

Address after: 801-10, Section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant after: Ant financial (Hangzhou) Network Technology Co.,Ltd.

Address before: 27 Hospital Road, George Town, Grand Cayman ky1-9008

Applicant before: Innovative advanced technology Co.,Ltd.

GR01 Patent grant