CN108960952A - Method and device for detecting prohibited information - Google Patents

Method and device for detecting prohibited information

Info

Publication number
CN108960952A
Authority
CN
China
Prior art keywords
participle
detected
information
class description
violated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710375296.9A
Other languages
Chinese (zh)
Inventor
李大霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710375296.9A priority Critical patent/CN108960952A/en
Publication of CN108960952A publication Critical patent/CN108960952A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0607 Regulated

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to Internet technology, and in particular to a method and device for detecting prohibited information, intended to improve the accuracy and efficiency of prohibited-information detection. The method is as follows: by means of a word segmentation operation, a set of segments to be detected is extracted from at least two classes of description information corresponding to a product to be detected; then, based on the number of occurrences of at least one segment to be detected, when the probability that the at least two classes of description information contain prohibited information is determined to reach a set threshold, it is determined that the at least two classes of description information contain prohibited information. In this way, the segments to be detected can be accurately reconstructed, which effectively prevents a product provider from splitting prohibited information and hiding the pieces in different places, and greatly improves detection accuracy; at the same time, detection complexity and the amount of computation are effectively reduced, which significantly improves detection efficiency.

Description

Method and device for detecting prohibited information
Technical field
The present application relates to Internet technology, and in particular to a method and device for detecting prohibited information.
Background
With the development of e-commerce technology, the goods offered on e-commerce websites have become increasingly abundant, and users have correspondingly become accustomed to searching for and buying the products they need on e-commerce websites.
However, in order to earn more profit, some suppliers take the risk of selling contraband on e-commerce websites, e.g., guns, pharmaceuticals, and narcotics. Meanwhile, to evade the contraband search function of e-commerce websites, suppliers often describe products in an obscure manner in the product title, e.g., describing a gun as a "toy rifle" or a drug as a "treatment product".
To address this, in the related art, a website server usually performs contraband detection by keyword matching. However, both the accuracy and the efficiency of keyword matching are very low. Suppliers can change their descriptive vocabulary at any time, so a large amount of manpower must be spent on keyword extraction and screening, which greatly reduces detection efficiency; and because keyword extraction and screening inevitably lag behind, contraband often cannot be detected accurately and in time.
Summary of the invention
The purpose of the present application is to provide a method for detecting prohibited information, so as to improve the accuracy and efficiency of prohibited-information detection.
The technical solutions provided by the embodiments of the present application are as follows:
A method for detecting prohibited information, comprising:
determining a product to be detected, and obtaining at least two classes of description information set for the product;
performing a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a set of segments to be detected;
calculating, based on the number of occurrences of at least one segment to be detected contained in the set of segments to be detected, the probability that the set of segments to be detected contains prohibited information;
when the probability is determined to reach a set threshold, determining that the at least two classes of description information contain prohibited information.
Optionally, after the at least two classes of description information are obtained and before segment division is performed, the method further comprises:
performing text preprocessing on the obtained at least two classes of description information based on preset text conversion conditions.
Optionally, the text preprocessing comprises any one or any combination of the following operations:
converting all uppercase letters in the obtained at least two classes of description information to lowercase;
converting all full-width characters in the obtained at least two classes of description information to half-width characters;
converting all traditional Chinese text in the obtained at least two classes of description information to simplified Chinese text;
removing specified special characters from the obtained at least two classes of description information;
removing specified words from the obtained at least two classes of description information.
Optionally, performing a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain a set of segments to be detected comprises:
performing segment division, at preset different granularities, on the text contained in the obtained at least two classes of description information to obtain a divided-segment set, wherein the granularity characterizes the sentence-breaking manner used in segment division;
combining, in any manner, the segments obtained by division from the at least two classes of description information to obtain a combined-segment set;
merging the divided-segment set and the combined-segment set to obtain the set of segments to be detected.
Optionally, calculating, based on the number of occurrences of at least one segment to be detected contained in the set of segments to be detected, the probability that the set of segments to be detected contains prohibited information comprises:
selecting, from the set of segments to be detected, the segments to be used, and determining the number of occurrences of each segment used;
determining the preset weight corresponding to each segment used, the weight being obtained through training on samples of segments to be detected;
calculating, based on the number of occurrences and the corresponding weight of each segment used, the probability that the set of segments to be detected contains prohibited information.
Optionally, before the segments to be used are selected from the set of segments to be detected, the method further comprises:
counting the number of occurrences of each segment to be detected contained in the set of segments to be detected;
screening out the N segments to be detected with the most occurrences, N being a preset parameter.
Optionally, determining the preset weight corresponding to each segment used comprises: determining the preset weight of each segment used for at least one category;
calculating, based on the number of occurrences and the corresponding weight of each segment used, the probability that the set of segments to be detected contains prohibited information comprises: calculating, for the at least one category, the probability that the set of segments to be detected contains prohibited information, based on the number of occurrences of each segment used and its weight for the at least one category.
Optionally, determining, when the probability is determined to reach the set threshold, that the at least two classes of description information contain prohibited information comprises:
if the probability that the set of segments to be detected contains prohibited information is calculated for only one category, then, when the probability is determined to reach the set threshold, determining that the at least two classes of description information contain prohibited information of that category;
if the probability that the set of segments to be detected contains prohibited information is calculated separately for at least two categories, then, when the largest probability is determined to reach the set threshold, determining that the at least two classes of description information contain prohibited information of the category corresponding to the largest probability.
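The decision rule above can be sketched in a few lines. This is a minimal illustration only: the function name, the category labels, and the threshold value are hypothetical, and "reaches the set threshold" is read here as "greater than or equal to".

```python
from typing import Dict, Optional


def decide_prohibited_category(probabilities: Dict[str, float],
                               threshold: float) -> Optional[str]:
    """Return the category judged to contain prohibited information, or None.

    `probabilities` maps a category name to the probability computed for it.
    With a single category this checks that one probability; with several,
    only the largest probability is compared against the threshold.
    """
    if not probabilities:
        return None
    category, best = max(probabilities.items(), key=lambda item: item[1])
    return category if best >= threshold else None


# Hypothetical probabilities for two categories.
print(decide_prohibited_category({"weapons": 0.35, "narcotics": 0.91}, 0.8))
# Hypothetical single-category case that stays below the threshold.
print(decide_prohibited_category({"weapons": 0.35}, 0.8))
```

Comparing only the maximum keeps the multi-category case to a single threshold test, which matches the rule as stated; per-category thresholds would be a different design that the text does not describe.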
An information detection method, comprising:
obtaining at least two classes of description information corresponding to a preset product;
obtaining a segment set corresponding to the at least two classes of description information;
judging, based on the number of occurrences and the corresponding category of at least one segment contained in the segment set, whether the segment set contains target information.
Optionally, the at least two classes of description information comprise any two of the following for the preset product: product name information, product classification information, and product detail information.
Optionally, judging, based on the number of occurrences and the corresponding category of at least one segment contained in the segment set, whether the segment set contains target information comprises:
calculating the probability that the set of segments to be detected contains prohibited information, based on the number of occurrences of the segment to be detected and the weight corresponding to the class of description information to which it belongs.
A storage medium storing a program for implementing the method for detecting prohibited information, wherein the program, when run by a processor, performs the following steps:
determining a product to be detected, and obtaining at least two classes of description information set for the product;
performing a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a set of segments to be detected;
calculating, based on the number of occurrences of at least one segment to be detected contained in the set of segments to be detected, the probability that the set of segments to be detected contains prohibited information;
when the probability is determined to reach a set threshold, determining that the at least two classes of description information contain prohibited information.
A computer device, comprising one or more processors; and
one or more computer-readable media storing instructions which, when executed by the one or more processors, cause the computer device to perform the method of any of the above embodiments.
A device for detecting prohibited information, comprising:
an acquiring unit, configured to determine a product to be detected and obtain at least two classes of description information set for the product;
a processing unit, configured to perform a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a set of segments to be detected;
a computing unit, configured to calculate, based on the number of occurrences of at least one segment to be detected contained in the set of segments to be detected, the probability that the set of segments to be detected contains prohibited information;
a judging unit, configured to determine, when the probability is determined to reach a set threshold, that the at least two classes of description information contain prohibited information.
Optionally, after the at least two classes of description information are obtained and before segment division is performed, the processing unit is further configured to:
perform text preprocessing on the obtained at least two classes of description information based on preset text conversion conditions.
Optionally, the text preprocessing comprises any one or any combination of the following operations:
converting all uppercase letters in the obtained at least two classes of description information to lowercase;
converting all full-width characters in the obtained at least two classes of description information to half-width characters;
converting all traditional Chinese text in the obtained at least two classes of description information to simplified Chinese text;
removing specified special characters from the obtained at least two classes of description information;
removing specified words from the obtained at least two classes of description information.
Optionally, when performing a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain a set of segments to be detected, the processing unit is configured to:
perform segment division, at preset different granularities, on the text contained in the obtained at least two classes of description information to obtain a divided-segment set, wherein the granularity characterizes the sentence-breaking manner used in segment division;
combine, in any manner, the segments obtained by division from the at least two classes of description information to obtain a combined-segment set;
merge the divided-segment set and the combined-segment set to obtain the set of segments to be detected.
Optionally, when calculating, based on the number of occurrences of at least one segment to be detected contained in the set of segments to be detected, the probability that the set of segments to be detected contains prohibited information, the computing unit is configured to:
select, from the set of segments to be detected, the segments to be used, and determine the number of occurrences of each segment used;
determine the preset weight corresponding to each segment used, the weight being obtained through training on samples of segments to be detected;
calculate, based on the number of occurrences and the corresponding weight of each segment used, the probability that the set of segments to be detected contains prohibited information.
Optionally, before the segments to be used are selected from the set of segments to be detected, the computing unit is further configured to:
count the number of occurrences of each segment to be detected contained in the set of segments to be detected;
screen out the N segments to be detected with the most occurrences, N being a preset parameter.
Optionally, when determining the preset weight corresponding to each segment used, the computing unit is configured to: determine the preset weight of each segment used for at least one category;
when calculating, based on the number of occurrences and the corresponding weight of each segment used, the probability that the set of segments to be detected contains prohibited information, the computing unit is configured to: calculate, for the at least one category, the probability that the set of segments to be detected contains prohibited information, based on the number of occurrences of each segment used and its weight for the at least one category.
Optionally, when determining, upon the probability reaching the set threshold, that the at least two classes of description information contain prohibited information, the judging unit is configured to:
if the probability that the set of segments to be detected contains prohibited information is calculated for only one category, then, when the probability is determined to reach the set threshold, determine that the at least two classes of description information contain prohibited information of that category;
if the probability that the set of segments to be detected contains prohibited information is calculated separately for at least two categories, then, when the largest probability is determined to reach the set threshold, determine that the at least two classes of description information contain prohibited information of the category corresponding to the largest probability.
In the embodiment of the present application, a word segmentation operation is used to extract a set of segments to be detected from at least two classes of description information corresponding to a product to be detected; then, based on the number of occurrences of at least one segment to be detected, when the probability that the at least two classes of description information contain prohibited information is determined to reach a set threshold, it is determined that the at least two classes of description information contain prohibited information. In this way, by recombining segments from different classes of description information, the segments to be detected can be accurately reconstructed, which effectively prevents a product provider from splitting prohibited information and hiding the pieces in different places, and greatly improves detection accuracy. At the same time, calculating the probability of prohibited information from the number of occurrences of the segments to be detected across the different description information effectively reduces detection complexity and the amount of computation, which significantly improves detection efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of prohibited-information detection in the embodiment of the present application;
Fig. 2 is a schematic diagram of the distribution of description information in the embodiment of the present application;
Fig. 3 is a schematic structural diagram of the device for detecting prohibited information in the embodiment of the present application.
Detailed description
To improve the accuracy and efficiency of prohibited-information detection, in the embodiment of the present application, a word segmentation operation is used to extract a set of segments to be detected from at least two classes of description information corresponding to a product to be detected, and then, based on the number of occurrences of at least one segment to be detected, the probability that the at least two classes of description information contain prohibited information is determined.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application, without creative effort, fall within the protection scope of the present application.
Referring to Fig. 1, the detailed flow of detecting prohibited information in the embodiment of the present application is as follows:
Step 100: Determine the product to be detected, and obtain at least two classes of description information set for the product.
In practical applications, the description information of a product refers to information describing the product's name, form, model, function, application environment, and so on. In the embodiment of the present application, the description information of a product includes at least the product title and the detailed product introduction. Typically, the product title is placed near the product picture, and the detailed product introduction is placed in the blank area of the web page below the product picture, as shown in Fig. 2.
Further, the description information of a product may also include the product classification and the product provider's identity information, as also shown in Fig. 2.
Because the product title is the place most easily inspected, a product provider may hide prohibited information in other positions to escape detection. Therefore, in the embodiment of the present application, at least two classes of description information of the product need to be examined together to improve detection accuracy.
Step 110: Based on preset text conversion conditions, perform text preprocessing on the obtained at least two classes of description information.
In the embodiment of the present application, text preprocessing is needed because, to evade prohibited-information detection, a product provider may disguise the prohibited information contained in the various classes of description information, e.g., by mixing uppercase and lowercase letters, mixing full-width and half-width characters, mixing traditional and simplified characters, or inserting special characters or stop words to break up the prohibited information.
Accordingly, in the embodiment of the present application, when text preprocessing is performed on the obtained at least two classes of description information according to the preset text conversion conditions, the following modes may be used, without limitation:
Mode 1: Convert all uppercase letters in the obtained at least two classes of description information to lowercase.
Specifically, letter encoding is very regular: the distance between the uppercase and lowercase codes of a letter is the same for every letter. For example, for any given letter, the ASCII value of the lowercase form is 32 greater than that of the uppercase form, so case conversion can be performed according to this rule.
For example, "AbCdEFG" is converted into "abcdefg".
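The ASCII offset rule of Mode 1 can be sketched as follows. The function name is hypothetical, and the sketch deliberately touches only the ASCII letters A to Z, since the passage describes only the ASCII case.

```python
def to_lowercase_ascii(text: str) -> str:
    """Lowercase ASCII letters by adding the fixed offset of 32 to their code."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(ord(ch) + 32))  # 'A' (65) -> 'a' (97)
        else:
            out.append(ch)  # leave every other character unchanged
    return "".join(out)


print(to_lowercase_ascii("AbCdEFG"))
```

In practice Python's built-in `str.lower()` gives the same result for ASCII letters and also handles non-ASCII scripts; the explicit loop is shown only to make the offset rule visible.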
Mode 2: Convert all full-width characters in the obtained at least two classes of description information to half-width characters.
Specifically, full-width characters are defined as those with Unicode code points from 0xFF01 to 0xFF5E, and the corresponding half-width characters have Unicode code points from 0x21 to 0x7E. The space is a special case: the full-width space is 0x3000 and the half-width space is 0x20. Apart from the space, full-width and half-width characters correspond one to one in Unicode order (a constant offset of 0xFEE0), so the conversion can be performed according to this rule.
For example, "ｄｒｕｇｓ" written in full-width characters is converted into "drugs".
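A minimal sketch of the full-width to half-width rule of Mode 2, using the code-point ranges given above. The function name is hypothetical; the sample input is written with escape sequences so the full-width letters are unambiguous.

```python
def to_half_width(text: str) -> str:
    """Map full-width characters (U+FF01..U+FF5E, U+3000) to half-width ones."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # full-width space is the special case
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # constant offset to 0x21..0x7E
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


# "\uff44\uff52\uff55\uff47\uff53" is "drugs" written in full-width letters.
print(to_half_width("\uff44\uff52\uff55\uff47\uff53"))
```

The offset 0xFEE0 is simply 0xFF01 minus 0x21, so the same subtraction covers the whole range.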
Mode 3: Convert all traditional Chinese text in the obtained at least two classes of description information to simplified Chinese text.
Specifically, the conversion can be performed according to a preset mapping table between traditional and simplified characters.
For example, the traditional-character form of "stimulant" (興奮劑) is converted into its simplified-character form (兴奋剂).
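The table-driven conversion of Mode 3 can be sketched with a character mapping. The four-entry table below is a hypothetical stand-in for the preset mapping table, which would be far larger in practice; the function name is also illustrative.

```python
# Hypothetical excerpt of a traditional -> simplified mapping table.
TRADITIONAL_TO_SIMPLIFIED = {
    "興": "兴",
    "奮": "奋",
    "劑": "剂",
    "彈": "弹",
}

# str.maketrans accepts a dict whose keys are single characters.
_TRANSLATION = str.maketrans(TRADITIONAL_TO_SIMPLIFIED)


def to_simplified(text: str) -> str:
    """Replace each mapped traditional character with its simplified form."""
    return text.translate(_TRANSLATION)


print(to_simplified("興奮劑"))
```

A per-character table like this ignores the cases where one traditional character maps to different simplified characters depending on the word, which a production converter would need to handle.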
Mode 4: Remove all specified special characters from the obtained at least two classes of description information.
Specifically, a blacklist containing all special characters that need to be removed can be set up, and special characters are removed based on this blacklist.
For example, the word for "stimulant" with "...--" inserted between its characters (兴...--奋剂) is converted into the intact word (兴奋剂), and the word for "bullet" with "%$" inserted (子%$弹) is converted into the intact word (子弹).
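The blacklist removal of Mode 4 can be sketched as a simple character filter. The contents of the blacklist and the function name are assumptions chosen to cover the two examples; a real blacklist would be curated.

```python
# Hypothetical blacklist of special characters to strip.
SPECIAL_CHAR_BLACKLIST = frozenset(".-%$*#@! ")


def remove_special_chars(text: str, blacklist=SPECIAL_CHAR_BLACKLIST) -> str:
    """Drop every character that appears in the blacklist."""
    return "".join(ch for ch in text if ch not in blacklist)


print(remove_special_chars("兴...--奋剂"))
print(remove_special_chars("子%$弹"))
```

Because the filter is applied to the whole text, blacklisted characters are also removed where they are legitimate (for example a decimal point in a size), which is one reason the blacklist has to be chosen with care.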
Mode 5: Remove all specified words from the obtained at least two classes of description information.
The words to be removed are usually meaningless modal particles or conjunctions, which can be called stop words; e.g., common stop words include particles as well as words such as "regardless of" and "therefore". A stop-word list can be set in advance so that words can be compared against it and removed at any time.
For example, the word for "stimulant" with a particle inserted between its characters (兴了奋剂) is converted into the intact word (兴奋剂), and the word for "bullet" with "therefore" inserted (子因此弹) is converted into the intact word (子弹).
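The stop-word removal of Mode 5 can be sketched as substring deletion against a preset list. The three stop words below (a particle, "therefore", "regardless of") and the function name are assumptions for illustration.

```python
# Hypothetical stop-word list: a particle, "therefore", "regardless of".
STOP_WORDS = ("了", "因此", "不管")


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Delete every occurrence of each stop word, longest words first."""
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text


print(remove_stop_words("兴了奋剂"))
print(remove_stop_words("子因此弹"))
```

Plain substring deletion can also remove characters that belong to a legitimate word, so this sketch trades precision for recall, in line with the goal of rejoining deliberately split terms.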
Of course, when text preprocessing is performed, any one or any combination of the above five modes may be used. Alternatively, text preprocessing may also be implemented in other ways; the above five modes are only examples and are not described further here.
Of course, if it is determined that the characters or text contained in the obtained at least two classes of description information will not affect the final detection result, text preprocessing may be skipped; this embodiment is only an example and is not described further here.
Step 120: Perform a word segmentation operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a set of segments to be detected.
In practical applications, because a product provider may hide prohibited information across different classes of description information, simply dividing the description information into segments may fail to yield accurate segments to be detected. Therefore, in the embodiment of the present application, segments from different classes of description information need to be combined. In short, the word segmentation operation includes two kinds of operation: segment division and segment combination.
First, segment division is performed at different granularities on the text contained in the obtained at least two classes of description information, to obtain a divided-segment set.
Segment division at different granularities essentially means that one passage of text can be divided into different segments in different ways (e.g., with different pauses or different sentence breaks); that is, the same text can be used multiple times.
For example, if the detailed product introduction contains the text "gum mold tool", this text may be divided into "gum", "mold", and "tool", as well as "gum mold", "mold tool", and "gum mold tool". In this way, all possible segmentations are covered as far as possible, avoiding missed detections in subsequent processing.
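One way to realize division at different granularities is to enumerate every contiguous run of base tokens, as sketched below. This is an assumption about how the granularities might be produced, not the method fixed by the text: the sketch presumes the passage has already been split into base tokens by some segmenter, and the function name and `max_len` parameter are hypothetical.

```python
from typing import List


def multi_granularity_segments(tokens: List[str], max_len: int = 3,
                               sep: str = " ") -> List[str]:
    """Return all contiguous runs of 1..max_len base tokens, shortest first."""
    segments = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            segments.append(sep.join(tokens[start:start + n]))
    return segments


# Base tokens of the example passage (assumed to come from a segmenter).
print(multi_granularity_segments(["gum", "mold", "tool"]))
```

For Chinese text the separator would normally be the empty string; a space is used here only because the example is rendered in English.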
Second, the segments obtained by division from the at least two classes of description information (i.e., the segments contained in the divided-segment set) are combined in any manner (in pairs, in threes, and so on) to obtain a combined-segment set. Of course, identical segments that come from different classes of description information can be distinguished by a prefix.
For example, assume that the divided segments obtained from the product title are "gum" and "mold", and the divided segments obtained from the product provider's identity information are "lock" and "company". After the segments are combined, assume the resulting combined segments are "gum lock", "lock mold", "gum company", and "mold company". These combined segments form the combined-segment set, and the combined segment "lock mold" may relate to the manufacture of lock-opening tools and may therefore be prohibited information.
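The pairwise combination can be sketched as follows. Two assumptions are made for illustration: only pairs whose members come from different classes of description information are kept (the example above shows only such pairs), and both orderings of each pair are emitted. The function name and the class labels are hypothetical; each segment is tagged with its class, which plays the role of the distinguishing prefix.

```python
from itertools import permutations
from typing import Dict, List


def combine_across_classes(segments_by_class: Dict[str, List[str]],
                           sep: str = " ") -> List[str]:
    """Join every ordered pair of segments that come from different classes."""
    # Tag each segment with its source class so equal strings stay distinct.
    tagged = [(cls, seg) for cls, segs in segments_by_class.items() for seg in segs]
    combos = []
    for (cls_a, seg_a), (cls_b, seg_b) in permutations(tagged, 2):
        if cls_a != cls_b:
            combos.append(seg_a + sep + seg_b)
    return combos


combined = combine_across_classes({
    "title": ["gum", "mold"],
    "provider": ["lock", "company"],
})
print(combined)
```

Emitting both orderings lets "lock mold" appear even though its halves sit in different fields in the opposite order; the cost is that the number of combinations grows quadratically with the number of segments, and faster still for triples.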
Finally, the divided-segment set and the combined-segment set are merged to obtain the set of segments to be detected.
In the embodiment of the present application, both the divided-segment set and the combined-segment set contain important segments to be detected; to improve subsequent detection accuracy, the two sets need to be merged to obtain the set of segments to be detected.
Step 130: calculate the frequency of occurrence of each participle to be detected in the participle set to be detected, and select the N participles to be detected with the highest frequency of occurrence, where N is a preset parameter.
Assuming N=10, histogram statistics can be performed on the participles to be detected recorded in the participle set to be detected to determine the frequency of occurrence of each one. The participles to be detected are then sorted by frequency of occurrence in descending order, and finally the top 10 are selected for further processing.
Because the frequency of occurrence of a participle to be detected can represent its importance, when the number of participles to be detected is large, the participles with a higher frequency of occurrence need to be selected first to improve the efficiency of subsequent detection.
On the other hand, step 130 may be skipped if the number of participles to be detected is small enough that detecting all of them does not affect detection efficiency, or in order to prevent the product provider from hiding violated information in participles to be detected with a low frequency of occurrence. The embodiments of the present application are described only for the case where step 130 is executed.
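Step 130 can be sketched with the standard library as follows. The function name `top_n_participles` and the sample data are hypothetical; `collections.Counter` plays the role of the histogram statistics.

```python
from collections import Counter


def top_n_participles(participles, n=10):
    """Count each participle and keep the n most frequent ones.

    Returns (participle, frequency) pairs sorted by descending frequency.
    """
    counts = Counter(participles)
    return counts.most_common(n)


sample = ["firelock", "toy", "firelock", "children", "firelock", "toy"]
selected = top_n_participles(sample, n=2)
```

Setting `n` to the size of the vocabulary keeps every participle, which corresponds to skipping the screening.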
Step 140: perform text vectorization on all participles to be detected.
Specifically, text vectorization can be performed in an "index + index value" manner. For example, the vectorized participles to be detected can be expressed in the form "index1:value1 index2:value2 ...", where index is the unique serial number of a participle to be detected and value is its frequency of occurrence. If step 130 has been performed, a participle to be detected that was not selected can be treated as an invalid word and ignored.
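A minimal sketch of this "index:value" encoding is shown below. The function name `vectorize` and the vocabulary mapping are hypothetical; the sketch assumes that a participle missing from the vocabulary is an invalid word and is dropped.

```python
def vectorize(counts, vocabulary):
    """Encode participle frequencies as 'index1:value1 index2:value2 ...'.

    counts: dict mapping participle -> frequency of occurrence
    vocabulary: dict mapping participle -> unique serial number (assumed)
    """
    pairs = sorted(
        (vocabulary[word], freq)
        for word, freq in counts.items()
        if word in vocabulary  # unselected participles are ignored
    )
    return " ".join(f"{index}:{freq}" for index, freq in pairs)


vector = vectorize({"firelock": 4, "toy": 2, "unlisted": 9},
                   {"toy": 1, "firelock": 2})
```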
Step 150: based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, calculate the probability that the above at least two classes of description information contain violated information.
Specifically, the vectorized text can be input into a pre-trained model, so that the frequency of occurrence of each participle to be detected contributes to the final probability value.
Optionally, when calculating the probability of containing violated information, the participles to be detected that need to be used can first be determined. These may be the participles to be detected that have been selected as relevant to a set category (that is, at least one participle to be detected), or they may be all participles to be detected.
Next, the frequency of occurrence of each participle to be detected in use is determined.
Then, the preset weight corresponding to each participle to be detected in use is determined, where the weight is obtained through training and learning on samples of participles to be detected.
Finally, based on the frequency of occurrence and the corresponding weight of each participle to be detected in use, the probability that the obtained at least two classes of description information contain violated information is calculated.
For example, the above probability can be calculated using, but not limited to, the following formula.
$$P = W_1 N_1 + W_2 N_2 + \dots + W_m N_m \qquad \text{(formula one)}$$
where $P$ denotes the probability, $W_i$ denotes the weight of the $i$-th participle to be detected in use, $N_i$ denotes its frequency of occurrence, and $m$ is the number of participles to be detected in use.
Under normal conditions, there is an association between a weight and a participle to be detected, that is, the relationship between $W_i$ and $N_i$ is relatively stable, and the weight corresponding to a participle to be detected is obtained through iterative training.
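The weighted sum of formula one can be sketched as follows. The function name `formula_one` and the sample frequencies and weights are hypothetical, and treating a participle without a learned weight as having weight 0 is an assumption. The result is a raw score; the text does not specify how it is normalized into a probability.

```python
def formula_one(frequencies, weights):
    """Sum weight * frequency over the participles to be detected in use.

    frequencies: dict mapping participle -> frequency of occurrence
    weights: dict mapping participle -> learned weight
    Assumption: a participle with no learned weight contributes 0.
    """
    return sum(weights.get(word, 0.0) * freq
               for word, freq in frequencies.items())


score = formula_one({"firelock": 2, "toy": 1},
                    {"firelock": 0.9, "toy": 0.1})
```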
Specifically, during sample training, the vectorized participle set to be detected used in each round of iteration can be mapped to a point in a multidimensional space. Training terminates when the sum of the distances from the obtained points to a target line or target plane is shortest, and the parameters in the expression of the target line or target plane obtained at that time are the configured values of the weights.
For ease of description, take a 2-dimensional coordinate system as an example. Assume that the vectorized participle set to be detected X1 is: 1:2 2:4, where in each group the number before ":" denotes the dimension and the number after ":" denotes the coordinate value. Then "1:2 2:4" denotes the point whose abscissa is 2 and whose ordinate is 4, that is, (2,4).
(2,4) is mapped into the 2-dimensional coordinate system, and the initial expression of the target line X is determined to be y=wx+b (assuming that the initial values of w and b are preset values). The distance from (2,4) to the target line X is then calculated as L1.
Next, the subsequent participle set to be detected X2 serving as a sample is obtained. Assuming that the point obtained by mapping in the same way is (3,5), the expression of the line that minimizes the sum of the distances from (2,4) and (3,5) is calculated, and the initial expression of the target line X is adjusted accordingly to obtain the updated expression y=w'x+b'.
By analogy, the expression of the target line X can continue to be iteratively updated using the participle sets to be detected X3, X4, ..., until the difference between the parameter values of the updated expression and those of the expression before the update falls below a set threshold.
At the end of the iteration, the latest value of w is obtained, and the preset participle category corresponding to w is determined. The latest value of w is set as the weight of the participle to be detected of that participle category, and because the weights of the participles to be detected sum to 1, 1-w is the weight of the other participle to be detected.
The above weight setting process takes only two participles to be detected as an example. If a participle set to be detected serving as a sample contains more than two participles to be detected, the weights of the participles to be detected of multiple different participle categories can be obtained after the set is mapped to a multidimensional coordinate system and iterated; details are not repeated here.
Correspondingly, the value of each weight in formula one can be set in the above manner and can also be updated periodically through learning.
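The iterative update of the target line can be illustrated with the sketch below. It is not the training procedure of the application: it swaps in an ordinary least-squares fit (vertical residuals) for the shortest-distance criterion described in the text, refits the line each time a new sample point arrives, and stops once the parameters change by less than a threshold. The names `fit_line` and `train`, the threshold value, and the sample points after (2,4) and (3,5) are all hypothetical.

```python
def fit_line(points):
    """Ordinary least-squares line y = w*x + b through the given points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two points with distinct x values")
    w = (n * sxy - sx * sy) / denom
    b = (sy - w * sx) / n
    return w, b


def train(samples, threshold=1e-6):
    """Refit after each new sample; stop when w and b stop changing."""
    w, b = fit_line(samples[:2])
    used = 2
    for k in range(3, len(samples) + 1):
        new_w, new_b = fit_line(samples[:k])
        change = abs(new_w - w) + abs(new_b - b)
        w, b, used = new_w, new_b, k
        if change < threshold:
            break
    return w, b, used


# (2,4) and (3,5) come from the text; the remaining points are made up.
w, b, used = train([(2, 4), (3, 5), (4, 6), (5, 7)])
```

Because all four sample points lie on one line, the fit stabilizes as soon as the third point is added and the fourth is never consulted.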
Step 160: when it is determined that the calculated probability reaches the set threshold, determine that the above at least two classes of description information contain violated information.
Specifically, when formula one is used to calculate the probability of containing violated information, the weight of a particular participle to be detected may significantly raise the value of the probability.
For example, assume that the participle set to be detected is {toy occurs, firelock, pour mask, }, that the respective frequencies of occurrence are 6, 7, 5 and 4, and that the respective weights are 0.3, 0.2, 0.9 and 0.5. After calculation using formula one and normalization, assume that the obtained probability of containing violated information is 97%, which is greater than the preset threshold (assumed to be 80%). It can then be determined that violated information has been detected in the at least two classes of description information corresponding to the current product to be detected. In addition, based on the preset categories, "firelock" has the highest probability of belonging to the gun class, so it can be determined that the violated information belongs to the "gun class".
In the above embodiment, the description assumes that each participle has only one weight.
In actual training, optionally, each participle in the trained model may belong to different categories, so a participle can have a corresponding weight under each category. In this way, after the vectorized participles to be detected are input into the trained model, a corresponding probability can be output for each category. In short, the weight of at least one preset category corresponding to each participle to be detected in use can be determined, and then, based on the frequency of occurrence of the participles to be detected in use and the weight of the at least one category, the probability that the participle set to be detected contains violated information is calculated for the at least one category.
Thus, the larger the probability value of the participle set to be detected for a category, the more likely it is to contain violated information under that category.
For example, as shown in Table 1, the training samples contain 11 participles in total: China, cigarette, copybook, I, love, cross embroidery, tree peony, firelock, toy, lint and children. There are 3 categories to be classified: tobacco class, gun class and normal class. After learning, the weights of each participle in the trained model under each category are as follows:
Table 1
| Participle | Tobacco class | Gun class | Normal class |
| --- | --- | --- | --- |
| China | 0.4 | 0.1 | 0.5 |
| Cigarette | 0.9 | 0.04 | 0.06 |
| Copybook | 0.1 | 0.2 | 0.7 |
| I | | | |
| Love | | | |
| Cross embroidery | | | |
| Tree peony | | | |
| Firelock | 0.03 | 0.9 | 0.07 |
| Toy | | | |
| Lint | | | |
| Children | | | |
Assume that the subsequently obtained test case is "sell Chinese board cigarette". The participles to be detected obtained by dividing the test case are: sale, China, board, cigarette. Then, following the method introduced in steps 100 to 160, the score of the test case (that is, the participle set to be detected) for each category is calculated as follows:
$$\text{Score}(\text{tobacco}) = w_{\text{tobacco}}(\text{sale})\,x_{\text{tobacco}}(\text{sale}) + w_{\text{tobacco}}(\text{China})\,x_{\text{tobacco}}(\text{China}) + w_{\text{tobacco}}(\text{board})\,x_{\text{tobacco}}(\text{board}) + w_{\text{tobacco}}(\text{cigarette})\,x_{\text{tobacco}}(\text{cigarette}) = 0 \times 0 + 0.4 \times 1 + 0 \times 0 + 0.9 \times 1 = 1.3$$
$$\text{Score}(\text{gun}) = w_{\text{gun}}(\text{sale})\,x_{\text{gun}}(\text{sale}) + w_{\text{gun}}(\text{China})\,x_{\text{gun}}(\text{China}) + w_{\text{gun}}(\text{board})\,x_{\text{gun}}(\text{board}) + w_{\text{gun}}(\text{cigarette})\,x_{\text{gun}}(\text{cigarette}) = 0 \times 0 + 0.1 \times 1 + 0 \times 0 + 0.04 \times 1 = 0.14$$
$$\text{Score}(\text{normal}) = w_{\text{normal}}(\text{sale})\,x_{\text{normal}}(\text{sale}) + w_{\text{normal}}(\text{China})\,x_{\text{normal}}(\text{China}) + w_{\text{normal}}(\text{board})\,x_{\text{normal}}(\text{board}) + w_{\text{normal}}(\text{cigarette})\,x_{\text{normal}}(\text{cigarette}) = 0 \times 0 + 0.5 \times 1 + 0 \times 0 + 0.06 \times 1 = 0.56$$
Finally, the 3 scores are normalized to the interval [0,1] to obtain the probability of belonging to each of the 3 categories:
Prob(tobacco class) = 1.3 / (1.3 + 0.14 + 0.56) = 0.65;
Prob(gun class) = 0.14 / (1.3 + 0.14 + 0.56) = 0.07;
Prob(normal class) = 0.56 / (1.3 + 0.14 + 0.56) = 0.28;
It can be seen that the tobacco class has the highest probability and reaches the set threshold (0.5), so the test case is determined to contain violated vocabulary of the tobacco class.
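The worked example above can be reproduced with the following sketch. Only the four rows of Table 1 that carry weights are included; the names `WEIGHTS`, `category_probabilities` and `detect` are hypothetical. Two assumptions are made: a participle without a learned weight contributes 0 to every category, and a best match in the normal class is not reported as a violation.

```python
# Per-category weights taken from the filled-in rows of Table 1.
WEIGHTS = {
    "China":     {"tobacco": 0.4,  "gun": 0.1,  "normal": 0.5},
    "cigarette": {"tobacco": 0.9,  "gun": 0.04, "normal": 0.06},
    "copybook":  {"tobacco": 0.1,  "gun": 0.2,  "normal": 0.7},
    "firelock":  {"tobacco": 0.03, "gun": 0.9,  "normal": 0.07},
}
CATEGORIES = ("tobacco", "gun", "normal")


def category_probabilities(participles):
    """Score each category, then normalize the scores so they sum to 1."""
    scores = {c: 0.0 for c in CATEGORIES}
    for word in participles:
        for c in CATEGORIES:
            # Each occurrence adds the weight once (weight * frequency).
            scores[c] += WEIGHTS.get(word, {}).get(c, 0.0)
    total = sum(scores.values())
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: s / total for c, s in scores.items()}


def detect(participles, threshold=0.5, benign="normal"):
    """Return (best category, its probability, whether it is a violation)."""
    probs = category_probabilities(participles)
    best = max(probs, key=probs.get)
    violated = probs[best] >= threshold and best != benign
    return best, probs[best], violated


best, prob, violated = detect(["sale", "China", "board", "cigarette"])
```

For the test case in the text this selects the tobacco category with a probability of about 0.65, which reaches the 0.5 threshold.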
Based on the above scheme, violated information can not only be effectively detected from the different classes of description information of a product, but the category to which the violated information belongs can also be accurately determined.
In combination with the above scheme, it can be seen that in one embodiment of the present application, if the probability that the participle set to be detected contains violated information is calculated for only one category, then when it is determined that the probability reaches the set threshold, it is determined that the at least two classes of description information contain violated information for that category;
if the probability that the participle set to be detected contains violated information is calculated separately for at least two categories, then when it is determined that the maximum probability reaches the set threshold, it is determined that the at least two classes of description information contain violated information for the category corresponding to the maximum probability.
Based on the above embodiments, further, in one embodiment, an information detection method can also be provided, including:
obtaining at least two classes of description information corresponding to a preset product;
obtaining a participle set corresponding to the at least two classes of description information;
determining, based on the frequency of occurrence and the corresponding category of at least one participle in the participle set, whether the participle set contains target information.
Optionally, the at least two classes of description information include any two of the product name information, product classification information and product specifying information of the preset product.
Optionally, determining, based on the frequency of occurrence and the corresponding category of at least one participle in the participle set, whether the participle set contains target information includes:
calculating, based on the frequency of occurrence of the participle to be detected and the weight corresponding to the category of the description information to which it belongs, the probability that the participle set to be detected contains violated information.
Further, in one embodiment of the present application, a storage medium is also provided, which stores a program for implementing the detection method of violated information. When the program is run by a processor, the following steps are executed:
determining a product to be detected, and obtaining at least two classes of description information set for the product;
performing a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain a participle set to be detected;
calculating, based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, the probability that the participle set to be detected contains violated information;
when it is determined that the probability reaches a set threshold, determining that the at least two classes of description information contain violated information.
Further, in one embodiment of the present application, a computer apparatus is also provided, including one or more processors; and
one or more computer-readable media storing instructions which, when executed by the one or more processors, cause the computer apparatus to perform the method described in any of the above embodiments.
Further, as shown in FIG. 3, in one embodiment of the present application, the device for detecting violated information (detection device for short) includes at least an acquiring unit 30, a processing unit 31, a computing unit 32 and a judging unit 33, where:
the acquiring unit 30 is configured to determine a product to be detected and obtain at least two classes of description information set for the product;
the processing unit 31 is configured to perform a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain a participle set to be detected;
the computing unit 32 is configured to calculate, based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, the probability that the participle set to be detected contains violated information;
the judging unit 33 is configured to determine, when it is determined that the probability reaches a set threshold, that the at least two classes of description information contain violated information.
Optionally, after the at least two classes of description information are obtained and before participle division is performed, the processing unit 31 is further configured to:
perform text preprocessing on the obtained at least two classes of description information based on preset text conversion conditions.
Optionally, the text preprocessing includes any one or any combination of the following operations:
converting all uppercase letters in the obtained at least two classes of description information to lowercase;
converting all full-width characters in the obtained at least two classes of description information to half-width characters;
converting all traditional Chinese text in the obtained at least two classes of description information to simplified Chinese text;
removing specified special characters from the obtained at least two classes of description information;
removing specified vocabulary from the obtained at least two classes of description information.
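Four of the five preprocessing operations listed above can be sketched with the standard library alone. The function name `preprocess` and its parameters are hypothetical. Unicode NFKC normalization is used here as one way to fold full-width characters to half-width; it also folds other compatibility characters, which may or may not be wanted. Traditional-to-simplified conversion needs a conversion table or a third-party converter and is left out of this sketch.

```python
import unicodedata


def preprocess(text, special_chars="", stop_words=()):
    """Apply a subset of the text preprocessing operations in order."""
    # Full-width to half-width (NFKC also folds other compatibility forms).
    text = unicodedata.normalize("NFKC", text)
    # Uppercase to lowercase.
    text = text.lower()
    # Remove the specified special characters.
    text = text.translate({ord(ch): None for ch in special_chars})
    # Remove the specified vocabulary (matched after lowercasing).
    for word in stop_words:
        text = text.replace(word, "")
    return text


cleaned = preprocess("ＡＢＣ１２３ Toy★Gun", special_chars="★",
                     stop_words=("toy",))
```

The order of operations matters: lowercasing first means the specified vocabulary only has to be listed in lowercase.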
Optionally, when performing a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain the participle set to be detected, the processing unit 31 is configured to:
perform participle division, at preset different granularities, on the text contained in the obtained at least two classes of description information to obtain a division participle set, where the granularity characterizes the sentence-breaking manner used in participle division;
combine arbitrarily the participles obtained by division from the at least two classes of description information to obtain a combination participle set;
merge the division participle set and the combination participle set to obtain the participle set to be detected.
Optionally, when calculating, based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, the probability that the participle set to be detected contains violated information, the computing unit 32 is configured to:
select the participles to be detected that are to be used from the participle set to be detected, and determine the frequency of occurrence of the participles to be detected in use;
determine the preset weight corresponding to the participles to be detected in use, where the weight is obtained through training and learning on samples of participles to be detected;
calculate, based on the frequency of occurrence and the corresponding weight of the participles to be detected in use, the probability that the participle set to be detected contains violated information.
Optionally, before selecting the participles to be detected that are to be used from the participle set to be detected, the computing unit 32 is further configured to:
count the frequency of occurrence of the participles to be detected contained in the participle set to be detected;
select the N participles to be detected with the highest frequency of occurrence, where N is a preset parameter.
Optionally, when determining the preset weight corresponding to the participles to be detected in use, the computing unit 32 is configured to: determine the weight of at least one preset category corresponding to the participles to be detected in use;
when calculating, based on the frequency of occurrence and the corresponding weight of the participles to be detected in use, the probability that the participle set to be detected contains violated information, the computing unit 32 is configured to: calculate, for the at least one category, the probability that the participle set to be detected contains violated information, based on the frequency of occurrence of the participles to be detected in use and the weight of the at least one category.
Optionally, when determining, upon determining that the probability reaches a set threshold, that the at least two classes of description information contain violated information, the judging unit 33 is configured to:
if the probability that the participle set to be detected contains violated information is calculated for only one category, determine, when it is determined that the probability reaches the set threshold, that the at least two classes of description information contain violated information for that category;
if the probability that the participle set to be detected contains violated information is calculated separately for at least two categories, determine, when it is determined that the maximum probability reaches the set threshold, that the at least two classes of description information contain violated information for the category corresponding to the maximum probability.
In conclusion, in the embodiments of the present application, a participle operation is used to extract a participle set to be detected from the at least two classes of description information corresponding to a product to be detected. Then, based on the frequency of occurrence of at least one participle to be detected, when it is determined that the probability that the at least two classes of description information contain violated information reaches a set threshold, it is determined that the at least two classes of description information contain violated information. In this way, participles to be detected can be accurately reconstructed by recombining the participles from different classes of description information, which effectively prevents a product provider from splitting violated information and hiding the pieces in different places, and greatly improves detection accuracy. At the same time, calculating the probability of containing violated information based on the frequency of occurrence of the participles to be detected in different description information can effectively reduce detection complexity and the amount of detection computation, thereby significantly improving detection efficiency.
Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system or a computer program product. Therefore, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims (14)

1. A detection method of violated information, characterized by comprising:
determining a product to be detected, and obtaining at least two classes of description information set for the product;
performing a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain a participle set to be detected;
calculating, based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, the probability that the participle set to be detected contains violated information;
when it is determined that the probability reaches a set threshold, determining that the at least two classes of description information contain violated information.
2. The method as described in claim 1, characterized in that, after the at least two classes of description information are obtained and before participle division is performed, the method further comprises:
performing text preprocessing on the obtained at least two classes of description information based on preset text conversion conditions.
3. The method according to claim 2, characterized in that the text preprocessing comprises any one or any combination of the following operations:
converting all uppercase letters in the obtained at least two classes of description information to lowercase;
converting all full-width characters in the obtained at least two classes of description information to half-width characters;
converting all traditional Chinese text in the obtained at least two classes of description information to simplified Chinese text;
removing specified special characters from the obtained at least two classes of description information;
removing specified vocabulary from the obtained at least two classes of description information.
4. The method as claimed in claim 1, 2 or 3, characterized in that performing a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information to obtain the participle set to be detected comprises:
performing participle division, at preset different granularities, on the text contained in the obtained at least two classes of description information to obtain a division participle set, where the granularity characterizes the sentence-breaking manner used in participle division;
combining arbitrarily the participles obtained by division from the at least two classes of description information to obtain a combination participle set;
merging the division participle set and the combination participle set to obtain the participle set to be detected.
5. The method as claimed in claim 1, 2 or 3, characterized in that calculating, based on the frequency of occurrence of at least one participle to be detected in the participle set to be detected, the probability that the participle set to be detected contains violated information comprises:
selecting the participles to be detected that are to be used from the participle set to be detected, and determining the frequency of occurrence of the participles to be detected in use;
determining the preset weight corresponding to the participles to be detected in use, where the weight is obtained through training and learning on samples of participles to be detected;
calculating, based on the frequency of occurrence and the corresponding weight of the participles to be detected in use, the probability that the participle set to be detected contains violated information.
6. The method as claimed in claim 5, characterized in that, before the participles to be detected that are to be used are selected from the participle set to be detected, the method further comprises:
counting the frequency of occurrence of the participles to be detected contained in the participle set to be detected;
selecting the N participles to be detected with the highest frequency of occurrence, where N is a preset parameter.
7. The method as claimed in claim 5, characterized in that determining the preset weight corresponding to the participles to be detected in use comprises: determining the weight of at least one preset category corresponding to the participles to be detected in use;
calculating, based on the frequency of occurrence and the corresponding weight of the participles to be detected in use, the probability that the participle set to be detected contains violated information comprises: calculating, for the at least one category, the probability that the participle set to be detected contains violated information, based on the frequency of occurrence of the participles to be detected in use and the weight of the at least one category.
8. The method of claim 7, characterized in that determining, when it is determined that the probability reaches a set threshold, that the at least two classes of description information contain violated information comprises:
if the probability that the participle set to be detected contains violated information is calculated for only one category, determining, when it is determined that the probability reaches the set threshold, that the at least two classes of description information contain violated information for that category;
if the probability that the participle set to be detected contains violated information is calculated separately for at least two categories, determining, when it is determined that the maximum probability reaches the set threshold, that the at least two classes of description information contain violated information for the category corresponding to the maximum probability.
9. An information detection method, characterized by comprising:
obtaining at least two classes of description information corresponding to a preset product;
obtaining a participle set corresponding to the at least two classes of description information;
determining, based on the frequency of occurrence and the corresponding category of at least one participle in the participle set, whether the participle set contains target information.
10. The method as claimed in claim 9, characterized in that the at least two classes of description information comprise any two of the product name information, product classification information and product specifying information of the preset product.
11. The method as claimed in claim 9, characterized in that determining, based on the frequency of occurrence and the corresponding category of at least one participle in the participle set, whether the participle set contains target information comprises:
calculating, based on the frequency of occurrence of the participle to be detected and the weight corresponding to the category of the description information to which it belongs, the probability that the participle set to be detected contains violated information.
12. A storage medium, characterized in that it stores a program for implementing a detection method of violated information, and when the program is run by a processor, the following steps are executed:
Determining a product to be detected, and obtaining at least two classes of description information set for the product;
Performing a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a participle set to be detected;
Calculating, based on the frequency of occurrence of at least one participle to be detected included in the participle set to be detected, the probability that the participle set to be detected contains violated information;
When it is determined that the probability reaches a set threshold, determining that the at least two classes of description information contain violated information.
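The four steps of this claim can be walked through end to end in a short sketch. It rests on several assumptions: a regular-expression tokenizer stands in for the "set manner" of segmentation (Chinese text would need a real word segmenter), the share of participles found in a hypothetical lexicon stands in for the probability, and the names `SUSPECT_TERMS`, `segment` and `detect` are invented here.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical lexicon of suspicious participles.
SUSPECT_TERMS = {"gun", "ammo"}


def segment(text: str) -> List[str]:
    """Stand-in segmentation: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect(description: Dict[str, str], threshold: float = 0.2) -> bool:
    """Return True if the description is judged to contain violated information."""
    if len(description) < 2:
        raise ValueError("at least two classes of description information are required")
    # Step 2: segment every class of description information into one set.
    participles = [p for text in description.values() for p in segment(text)]
    if not participles:
        return False
    # Step 3: frequency of occurrence, turned into a crude probability.
    counts = Counter(participles)
    hits = sum(n for term, n in counts.items() if term in SUSPECT_TERMS)
    probability = hits / len(participles)
    # Step 4: compare with the set threshold.
    return probability >= threshold
```

For example, `detect({"name": "Toy gun", "details": "gun with ammo"})` yields five participles, three of which are in the lexicon, so the ratio 0.6 reaches the default threshold.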
13. A computer device, characterized by comprising one or more processors; and
One or more computer-readable media having instructions stored thereon which, when executed by the one or more processors, cause the computer device to perform the method as claimed in any one of claims 1 to 11.
14. A detection device for violated information, characterized by comprising:
An acquiring unit, configured to determine a product to be detected and obtain at least two classes of description information set for the product;
A processing unit, configured to perform a participle operation, in a set manner, on the text contained in the obtained at least two classes of description information, to obtain a participle set to be detected;
A computing unit, configured to calculate, based on the frequency of occurrence of at least one participle to be detected included in the participle set to be detected, the probability that the participle set to be detected contains violated information;
A judging unit, configured to determine, when it is determined that the probability reaches a set threshold, that the at least two classes of description information contain violated information.
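The division into four units suggests a modular layout in which each unit can be swapped independently. The sketch below is one possible arrangement, assuming each unit is a plain callable; the class name `ViolatedInfoDetector` and its field names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ViolatedInfoDetector:
    # Acquiring unit: product identifier -> classes of description information.
    acquire: Callable[[str], Dict[str, str]]
    # Processing unit: description information -> participle set to be detected.
    segment: Callable[[Dict[str, str]], List[str]]
    # Computing unit: participle set -> probability of violated information.
    score: Callable[[List[str]], float]
    # Threshold used by the judging unit.
    threshold: float

    def judge(self, product_id: str) -> bool:
        """Judging unit: run the other units and compare with the threshold."""
        description = self.acquire(product_id)
        participles = self.segment(description)
        return self.score(participles) >= self.threshold
```

Keeping the units as injected callables means a different segmenter or scoring model can be substituted without touching the judging logic.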
CN201710375296.9A 2017-05-24 2017-05-24 A kind of detection method and device of violated information Pending CN108960952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710375296.9A CN108960952A (en) 2017-05-24 2017-05-24 A kind of detection method and device of violated information

Publications (1)

Publication Number Publication Date
CN108960952A (en) 2018-12-07

Family

ID=64494291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710375296.9A Pending CN108960952A (en) 2017-05-24 2017-05-24 A kind of detection method and device of violated information

Country Status (1)

Country Link
CN (1) CN108960952A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178714A (en) * 2006-12-20 2008-05-14 腾讯科技(深圳)有限公司 Web page classification method and device
CN102411563A (en) * 2010-09-26 2012-04-11 阿里巴巴集团控股有限公司 Method, device and system for identifying target words
CN102663025A (en) * 2012-03-22 2012-09-12 浙江盘石信息技术有限公司 Illegal online commodity detection method
CN102663093A (en) * 2012-04-10 2012-09-12 中国科学院计算机网络信息中心 Method and device for detecting bad website
CN106383862A (en) * 2016-08-31 2017-02-08 杭州云片网络科技有限公司 Violation short message detection method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797622A (en) * 2019-06-20 2020-10-20 北京沃东天骏信息技术有限公司 Method and apparatus for generating attribute information
CN111797622B (en) * 2019-06-20 2024-04-09 北京沃东天骏信息技术有限公司 Method and device for generating attribute information
CN112990938A (en) * 2019-12-17 2021-06-18 阿里巴巴集团控股有限公司 Method, device and system for detecting object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181207