CN109213859A - Text detection method, apparatus and system
- Publication number
- CN109213859A (application number CN201710549655.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- data
- inquiry
- identified
- behavioural characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0203—Market surveys; Market polls
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The embodiments of the present application disclose a text detection method, apparatus and system. The method includes: generating text feature data based on the information content of a text to be identified; obtaining behavior data of predetermined types for the user associated with the text to be identified, and generating behavior feature data; combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data; and processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result for the text to be identified from the processing result. The embodiments of the present application can improve the accuracy of spam text recognition and can detect spam texts such as spam inquiries, spam email and malicious comments promptly and effectively, with better accuracy, stability and timeliness, which improves the security of the text information environment.
Description
Technical field
The present application belongs to the field of computer data processing technology, and in particular relates to a text detection method, apparatus and system.
Background
With the rapid development and spread of Internet technology, business websites are becoming more varied and their content richer. At present, the inquiry is an important means of communication between the two parties to a transaction on a business website; it lets buyers and sellers effectively promote products or learn the other party's business needs.
An inquiry usually refers to a buyer asking a seller, by leaving a message on a business website, about product details such as price and specifications. An inquiry generally consists of no more than 200 words or phrases in total and therefore counts as short text; other common short text types include comments, messages and SMS. An inquiry can be sent to the other party by email, instant messaging tools and so on. At present, however, service environments for inquiries or inquiry-like messages, such as business websites, email and RFQ (short for Request for Quotation, a business model in which a buyer publishes a detailed description of its procurement needs to an open market so that sellers can find the buyer and offer quotations), usually contain a large number of spam inquiries. These interfere with the information users receive and bring risks to funds, accounts and information security. A spam inquiry usually refers to an inquiry sent by a buyer to a seller that has no real business meaning for the seller. Spam inquiries come in many types, mainly text spam inquiries, phishing inquiries and advertising inquiries. A phishing inquiry in particular uses disguise to trick the recipient into replying with information such as account names and passwords to a designated recipient, or to lure the recipient to a specially crafted web page. Such pages usually impersonate a real site, such as a bank or wealth-management page, so that the visitor believes them to be genuine; when the visitor logs in on these pages, the account name and password are stolen.
Existing approaches to identifying spam inquiries are mainly based on the text content of the inquiry, for example a naive Bayes scheme. Such approaches can identify purely text-based spam inquiries to some extent, but phishing and fraud inquiries are very similar in content to normal inquiries and are hard to distinguish by text alone. For phishing and fraud inquiries, the usual business practice is to first identify spam accounts through detection and judgment strategies, and then find spam inquiries through their association with those accounts. This method needs to accumulate behavior data over a certain period of time and therefore suffers from lag.
In the prior art, spam inquiries are commonly identified by modeling each spam inquiry type separately. The detection approach is one-dimensional, and the recognition results are both partial (for example, the content-based approach above cannot identify phishing inquiries) and delayed. As a result, the overall accuracy of spam inquiry recognition is currently low and the recognition effect is poor, which degrades the user experience and the security of inquiry information.
Summary of the invention
The purpose of the present application is to provide a text detection method, apparatus and system that can improve the accuracy of spam text recognition and detect spam texts such as spam inquiries, spam email and malicious comments promptly and effectively, with better accuracy, stability and timeliness, improving the security of the text information environment.
The text detection method, apparatus and system provided by the present application are implemented as follows:
A text detection method, the method comprising:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of predetermined types for the user associated with the text to be identified, and generating behavior feature data;
combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result for the text to be identified from the processing result.
A spam inquiry detection method, comprising:
generating text feature data based on the information content of an inquiry to be identified;
obtaining behavior data of predetermined types for the user associated with the inquiry to be identified, and generating behavior feature data;
combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a spam inquiry recognition model built offline in advance, and judging from the processing result whether the inquiry to be identified is a spam inquiry.
A text detection apparatus, the apparatus comprising:
a text feature extraction module, configured to generate text feature data based on the information content of a text to be identified;
a behavior feature extraction module, configured to obtain behavior data of predetermined types for the user associated with the text to be identified, and to generate behavior feature data;
a feature combination module, configured to combine data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
a detection module, configured to process the combined feature data with a pre-built combined feature recognition model, and to determine the detection result for the text to be identified from the processing result.
A text detection apparatus, comprising:
a first extraction module, configured to generate text feature data based on the information content of an inquiry to be identified;
a second extraction module, configured to obtain behavior data of predetermined types for the user associated with the text to be identified, and to generate behavior feature data;
a feature combination module, configured to combine data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
an inquiry detection module, configured to process the combined feature data with a spam inquiry recognition model built offline in advance, and to judge from the processing result whether the inquiry to be identified is a spam inquiry.
A text detection apparatus, comprising a processor and a memory for storing instructions executable by the processor, wherein the processor, when executing the instructions, implements:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of predetermined types for the user associated with the text to be identified, and generating behavior feature data;
combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result for the text to be identified from the processing result.
A computer-readable storage medium storing computer instructions which, when executed, implement the following steps:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of predetermined types for the user associated with the text to be identified, and generating behavior feature data;
combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result for the text to be identified from the processing result.
A spam text detection system, comprising the text detection apparatus described in any one of the embodiments in this specification.
With the text detection method, apparatus and system provided by the present application, both text content and behavior data are used when detecting a text. The feature data of the two types are combined, so a single, unified machine learning model can be used for training and prediction. The application can use the feature data of the text content to detect text along the content dimension. At the same time, the behavior data associated with the text can be obtained right when the text is sent (which may include a period of time before and after sending), and the feature data of this behavior data is then combined with the feature data of the text content to form new text detection data. Compared with the lag of existing approaches such as spam account identification, the embodiments of the present application greatly improve the timeliness of spam text detection. The application can combine text content and behavior feature data offline using historical data and train on them with the same machine learning model, and then identify and detect texts online. This also avoids the manual-experience intervention found in some existing implementations and the uncontrollable, unstable recognition results it brings. Therefore, with the embodiments of the present application, the behavior data associated with a text can be obtained while the text is being sent and combined, as feature data, with the text content data, which can greatly improve the accuracy and timeliness of spam text detection and improve the security of text information. In addition, the combined feature data is processed by a recognition model built in advance from the combined feature data of historical texts, which makes the text detection results more stable and reliable.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some of the embodiments recorded in the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the text detection method described in the present application;
Fig. 2 is a schematic diagram of one way of training and building the combined feature recognition model in the method described in the present application;
Fig. 3 is a schematic diagram of a process for training offline and building the combined feature recognition model in the method described in the present application;
Fig. 4 is a schematic diagram of one embodiment of combining text feature data and behavior feature data provided by the present application;
Fig. 5 is a schematic diagram of another embodiment of combining text feature data and behavior feature data provided by the present application;
Fig. 6 is a schematic diagram of the process of identifying spam inquiries in one embodiment of the present application;
Fig. 7 is a schematic flowchart of a method for identifying spam inquiries provided by the present application;
Fig. 8 is a schematic diagram of the module structure of an embodiment of a text detection apparatus provided by the present application;
Fig. 9 is a schematic diagram of the module structure of an embodiment of an inquiry detection apparatus provided by the present application;
Fig. 10 is a schematic diagram of the module structure of an embodiment of a text detection apparatus provided by the present application.
Detailed description
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
Fig. 1 is a schematic flowchart of an embodiment of the text detection method described in the present application. Although the present application provides method operation steps or apparatus structures as shown in the following embodiments or drawings, the method or apparatus may, on the basis of routine or non-creative work, include more operation steps or module units, or fewer after some are merged. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to the execution order or module structure shown in the embodiments or drawings of the present application. When the method or module structure is applied in an actual apparatus, server or end product, it can be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in an environment with parallel processors or multithreaded processing, or even in a distributed processing or server cluster environment).
Existing spam inquiries are mainly spam inquiry texts, phishing inquiries, advertising inquiries and the like sent to a large number of sellers in the form of web pages. This "casting a wide net" approach can seriously degrade the user experience and information security. The embodiments provided by the present application take into account that the behavior shown while sending a fraud or phishing inquiry differs from that of a normal inquiry, so the behavior data can be used as soon as the inquiry is sent and combined with the feature data of the text content to detect the text. The embodiments are described below using the identification of web-page spam inquiries as the application scenario. However, those skilled in the art will understand that the essence of this solution can be applied to other text detection scenarios, such as spam email identification, malicious comments or malicious messages on pages or in instant messaging tools, and RFQ business scenarios. These alternatives are not described one by one below, and their applicability in other scenarios is not elaborated here.
In one implementation of the text detection method provided by the present application, historical inquiry data can be fed offline, in advance, into a selected machine learning model for training, to build a text recognition model based on the combined features of text content and behavior data. The model can then be used to recognize and detect a text to be identified and output a detection result. A specific implementation flow of the text detection method, as shown in Fig. 1, may include:
S1: generating text feature data based on the information content of the text to be identified;
S2: obtaining behavior data of predetermined types for the user associated with the text to be identified, and generating behavior feature data;
S3: combining data that includes the text feature data and the behavior feature data in a predetermined manner to generate combined feature data;
S4: processing the combined feature data with the pre-built combined feature recognition model, and determining the detection result for the text to be identified from the processing result.
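Steps S3 and S4 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the "predetermined manner" is plain concatenation, and `toy_model`, the 0.5 threshold and the example vectors are hypothetical stand-ins for a trained combined feature recognition model and real feature data.

```python
from typing import Callable, List, Sequence


def combine_features(text_vec: Sequence[float],
                     behavior_vec: Sequence[float]) -> List[float]:
    """S3 (assumed form): concatenate text and behavior features."""
    return list(text_vec) + list(behavior_vec)


def detect_text(text_vec: Sequence[float],
                behavior_vec: Sequence[float],
                model: Callable[[List[float]], float],
                threshold: float = 0.5) -> bool:
    """S4: score the combined features and compare with a threshold.

    Returns True when the text is flagged as spam.
    """
    combined = combine_features(text_vec, behavior_vec)
    return model(combined) >= threshold


def toy_model(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model.

    Treats index 3 (the first behavior dimension, page dwell time in
    seconds) as suspicious when it is under 2 seconds.
    """
    return 0.9 if features[3] < 2.0 else 0.1


text_features = [0.1, 0.2, 0.3]          # S1 output (illustrative)
slow_sender = [5.0, 3.0, 1.0]            # S2 output (illustrative)
fast_sender = [1.0, 3.0, 1.0]

flag_slow = detect_text(text_features, slow_sender, toy_model)
flag_fast = detect_text(text_features, fast_sender, toy_model)
```

In this sketch the same text content is flagged or not depending only on the behavior dimensions, which is the point of combining the two feature types.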
In general, the text may include a piece of data that records and stores textual information, and may carry its own text label to distinguish different text data sets. In the application scenario of this embodiment, the text may include information content such as comments, messages and SMS. The text can be transmitted by email, instant messaging tools, resource packages and so on to exchange information between sender and recipient; in one scenario, for example, an inquiry in page format can be sent to the recipient as an in-site message. Note that the text described here is not limited to content in a textual, character data format. In other application scenarios the text may also include, without limitation, data in image, sound or video formats; for example, the content of an inquiry may include an advertisement in image form, a Flash animation timed to play for 5 seconds, a link to a merchant page, and so on.
In addition, spam inquiry identification in this application scenario is not limited to the conventional scenario of a buyer sending an inquiry to a seller. An inquiry in this embodiment can refer to the business act in which a party preparing to buy or sell certain goods asks a potential supplier or buyer about the possibility of a deal or the trading terms for those goods. Therefore, the texts recognized and detected may also include inquiries sent by a seller to a seller.
In this embodiment, a machine learning model can first be trained on historical text data; training yields the combined feature recognition model used to detect texts to be identified online. A machine learning model usually refers to the set of processing logic by which a computer processes existing data with machine learning algorithms to train a model, and then uses the model to make predictions on new data and output results. This embodiment can be implemented with a variety of models, such as random forests, logistic regression models and convolutional neural networks. In machine learning, the process of running existing data through a machine learning algorithm is generally called training; the result of that process, which can be used to make predictions on new data, is generally called the model; and the process of applying the model to new data is called prediction.
The historical data used for model training may include the specific information content of texts and the behavior data of the users corresponding to those texts. Features can then be extracted from both kinds of data and concatenated to generate the sample data for training the machine learning model. In one specific embodiment, the combined feature recognition model may be built as follows:
S30: obtaining historical data of identified texts, extracting text feature data of the identified texts, and obtaining behavior feature data of predetermined types for the users associated with the identified texts;
S31: concatenating the text feature data and the behavior feature data in the predetermined manner to produce sample feature data;
S32: training the selected machine learning model on the sample feature data to obtain the combined feature recognition model.
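As a rough illustration of S31 and S32, the sketch below trains a small logistic regression (one of the model families the text names as an option) on concatenated sample vectors using only the Python standard library. The sample values, labels, learning rate and epoch count are invented for the example; a production system would more likely use an established library and, per this embodiment, possibly a random forest.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: Sequence[float], bias: float,
                  x: Sequence[float]) -> float:
    """Probability that sample x is spam under the current parameters."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(samples: List[List[float]],
                              labels: List[int],
                              lr: float = 0.1,
                              epochs: int = 500) -> Tuple[List[float], float]:
    """S32 (sketch): fit weights by stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# S31 (sketch): each sample = text features (2 dims) + behavior features
# (2 dims: scaled dwell time, login-mode code). All values are illustrative.
samples = [
    [0.2, 0.1] + [0.1, 1.0],   # labeled spam
    [0.3, 0.2] + [0.2, 1.0],   # labeled spam
    [0.2, 0.1] + [0.9, 0.0],   # labeled normal
    [0.3, 0.2] + [0.8, 0.0],   # labeled normal
]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(samples, labels)
spam_score = predict_proba(weights, bias, [0.2, 0.1, 0.1, 1.0])
normal_score = predict_proba(weights, bias, [0.2, 0.1, 0.9, 0.0])
```

Here the text dimensions are identical across the two classes, so the learned separation comes from the behavior dimensions, mirroring the phishing case where content alone does not distinguish the texts.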
Fig. 2 is a schematic diagram of one way of training and building the combined feature recognition model in the method described in the present application. The combined feature recognition model built in this way from historical data that includes feature data of the text content and behavior data can be used to detect a text to be identified and determine its category. The historical data may include data collected over a period of time, and may also include data collected and stored in real time. For example, daily inquiries can be obtained in real time and stored in a database to update the historical data, so that the combined feature recognition model can be trained further and its parameters tuned to improve the accuracy of text classification. The text feature data can be extracted from the historical data. In one scenario the behavior feature data can also be obtained from the historical data; in other embodiments the behavior feature data of the predetermined types can be obtained from another database, for example from user operation data collected through tracking points (event instrumentation).
In the specific scenario of the present application, before spam inquiries are detected online, a model for identifying spam inquiries can first be built offline. The model can be a machine learning model trained on historical inquiry data; online, it can make predictions on newly input inquiries to be identified and output their detection results. As noted above, a machine learning model usually refers to the set of processing logic by which a computer trains a model on existing data with machine learning algorithms and then uses the model to make predictions on new data and output results; running existing data through the algorithm is called training, the result is called the model, and applying it to new data is called prediction. In this embodiment a random forest can be used as the model trained on historical inquiry data. Of course, in other scenarios, other machine learning models can be chosen according to processing needs or the actual environment, such as a logistic regression model (LR, Logistic Regression), a neural network, or a statistical model such as a scorecard model.
In the page spam inquiry scenario of this embodiment, historical inquiry data for a period of time can first be extracted from a database. This historical inquiry data may include the text content of each inquiry and the behavior data from before and after the inquiry was sent. The behavior data may include page operation behavior data, or other data of specified types, captured before and after the inquiry is sent. (For ease of description, any stage before, during or after the sending of an inquiry is collectively referred to here as "before and after sending"; the same applies in other scenarios, such as before and after a text, a text to be identified or a message is sent.) In this embodiment, the behavior data of the predetermined types may include data on preset behavior types generated by the sender of an identified text before and after that text is sent. Examples include the total page dwell time around sending, mouse movement speed and route (the track of the mouse), account login path, the time interval between inquiries sent consecutively by an account, and the browsing path. Among these predetermined types, one or more types may cover only data generated before the text (for example an inquiry) is sent, or only data generated after sending, while some types may cover data generated across both. In the inquiry-sending scenario above, for example, the behavior data of one predetermined type may be the total page dwell time from entering the inquiry editing page until leaving the page after the inquiry is sent. In another example, the behavior data may include IP address change information, for instance a sender that actively changes its IP address at fixed intervals or after sending a fixed number of inquiries. Which types of behavior data need to be obtained can be set in advance, and the behavior data of those types can then be extracted once the historical inquiry data has been obtained.
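Several of the behavior types listed above reduce to simple arithmetic over logged events. The sketch below shows one possible derivation of page dwell time, intervals between consecutive sends, and the number of IP address changes. The function names, the use of numeric timestamps in seconds, and the sample addresses (taken from the reserved documentation range) are assumptions made for illustration; the source does not specify a log format.

```python
from typing import List


def page_dwell_seconds(enter_ts: float, leave_ts: float) -> float:
    """Total time on the page, from entering the inquiry editing page
    until leaving it after the inquiry is sent."""
    return leave_ts - enter_ts


def send_intervals(send_timestamps: List[float]) -> List[float]:
    """Time gaps between consecutive inquiries sent by one account."""
    ordered = sorted(send_timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def ip_change_count(ips_in_send_order: List[str]) -> int:
    """Number of times the sender's IP differs from the previous send."""
    return sum(1 for prev, cur in zip(ips_in_send_order, ips_in_send_order[1:])
               if prev != cur)


dwell = page_dwell_seconds(enter_ts=100.0, leave_ts=105.0)
intervals = send_intervals([10.0, 40.0, 25.0])
ip_changes = ip_change_count(["192.0.2.1", "192.0.2.1", "192.0.2.7"])
```

Values such as these would then be placed into the behavior feature vector described later in this section.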
Of course, in some scenarios, if the historical inquiry data does not contain such behavior data, the types of behavior data to collect can be set in advance, and model training can be carried out on the historical data after it has been collected for a period of time.
After the historical inquiry data is obtained, it can be separated into two classes of data: the behavior data of the inquiry sending process and the inquiry text content, as shown in Fig. 3. Fig. 3 is a schematic diagram of a process for training offline and building the combined feature recognition model in the method described in the present application. Feature data can then be extracted from the inquiry text content data and from the behavior data respectively. For the inquiry text content data, text feature data can be generated based on the specific information content of the inquiry text. In one embodiment provided by the present application, the inquiry content can be mapped into a high-dimensional space by means such as word2vec or doc2vec and then represented by a vector of length n, such as [w1, w2, w3, ..., wn]. Specifically, in one embodiment of the method, the text feature data includes:
S101: a vector of length n, n ≥ 1, generated by mapping the information content of the identified text into a high-dimensional space.
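To make the shape of S101 concrete, the sketch below maps a text to a fixed-length vector of n values. It deliberately does not implement doc2vec or word2vec, which require a trained model; instead it substitutes a much simpler hashed bag-of-words encoding so that the example stays self-contained. The function name, the default n = 8 and the whitespace tokenization are all assumptions for illustration.

```python
import hashlib
from typing import List


def text_to_vector(text: str, n: int = 8) -> List[float]:
    """Map text to a length-n vector via hashed token counts.

    A stand-in for doc2vec: each token is hashed to one of n buckets,
    and bucket counts are normalized to sum to 1.
    """
    vec = [0.0] * n
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n] += 1.0
    total = sum(vec)
    return [v / total for v in vec] if total else vec


inquiry_vec = text_to_vector("please send price and specification")
```

Unlike doc2vec, this encoding carries no learned semantics; it only shows how every inquiry, whatever its length, ends up as a vector [w1, w2, ..., wn] of the same dimension.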
Doc2vec is a method developed on the basis of word2vec; it is a technique that maps a sentence or a document into a high-dimensional space. It can represent a passage as a real-valued vector, where each value of the vector represents the coordinate in the corresponding dimension. This approach can use deep learning: through training, the processing of text content is reduced to vector operations in an n-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity between texts. Therefore, the embodiments of the present application can use word2vec or doc2vec to extract text feature data from the inquiry text content. Such text feature data reflects the information content of the inquiry to some extent and can be used to identify spam inquiries based on text content. Each inquiry can yield its own text feature data. The embodiments of the present application choose doc2vec to map text into a multidimensional space; other embodiments can choose other mapping methods or text feature extraction methods, such as the vector space model or statistical feature extraction.
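The idea that similarity in the vector space can stand for semantic similarity is commonly measured with cosine similarity. The source does not name a specific measure, so the choice of cosine similarity here is an assumption; the sketch below is a plain standard-library version.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length.

    Returns 0.0 when either vector has zero length.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, so two inquiries with near-identical wording would be expected to have text feature vectors with a high score.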
For the behavior data, behavior feature data in a set data format can be produced. The specific format of the behavior feature data can be configured according to data processing needs and/or the runtime environment. For example, if m predetermined types of behavior data need to take part in the computation to identify spam inquiries in this embodiment, the behavior feature data can be set to a feature vector of length m, such as [a1, a2, a3, ..., am]. Each dimension can represent the behavior data of one predetermined type, and its value can be obtained by converting or computing from the behavior data. Therefore, in one embodiment of the present application, the behavior feature data may include:
S201: a vector of length m, m ≥ 1, generated by taking each predetermined type as a vector dimension and the value of the behavior data of that type as the coordinate on the corresponding dimension.
In a specific example, suppose the obtained behavior data show that, before an inquiry was sent, the page dwell time was 5 seconds, the number of pages browsed on the site was 3, and the mouse movement speed was 1 meter per second. The behavior feature data can then be the feature vector [5, 3, 1] of length 3. Each vector dimension of the behavior feature data can correspond to one predetermined type; in [a1, a2, a3, ..., am], for example, dimension a1 can represent the page dwell time before the inquiry was sent, a2 the total number of pages open in the current browser, and so on. In one embodiment, the behavior data extracted from the historical data has a mapping relationship with the behavior feature data: the value of each vector dimension can be the value of the corresponding behavior data taken directly, or the behavior feature data of the text can be generated from the behavior data through reshaping, transformation, calculation and so on. For example, suppose a piece of behavior data records the sender's account login mode. If it is "account and password login", it is recorded as feature dimension a4 with value 0; if the login mode is "QR code scan login", the value of a4 is 1. By analogy, the behavior data of any given type can be converted into the corresponding behavior feature data.
In addition, the values of the predefined types of behavioural data can be arranged in a preset order within the behavioural feature data. As shown in Fig. 4, the order may be [page dwell time, total pages browsed on the website, mouse movement speed, account login mode], or, according to design requirements, [account login mode, total pages browsed on the website, mouse movement speed, page dwell time]. In this embodiment, different orderings of the vector dimensions produce different behavioural feature data and, correspondingly, different combined feature vectors in the subsequent feature combination. In this way, multiple sets of combined feature data can be derived from the same text for model training or online text detection.
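The construction of the behavioural feature vector described in S201, including the login-mode encoding and the configurable dimension order, can be sketched as follows. This is a minimal sketch under stated assumptions: the key names in `DEFAULT_ORDER`, the codes in `LOGIN_MODE_CODES` and the function `behaviour_vector` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical names for the predefined behaviour types; the order fixes the vector dimensions.
DEFAULT_ORDER = ("page_dwell_seconds", "pages_browsed", "mouse_speed_m_per_s", "login_mode")

# Categorical login modes are converted to numbers: 0 = account/password, 1 = QR code scan.
LOGIN_MODE_CODES = {"password": 0, "qr_code": 1}


def behaviour_vector(
    record: Dict[str, Union[int, float, str]],
    order: Sequence[str] = DEFAULT_ORDER,
) -> List[float]:
    """Build an m-dimensional behavioural feature vector, one dimension per predefined type."""
    vector = []
    for name in order:
        value = record[name]
        if name == "login_mode":
            value = LOGIN_MODE_CODES[value]  # convert the category to its numeric code
        vector.append(float(value))
    return vector


record = {
    "page_dwell_seconds": 5,
    "pages_browsed": 3,
    "mouse_speed_m_per_s": 1,
    "login_mode": "qr_code",
}
print(behaviour_vector(record))  # [5.0, 3.0, 1.0, 1.0]

# A different dimension order yields different behavioural feature data for the same record.
alt_order = ("login_mode", "pages_browsed", "mouse_speed_m_per_s", "page_dwell_seconds")
print(behaviour_vector(record, order=alt_order))  # [1.0, 3.0, 1.0, 5.0]
```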
After the above text feature and behavioural feature extraction, each inquiry in the historical inquiry data corresponds to text feature data in the form of an n-dimensional feature vector and behavioural feature data in the form of an m-dimensional feature vector. The text feature data and the behavioural feature data can then be spliced together to generate the sample feature data of the corresponding inquiry.
The embodiments of the present application may combine text feature data with behavioural feature data. Of course, the application does not exclude adding, in other embodiments, other data used to identify and detect text, such as the account of the inquiry sender, the credit data of the inquiry sender, and the login IP or the region of the login address. Such data can be combined with the text feature data and behavioural feature data as one or more new feature vectors. For example, if the credit score is 69, the credit feature data T may be [69]. Alternatively, for login address feature data A, if the login address is consistent with the text sender's registered address or usual login address, A takes the value 0; the greater the difference between the two in physical or logical address, the larger the value of A. The values of A and/or T then take part in the combined feature data and are used for model training or text recognition together with the text feature data and behavioural feature data.
The specific way of combining the text feature data and behavioural feature data can vary between embodiments according to data processing needs, the parameter settings of the machine learning model, and so on: for example, adding values on corresponding dimensions, combining after weighting part of the data, or combining after converting to another data form and normalising. In one embodiment provided by the present application, combining the text feature data and behavioural feature data of the identified text in a predetermined manner may include:
S301: splicing the text feature data with the behavioural feature data to generate combined feature data of length (n+m).
For example, in this embodiment's scenario of identifying spam inquiries, after the text feature data and behavioural feature data are spliced, the new combined feature data is an (n+m)-dimensional feature vector such as [w1, w2, w3, ..., wn, a1, a2, a3, ..., am]. Of course, other embodiments may add further data, such as credit feature data, in which case the combined feature data may be [w1, w2, w3, ..., wn, a1, a2, a3, ..., am, t]. The relative order of the different types of data within the combined feature data can be customised; for example, the credit data may be placed first, followed by the behavioural feature data and then the text feature data. Fig. 4 is a schematic diagram of one embodiment of combining text feature data and behavioural feature data provided by the present application.
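The splicing in S301 amounts to concatenating the vectors. A minimal sketch follows, in which the function name `splice_features`, the optional `extra` argument (for data such as the credit feature t) and the sample values are illustrative assumptions.

```python
from typing import List, Sequence


def splice_features(
    text_vec: Sequence[float],
    behaviour_vec: Sequence[float],
    extra: Sequence[float] = (),
) -> List[float]:
    """Concatenate the n text features, the m behavioural features and any extra features."""
    return list(text_vec) + list(behaviour_vec) + list(extra)


text_vec = [0.12, -0.40, 0.33]        # n = 3 (illustrative values)
behaviour_vec = [5.0, 3.0, 1.0, 1.0]  # m = 4
combined = splice_features(text_vec, behaviour_vec)
with_credit = splice_features(text_vec, behaviour_vec, extra=[69.0])
print(len(combined), len(with_credit))  # 7 8
```

Whatever order is chosen for the parts, the same order has to be used for training and for online detection, since the model treats each position as a distinct feature.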
The combination may also take other forms. For example, the values of the text feature data and behavioural feature data of the identified text on corresponding dimensions may be operated on, such as added or multiplied, or computed by another preset algorithm and merged into a single value on that dimension, giving the combined feature data on that dimension. With addition on corresponding dimensions, the result is [w1+a1, w2+a2, w3+a3, ..., wn+am]; if n and m are not equal, the vacant positions can be filled with 0 or another preset value. Fig. 5 is a schematic diagram of another embodiment of combining text feature data and behavioural feature data provided by the present application: if n is larger than m by 2, the combined feature data contains n values, the last two of which are w(n-1) and wn.
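The dimension-wise combination, including the filling of vacant positions when n and m differ, can be sketched as below. The function name `combine_by_dimension` and its `op` and `fill` parameters are hypothetical; the example reproduces the case in which n is larger than m by 2.

```python
import operator
from itertools import zip_longest
from typing import Callable, List, Sequence


def combine_by_dimension(
    text_vec: Sequence[float],
    behaviour_vec: Sequence[float],
    op: Callable[[float, float], float] = operator.add,
    fill: float = 0.0,
) -> List[float]:
    """Combine two vectors dimension by dimension, padding the shorter one with `fill`."""
    return [op(w, a) for w, a in zip_longest(text_vec, behaviour_vec, fillvalue=fill)]


# n = 4, m = 2: the last two values are simply w3 and w4 (padded with 0 before adding).
print(combine_by_dimension([1.0, 2.0, 3.0, 4.0], [10.0, 20.0]))  # [11.0, 22.0, 3.0, 4.0]
```

If multiplication is used instead (`op=operator.mul`), a fill value of 1 rather than 0 keeps the unmatched values unchanged; which preset value is appropriate depends on the chosen operation.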
Once the sample feature data is formed, it can be fed into the random forest model used in this embodiment for training. After the model trained on sample feature data generated from historical inquiry data reaches a certain prediction metric, it can be deployed online to classify inquiries to be identified and judge whether they are spam inquiries. In machine learning, a random forest is a classifier that contains multiple decision trees. In one implementation, the main processing may include:
First, sampling with replacement from the sample data set to construct sub-datasets, each with the same amount of data as the original data set. Elements may repeat across different sub-datasets and within the same sub-dataset.
Second, building a sub decision tree from each sub-dataset; when data is fed into the sub decision trees, each outputs one result. Finally, when new data needs to be classified by the random forest, the output of the random forest is obtained by voting over the judgments of the sub decision trees.
For example, suppose the random forest contains 3 sub decision trees. If, after the text to be identified is processed, 2 subtrees classify it as class A (spam inquiry) and 1 subtree classifies it as class B (normal inquiry), the random forest classifies it as class A, a spam inquiry.
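Two of the steps above, sampling with replacement and voting, can be illustrated in isolation. The sketch below does not build decision trees; the function names `bootstrap_sample` and `majority_vote` are assumptions for illustration, and the tree outputs are given directly as labels.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_sample(dataset: Sequence[T], rng: random.Random) -> List[T]:
    """Draw len(dataset) items with replacement, so items may repeat."""
    return [rng.choice(dataset) for _ in range(len(dataset))]


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the most common class label among the sub decision trees' outputs."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
samples = list(range(10))
subset = bootstrap_sample(samples, rng)  # same size as the original, repeats allowed
print(len(subset) == len(samples))       # True
print(majority_vote(["A", "A", "B"]))    # A
```

With an even number of trees a tie is possible; this sketch then returns whichever tied label appeared first, which is one of several reasonable tie-breaking choices.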
Once the combined feature recognition model has been built in the above way, it can be used by the online system to recognise and detect text. When a text needs to be detected, text feature data can be generated from the information content of the text to be identified, while the behavioural data corresponding to that text is acquired and behavioural feature data is generated. After feature data including at least these two categories is combined, it is processed by the trained combined feature recognition model to obtain the detection result for the text to be identified. A specific implementation scenario is shown in Fig. 6, a schematic diagram of the process of identifying spam inquiries in one embodiment of the present application. It may include a database storing user inquiry process state data and a database storing the specific text content data, from which behavioural feature data and text feature data are extracted respectively for training. Acquiring the behavioural feature data of the text to be identified may include acquiring data on the preset behaviour types that the user generates before and after the text is sent, which can greatly improve the timeliness of text detection. Moreover, because the text content and behavioural data are extracted automatically and combined into feature data under uniform rules for training and recognition, subjective manual intervention is reduced, and the accuracy and reliability of text recognition are also significantly improved.
It should be noted that, in both the process of building the combined feature recognition model and the process of detecting text to be identified in real time, the application does not limit the order in which text feature data and behavioural feature data are acquired; in some embodiments the two kinds of data can be processed and acquired at the same time. In addition, the historical data may include the information content of the text as well as behavioural data associated with the text, such as attribute information, operation behaviour, and the login behaviour of the associated account. In some implementation scenarios, only the text is taken from the historical data while the behavioural data is acquired in another way or from another database (such as a data service provided by a third party); those skilled in the art will understand that such data also falls within the scope of the historical data described in the present application.
The above embodiments are described using spam inquiry identification as the implementation scenario, but the substantive innovative idea of the present application can also be used to identify, detect and classify text in other business scenarios, such as spam email, malicious comments, flooding messages, instant messaging logs, email attachments, and attachments sent by instant messaging. Therefore, in another embodiment of the method, the text to be identified may include at least one of the following text types:
Inquiry information, email information, comments, messages, RFQ information, instant messaging chat records, attachments.
The text detection method provided by the present application uses both text content and behavioural data; by combining the feature data of the two types, a unified machine learning model can be used for training and prediction. Using the feature data of the text information content, the application can detect text on the content dimension; at the same time, the application can acquire the behavioural data of a text as soon as the text is sent, and then splice the feature data of this behavioural data with the feature data of the text content to form new data for text detection. Compared with the lag of existing approaches such as spam account identification, the embodiments of the present application greatly improve the timeliness of spam text detection. The application can combine text content with behavioural feature data offline using historical data and train a unified machine learning model, which can then identify and detect text online; this also avoids the uncontrollable and unstable effects that manual experience introduces in some existing solutions. Therefore, with the embodiments of the present application, the behavioural data associated with a text during sending can be acquired when the text is sent and combined with the text content data as feature data, which can greatly improve the accuracy and timeliness of spam text detection and improve the security of text information. Detecting the combined feature data with a recognition model built in advance from the combined feature data of historical texts also makes the detection results more stable and reliable.
Based on the scenario described in the above embodiments, the application also provides another embodiment of the method. In this embodiment, during feature combination, the information content of the text and the behavioural data of the user associated with the text can each be converted into a corresponding feature vector, and the combined feature data is generated by combining them in vector form. Specifically, in another embodiment of the text detection method provided by the present application, the method may include:
Generating a text feature vector based on the information content of the text to be identified;
Acquiring the behavioural data of the user associated with the text to be identified, and generating a behavioural feature vector;
Combining data including the text feature vector and the behavioural feature vector in a predetermined manner to generate a combined feature vector;
Processing the combined feature vector with a pre-built combined feature recognition model, and determining the detection result of the text to be identified according to the processing result.
In this way, using vectors as the data format for extracting, converting, combining and training on text feature data and behavioural feature data can further simplify data processing and increase processing speed, thereby improving the efficiency of spam text detection and recognition.
The method described above is effective at identifying spam inquiries: it can improve the accuracy and recall of spam inquiry identification, reduce how often users receive spam inquiries, and lower the risk of users entering phishing pages. Therefore, the application also provides a method for identifying spam inquiries. As shown in Fig. 7, a schematic flowchart of a method for identifying spam inquiries provided by the present application, the method may specifically include:
S40: generating text feature data based on the information content of the inquiry to be identified;
S41: acquiring the behavioural data of the user associated with the inquiry to be identified, and generating behavioural feature data;
S42: combining data including the text feature data and behavioural feature data in a predetermined manner to generate combined feature data;
S43: processing the combined feature data with a spam inquiry recognition model built offline in advance, and judging, according to the processing result, whether the inquiry to be identified is a spam inquiry.
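Steps S40 to S43 can be read as a small pipeline. The following sketch wires the four steps together with pluggable components; `is_spam_inquiry`, the `spam_label` value and the three `toy_*` stand-ins (including the made-up dwell-time rule) are purely hypothetical and exist only to make the flow runnable, not to suggest how a real model decides.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def is_spam_inquiry(
    inquiry_text: str,
    behaviour_record: Dict[str, float],
    text_features: Callable[[str], Sequence[float]],
    behaviour_features: Callable[[Dict[str, float]], Sequence[float]],
    model: Callable[[Vector], str],
    spam_label: str = "spam",
) -> bool:
    """Run S40-S43 for one inquiry and report whether the model labels it as spam."""
    text_vec = text_features(inquiry_text)                # S40: text feature data
    behaviour_vec = behaviour_features(behaviour_record)  # S41: behavioural feature data
    combined = list(text_vec) + list(behaviour_vec)       # S42: combine by splicing
    return model(combined) == spam_label                  # S43: classify the combined data


# Toy stand-ins for demonstration only; a real system would use trained components.
def toy_text_features(text: str) -> Vector:
    return [float(len(text.split()))]


def toy_behaviour_features(record: Dict[str, float]) -> Vector:
    return [record["page_dwell_seconds"]]


def toy_model(vec: Vector) -> str:
    # Made-up rule: a dwell time under 2 seconds is treated as suspicious.
    return "spam" if vec[1] < 2.0 else "normal"


print(is_spam_inquiry("buy now", {"page_dwell_seconds": 1.0},
                      toy_text_features, toy_behaviour_features, toy_model))  # True
```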
Of course, this embodiment can also be applied in RFQ scenarios. In this way, on some business websites, identified spam inquiries and spam RFQs can be intercepted and tracked, improving the customer experience and reducing risks such as fraud and phishing.
The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the text detection method described above, the application also provides a text detection device. The device may include a system (including a distributed system), software (an application), a module, a component, a server, a client and the like that uses the method described in the present application, combined with the necessary implementing hardware. Based on the same innovative idea, the device in one embodiment provided by the present application is as described in the following embodiments. Since the way the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the foregoing method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Specifically, Fig. 8 is a schematic diagram of the module structure of an embodiment of the text detection device provided by the present application. As shown in Fig. 8, the device may include:
A text feature extraction module 101, which can be used to generate text feature data based on the information content of the text to be identified;
A behavioural feature extraction module 102, which can be used to acquire behavioural data of predefined types of the user associated with the text to be identified and generate behavioural feature data;
A feature combination module 103, which can be used to combine data including the text feature data and behavioural feature data in a predetermined manner to generate combined feature data;
A detection module 104, which can be used to process the combined feature data with a pre-built combined feature recognition model and determine the detection result of the text to be identified according to the processing result.
The combined feature recognition model can be generated in advance by training on historical text data. A machine learning model can be selected for training, and the historical data used in training may include the feature data of the text's information content and behavioural data. In a specific embodiment, the detection module 104 may include:
A model training module 1041, which can be used to acquire the historical data of identified texts, extract the text feature data of the identified texts, and acquire the behavioural feature data of predefined types of the users associated with the identified texts; it can also be used to combine the text feature data and behavioural feature data in the predetermined manner to produce sample feature data, and to train the selected machine learning model on the sample feature data to obtain the combined feature recognition model.
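The data preparation performed by the model training module 1041 can be sketched as turning historical records into sample feature data plus labels. The record keys `text`, `behaviour` and `label`, the function name `build_training_samples` and the toy extractors are assumptions for illustration; the application does not specify a storage format.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def build_training_samples(
    history: Sequence[Dict[str, Any]],
    text_features: Callable[[str], Sequence[float]],
    behaviour_features: Callable[[Any], Sequence[float]],
    combine: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
) -> Tuple[List[List[float]], List[Any]]:
    """Extract, combine and collect features for every labelled historical record."""
    samples, labels = [], []
    for record in history:
        text_vec = text_features(record["text"])
        behaviour_vec = behaviour_features(record["behaviour"])
        samples.append(list(combine(text_vec, behaviour_vec)))
        labels.append(record["label"])
    return samples, labels


# Toy history with hypothetical keys and made-up values.
history = [
    {"text": "buy now", "behaviour": [1.0, 1.0], "label": "spam"},
    {"text": "please send a quote", "behaviour": [30.0, 5.0], "label": "normal"},
]
X, y = build_training_samples(
    history,
    text_features=lambda text: [float(len(text.split()))],
    behaviour_features=lambda values: list(values),
    combine=lambda w, a: list(w) + list(a),
)
print(X)  # [[2.0, 1.0, 1.0], [4.0, 30.0, 5.0]]
print(y)  # ['spam', 'normal']
```

The resulting pair could then be handed to any classifier with a fit-style interface, for example a random forest implementation such as scikit-learn's `RandomForestClassifier().fit(X, y)`, though the application does not name a particular library.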
During training of the combined feature recognition model, the behavioural data of the predefined types may include data on the preset behaviour types that the sender of an identified text generates before and after the text is sent. Correspondingly, during online recognition and detection of text, acquiring the behavioural data of the predefined types for the text to be identified includes acquiring data on the preset behaviour types that the sender of the text to be identified generates before and after the text is sent.
As described in the foregoing method embodiments, text can be extracted and converted into corresponding text feature data and behavioural feature data in many ways. In one embodiment of the device, the text feature extraction module 101 acquiring the text feature data may specifically include:
Mapping the information content of the identified text to a high-dimensional space and generating a vector of length n, n >= 1;
And the behavioural feature extraction module 102 acquiring the behavioural feature data may specifically include:
Taking each predefined type as a vector dimension and the behavioural data of that predefined type as the coordinate value on the corresponding dimension, and generating a vector of length m, m >= 1.
The way text feature data and behavioural feature data are combined can be set according to different data processing needs, and different combinations can be used. In one embodiment of the device provided by the present application, the feature combination module 103 may include at least one of the following:
A feature splicing module 1031, which can be used to splice the n-dimensional text feature data with the m-dimensional behavioural feature data to generate combined feature data of length (n+m);
A dimension feature calculation module 1032, which can be used to operate on the values of the text feature data and behavioural feature data of the identified text on corresponding dimensions to obtain the combined feature data on those dimensions.
The feature combination module 103 can use either of the above modules to combine the text feature data and behavioural feature data.
The device can be used to identify spam inquiries, in which case the text to be identified may include inquiry information, including spam inquiries; it can also be applied in other scenarios, such as spam email detection, malicious comments, and RFQ business scenarios. In each implementation scenario, a machine learning model can be trained on combined feature data obtained by combining the text feature data and behavioural feature data of the historical data, and then used to recognise and detect the corresponding text. Therefore, the text to be identified may include at least one of the following text types:
Inquiry information, email information, comments, messages, RFQ information, instant messaging chat records, attachments.
Of course, other embodiments are also possible; for example, the combined feature data may also include account information data, credit data, login addresses and other information data used to identify and detect text.
The application also provides a device suitable for identifying spam inquiries. As shown in Fig. 9, a schematic diagram of the module structure of an embodiment of the inquiry detection device provided by the present application, the device may include:
A first extraction module 201, which can be used to generate text feature data based on the information content of the inquiry to be identified;
A second extraction module 202, which can be used to acquire the behavioural data of the user associated with the inquiry to be identified and generate behavioural feature data;
A feature combination module 203, which can be used to combine data including the text feature data and behavioural feature data in a predetermined manner to generate combined feature data;
An inquiry detection module 204, which can be used to process the combined feature data with a spam inquiry recognition model built offline in advance and judge, according to the processing result, whether the inquiry to be identified is a spam inquiry.
Of course, the spam inquiry recognition model can be built from historical inquiry data extracted over a period of time, including the inquiry text and the behavioural data before and after the inquiry was sent: text feature extraction and behavioural feature extraction are performed respectively, and the results are used to train a machine learning model such as a random forest.
The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the module-type device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The method or device described above can be implemented by a computer program combined with the necessary hardware and deployed in terminal devices such as mobile terminals, in servers, or in distributed systems. By combining text content with behavioural data, it can identify and detect texts such as spam inquiries, spam email, flooding messages and malicious comments more accurately, reliably and promptly. Therefore, the application also provides a text detection device, which may include a processor and a memory for storing processor-executable instructions; when executing the instructions, the processor may implement:
Generating text feature data based on the information content of the text to be identified;
Acquiring behavioural data of predefined types of the user associated with the text to be identified, and generating behavioural feature data;
Combining data including the text feature data and behavioural feature data in a predetermined manner to generate combined feature data;
Processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result of the text to be identified according to the processing result.
The method or device described in the above embodiments can implement the business logic as a computer program recorded on a storage medium, which a computer can read and execute to achieve the effects of the solutions described in the embodiments of the present application. Therefore, the application also provides a computer-readable storage medium storing computer instructions which, when executed, may implement the following steps:
Generating text feature data based on the information content of the text to be identified;
Acquiring behavioural data of predefined types of the user associated with the text to be identified, and generating behavioural feature data;
Combining data including the text feature data and behavioural feature data in a predetermined manner to generate combined feature data;
Processing the combined feature data with a pre-built combined feature recognition model, and determining the detection result of the text to be identified according to the processing result.
The computer-readable storage medium may include a physical device for storing information, typically by digitising the information and then storing it in an electrical, magnetic or optical medium. The computer-readable storage medium described in this embodiment may include: devices that store information electrically, such as various memories, for example RAM, ROM and USB flash drives; devices that store information magnetically, such as hard disks, floppy disks, magnetic tapes, magnetic core memory and magnetic bubble memory; and devices that store information optically, such as CDs or DVDs. Of course, there are other kinds of readable storage media, such as quantum memory and graphene memory.
It should be noted that, in line with the description of the related method embodiments, the device or electronic equipment described above may also include other implementations; for the specific implementation, refer to the description of the method embodiments, which is not repeated here. The embodiments in this specification are described progressively; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, the hardware-plus-program and storage-medium-plus-program embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The device or method described above can be used in a variety of business systems. It can effectively improve the accuracy of spam text identification and detect spam texts such as spam inquiries, spam email and malicious comments promptly and effectively, with better accuracy, stability and timeliness, improving the security of the text information environment in the business system and the user experience. Therefore, the application also provides a spam text detection system, which can be a standalone spam text detection system or be applied within various types of text service systems, such as an inquiry business system or a mail system. The spam text detection system may include the text detection device described in any one of the above embodiments.
The system can be a single server, a cluster of multiple application servers, or a service in a distributed system. Fig. 10 is a schematic structural diagram of an embodiment of a spam text detection system provided by the present application.
It should be noted that, in line with the description of the method embodiments, the system described above may also include other implementations; for the specific implementation, refer to the description of the related method embodiments, which is not repeated here.
The text detection method, device, and system provided by the present application use both text content and behavior data when detecting text. The feature data of the two types are combined so that a single machine learning model can be used for training and prediction. The application can detect text along the content dimension using the feature data of the text content; at the same time, it can obtain the behavior data associated with a text as soon as the text is sent, and then concatenate the feature data of that behavior data with the feature data of the text content to form new data for text detection. Compared with the lag of existing approaches such as spam account identification, embodiments of the application greatly improve the timeliness of spam text detection. The application can combine text content and behavioral feature data offline using historical data and train them with the same machine learning model, after which text can be identified and detected online. This also avoids the manual-experience intervention required in some existing implementations, along with the uncontrollable and destabilizing effects it brings. Therefore, with embodiments of the application, the behavior data associated with a text during sending can be obtained as soon as the text is sent and combined with the feature data of the text content, which can greatly improve the accuracy and timeliness of spam text detection and the security of text information. In addition, the combined feature data is checked by a recognition model built in advance from combined feature data of historical texts, which makes the detection results more stable and reliable.
Although this application mentions descriptions of data definition, acquisition, interaction, calculation, and judgment, and of embodiments, such as mapping text into a high-dimensional space with doc2vec, building the training model with a random forest, behavior data of types such as page dwell time, mouse movement speed, and account login count, and the manner of combining text features with behavior data, the application is not limited to cases that must conform to industry programming language standards, standard data models or algorithms, general computer processing and storage rules, or the situations described in the embodiments. Embodiments slightly modified on the basis of certain industry standards, or of implementations described in a custom manner or in the embodiments, can also achieve implementation effects that are the same as, equivalent to, similar to, or predictable variations of the above embodiments. Embodiments obtained by applying such modified or varied data acquisition, storage, judgment, and processing methods may still fall within the scope of optional embodiments of the application.
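As an illustration of the feature types named above, the sketch below builds a text vector and a behavior vector and combines them. It rests on several assumptions that are not from the application: feature hashing is used as a simple stand-in for doc2vec, the behavior field names in `BEHAVIOR_TYPES` are hypothetical, and addition is an arbitrary choice for the per-dimension operation.

```python
import hashlib

# Hypothetical behavior types; each one becomes a vector dimension.
BEHAVIOR_TYPES = ("page_dwell_seconds", "mouse_speed_px_per_s", "account_login_count")

def text_vector(text, n=8):
    """Map text to an n-dim count vector by feature hashing (stand-in for doc2vec)."""
    vec = [0.0] * n
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % n] += 1.0
    return vec

def behavior_vector(behavior):
    """Use each behavior type as a dimension and its value as the coordinate."""
    return [float(behavior.get(name, 0.0)) for name in BEHAVIOR_TYPES]

def concatenate(text_vec, behavior_vec):
    """Splice an n-dim and an m-dim vector into one (n+m)-dim vector."""
    return text_vec + behavior_vec

def elementwise(text_vec, behavior_vec, op=lambda a, b: a + b):
    """Combine two equal-length vectors dimension by dimension."""
    if len(text_vec) != len(behavior_vec):
        raise ValueError("element-wise combination needs equal-length vectors")
    return [op(a, b) for a, b in zip(text_vec, behavior_vec)]

t = text_vector("cheap watches cheap prices", n=8)
b = behavior_vector({"page_dwell_seconds": 1.5, "account_login_count": 40})
combined = concatenate(t, b)
print(len(t), len(b), len(combined))
```

Concatenation keeps every original coordinate and works for any n and m, whereas a per-dimension operation requires the two vectors to have the same length.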
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is now mostly done with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can be readily obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices it includes for implementing various functions can also be regarded as structures within the hardware component. Alternatively, the devices for implementing various functions can even be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, an in-vehicle human-machine interaction device, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Although the application provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual device or end product executes them, the steps may be executed sequentially or in parallel according to the methods shown in the embodiments or drawings (for example, in a parallel-processor or multithreaded environment, or even a distributed data processing environment). The terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, the existence of other identical or equivalent elements in the process, method, product, or device that includes the element is not excluded.
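As one illustration of parallel execution, text feature extraction and behavior feature extraction do not consume each other's output as described here, so they could be submitted to a thread pool and joined before combination. The sketch below is only an assumption about how that might look; both extractor functions are hypothetical placeholders, not the feature extraction of the embodiments.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_text_features(text):
    """Hypothetical placeholder: character count and word count."""
    return [float(len(text)), float(len(text.split()))]

def extract_behavior_features(behavior):
    """Hypothetical placeholder: behavior values in a fixed order."""
    return [float(behavior["dwell_seconds"]), float(behavior["login_count"])]

def build_combined_features(text, behavior):
    """Run both extraction steps concurrently, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text_features, text)
        behavior_future = pool.submit(extract_behavior_features, behavior)
        # result() blocks until each step has finished.
        return text_future.result() + behavior_future.result()

features = build_combined_features("buy now", {"dwell_seconds": 2, "login_count": 5})
print(features)
```

The combined vector is the same as in sequential execution; only the order in which the two independent steps run is left open.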
For convenience of description, the above device is described with its functions divided into various modules. Of course, when implementing the application, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or a module implementing a given function may be implemented by a combination of multiple submodules or subunits. The device embodiments described above are merely illustrative. For example, the division of units is only a division by logical function; in actual implementation there may be other divisions, such as combining multiple units or components or integrating them into another system, or ignoring or not executing some features. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Those skilled in the art should understand that embodiments of the application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments. In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the application. In this specification, schematic statements of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine different embodiments or examples described in this specification and the features of different embodiments or examples.
The above are merely embodiments of the application and are not intended to limit it. For those skilled in the art, various modifications and changes to the application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall fall within the scope of its claims.
Claims (20)
1. A text detection method, characterized in that the method comprises:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined-feature recognition model, and determining a detection result for the text to be identified according to the processing result.
2. The text detection method of claim 1, characterized in that the combined-feature recognition model is built in the following manner:
obtaining historical data of identified texts, extracting text feature data of the identified texts, and obtaining behavioral feature data of a predetermined type of the users associated with the identified texts;
combining the text feature data and the behavioral feature data in the predetermined manner to generate sample feature data;
training a selected machine learning model on the sample feature data to obtain the combined-feature recognition model.
3. The text detection method of claim 2, characterized in that the behavior data of the predetermined type includes data of preset behavior types generated by the sender of the identified text before and after the identified text is sent.
4. The text detection method of claim 2, characterized in that the combined feature data further includes at least one of the following:
account information data, credit data, entry address.
5. The text detection method of claim 2, characterized in that the text feature data includes a vector of length n generated by mapping the information content of the identified text into a high-dimensional space, n ≥ 1;
the behavioral feature data includes a vector of length m generated by taking the predetermined types as vector dimensions and the values of the behavior data of the predetermined types as the coordinate values on the corresponding vector dimensions, m ≥ 1.
6. The text detection method of claim 5, characterized in that combining the text feature data and the behavioral feature data of the identified text in the predetermined manner comprises:
concatenating the text feature data with the behavioral feature data to generate combined feature data of length (n+m).
7. The text detection method of claim 5, characterized in that combining the text feature data and the behavioral feature data of the identified text in the predetermined manner comprises:
performing an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions to obtain the combined feature data in those dimensions.
8. The text detection method of any one of claims 1-7, characterized in that the text to be identified includes at least one of the following text types:
inquiry information, e-mail information, comments, messages, RFQ information, instant-messaging chat records, attachments.
9. A spam inquiry detection method, characterized by comprising:
generating text feature data based on the information content of an inquiry to be identified;
obtaining behavior data of a user associated with the inquiry to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a spam inquiry recognition model built offline in advance, and judging whether the inquiry to be identified is a spam inquiry according to the processing result.
10. A text detection device, characterized in that the device comprises:
a text feature extraction module, configured to generate text feature data based on the information content of a text to be identified;
a behavioral feature extraction module, configured to obtain behavior data of a predetermined type of a user associated with the text to be identified, and to generate behavioral feature data;
a feature combination module, configured to combine data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
a detection module, configured to process the combined feature data using a pre-built combined-feature recognition model, and to determine a detection result for the text to be identified according to the processing result.
11. The text detection device of claim 10, characterized in that the detection module comprises:
a model training module, configured to obtain historical data of identified texts and to obtain behavioral feature data of a predetermined type of the users associated with the identified texts; further configured to combine the text feature data and the behavioral feature data in the predetermined manner to generate sample feature data; and to train a selected machine learning model on the sample feature data to obtain the combined-feature recognition model.
12. The text detection device of claim 11, characterized in that the behavior data of the predetermined type includes data of preset behavior types generated by the sender of the identified text before and after the identified text is sent.
13. The text detection device of claim 12, characterized in that the text feature data obtained by the text feature extraction module includes:
a vector of length n generated by mapping the information content of the identified text into a high-dimensional space, n ≥ 1;
and the behavioral feature data obtained by the behavioral feature extraction module includes:
a vector of length m generated by taking the predetermined types as vector dimensions and the values of the behavior data of the predetermined types as the coordinate values on the corresponding vector dimensions, m ≥ 1.
14. The text detection device of claim 13, characterized in that the feature combination module includes at least one of the following:
a feature concatenation module, configured to concatenate the n-dimensional text feature data with the m-dimensional behavioral feature data to generate combined feature data of length (n+m);
a per-dimension feature calculation module, configured to perform an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions to obtain the combined feature data in those dimensions.
15. The text detection device of claim 11, characterized in that the text to be identified includes at least one of the following text types:
inquiry information, e-mail information, comments, messages, RFQ information, instant-messaging chat records, attachments.
16. A text detection device, characterized by comprising:
a first extraction module, configured to generate text feature data based on the information content of an inquiry to be identified;
a second extraction module, configured to obtain behavior data of a user associated with the inquiry to be identified, and to generate behavioral feature data;
a feature combination module, configured to combine data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
an inquiry detection module, configured to process the combined feature data using a spam inquiry recognition model built offline in advance, and to judge whether the inquiry to be identified is a spam inquiry according to the processing result.
17. A text detection device, characterized by comprising a processor and a memory for storing processor-executable instructions, the processor, when executing the instructions, implementing:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of a predetermined type of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined-feature recognition model, and determining a detection result for the text to be identified according to the processing result.
18. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed, implement the following steps:
generating text feature data based on the information content of a text to be identified;
obtaining behavior data of a predetermined type of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined-feature recognition model, and determining a detection result for the text to be identified according to the processing result.
19. A spam text detection system, characterized by comprising the text detection device of any one of claims 10 to 17.
20. A text detection method, characterized in that the method comprises:
generating a text feature vector based on the information content of a text to be identified;
obtaining behavior data of a user associated with the text to be identified, and generating a behavioral feature vector;
combining data including the text feature vector and the behavioral feature vector in a predetermined manner to generate a combined feature vector;
processing the combined feature vector using a pre-built combined-feature recognition model, and determining a detection result for the text to be identified according to the processing result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710549655.8A CN109213859A (en) | 2017-07-07 | 2017-07-07 | A kind of Method for text detection, apparatus and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109213859A true CN109213859A (en) | 2019-01-15 |
Family
ID=64991074
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710549655.8A Pending CN109213859A (en) | 2017-07-07 | 2017-07-07 | A kind of Method for text detection, apparatus and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109213859A (en) |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109213859A (en) | A kind of Method for text detection, apparatus and system | |
Salminen et al. | Creating and detecting fake reviews of online products | |
Wu et al. | Ai-generated content (aigc): A survey | |
Wu et al. | OpinionSeer: interactive visualization of hotel customer feedback | |
CN103377262B (en) | Method and apparatus for grouping users | |
CN107153641A (en) | Comment information determination method, device, server and storage medium | |
CN108701118A (en) | Semantic category classification | |
CN107977415A (en) | Automatic question-answering method and device | |
CN108345692A (en) | Automatic question-answering method and system | |
CN110019812A (en) | User-generated content detection algorithm and system | |
CN106776936A (en) | Intelligent interaction method and system | |
CN108416028A (en) | Method, apparatus and server for searching content resources | |
CN106886518A (en) | Microblog account classification method | |
CN107657056A (en) | Method and apparatus for displaying comment information based on artificial intelligence | |
CN102930048B (en) | Data enrichment using automatically discovered references to semantic and visual data | |
CN109783539A (en) | Usage mining and its model building method, device and computer equipment | |
CN108256537A (en) | User gender prediction method and system | |
KR20200143991A (en) | Answer recommendation system and method based on text content and emotion analysis | |
CN110197389A (en) | User identification method and device | |
CN110362663A (en) | Adaptive multi-perception similarity detection and parsing | |
CN108304374A (en) | Information processing method and related product | |
Radovanović et al. | Review spam detection using machine learning | |
Ishikawa et al. | Audio-visual hybrid approach for filling mass estimation | |
KR20210058525A (en) | Method and device for classifying unstructured item data automatically for goods or services | |
CN116861258A (en) | Model processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115 |