CN109213859A - Text detection method, apparatus, and system - Google Patents

Publication number
CN109213859A
Authority
CN
China
Prior art keywords
text
data
inquiry
identified
behavioural characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710549655.8A
Other languages
Chinese (zh)
Inventor
汤佳宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710549655.8A
Publication of CN109213859A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0203 Market surveys; Market polls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

The embodiments of the present application disclose a text detection method, apparatus, and system. The method includes: generating text feature data based on the information content of a text to be identified; obtaining behavioural data of predefined types for the user associated with the text to be identified, and generating behavioural feature data; combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data; and processing the combined feature data with a pre-built combined feature identification model, and determining the detection result for the text to be identified from the processing result. The embodiments of the application can improve the accuracy of spam text identification and can detect spam texts such as spam inquiries, spam emails, and malicious comments promptly and effectively, with better accuracy, stability, and timeliness, which improves the security of the text information environment.

Description

Text detection method, apparatus, and system
Technical field
The present application belongs to the field of computer data processing technology, and in particular relates to a text detection method, apparatus, and system.
Background
With the rapid development and spread of Internet technology, e-commerce websites are becoming more varied and their business content richer. At present, the inquiry is an important means of communication between the two parties to a transaction on an e-commerce website: it lets buyers and sellers promote products effectively or learn the other party's business needs.
An inquiry usually refers to a buyer asking a seller, through a message on an e-commerce website, about product-related matters such as price and specifications. An inquiry generally contains no more than 200 words or phrases and therefore counts as short text; other common short-text types include comments, messages, and SMS messages. An inquiry can be sent to the other party by email, instant messaging tools, and similar channels. However, in service environments that carry inquiries or inquiry-like content, such as e-commerce websites, email, and RFQ (short for Request for Quotation, a business model in which a buyer publishes a detailed description of its procurement needs to an open market so that sellers can find the buyer and submit quotations), there are usually large numbers of spam inquiries. These interfere with the information users receive and bring risks such as loss of funds, account compromise, and information leakage. A spam inquiry usually refers to an inquiry sent by a buyer to a seller that has no real business meaning for the seller. Spam inquiries come in many types, mainly text spam inquiries, phishing inquiries, and advertising inquiries. A phishing inquiry in particular uses disguise to trick the recipient into replying with information such as account names and passwords to a specified recipient, or leads the recipient to a specially crafted web page. Such pages usually pose as a genuine site, such as a bank or wealth-management page, so that the visitor believes the page is real; when the visitor logs in on such a page, the account name and password are stolen.
Existing methods for identifying spam inquiries are mainly based on the text content of the inquiry, for example a naive Bayes scheme. This approach can identify purely text-based spam inquiries to some extent, but phishing and fraud inquiries are hard to distinguish by text because their content is very similar to that of normal inquiries. For phishing and fraud inquiries, the usual practice is first to identify spam accounts through detection and judgement strategies and then to trace spam inquiries from those accounts. This method needs behavioural data accumulated over a certain period, so it suffers from lag.
In the prior art, spam inquiry identification usually builds a separate model for each type of spam inquiry. The detection approach is one-dimensional, and the results are both partial (for example, the text-based method above can identify spam by content but cannot identify phishing inquiries) and delayed. As a result, overall spam inquiry identification currently has low accuracy and poor effect, which degrades user experience and the security of inquiry information.
Summary of the invention
The purpose of the present application is to provide a text detection method, apparatus, and system that can improve the accuracy of spam text identification and detect spam texts such as spam inquiries, spam emails, and malicious comments promptly and effectively, with better accuracy, stability, and timeliness, thereby improving the security of the text information environment.
The text detection method, apparatus, and system provided by the present application are implemented as follows:
A text detection method, the method comprising:
generating text feature data based on the information content of a text to be identified;
obtaining behavioural data of predefined types for the user associated with the text to be identified, and generating behavioural feature data;
combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature identification model, and determining the detection result for the text to be identified from the processing result.
A spam inquiry detection method, comprising:
generating text feature data based on the information content of an inquiry to be identified;
obtaining behavioural data of predefined types for the user associated with the inquiry to be identified, and generating behavioural feature data;
combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a spam inquiry identification model built offline in advance, and judging from the processing result whether the inquiry to be identified is a spam inquiry.
A text detection apparatus, the apparatus comprising:
a text feature extraction module, configured to generate text feature data based on the information content of a text to be identified;
a behavioural feature extraction module, configured to obtain behavioural data of predefined types for the user associated with the text to be identified and to generate behavioural feature data;
a feature combination module, configured to combine data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
a detection module, configured to process the combined feature data with a pre-built combined feature identification model and to determine the detection result for the text to be identified from the processing result.
A text detection apparatus, comprising:
a first extraction module, configured to generate text feature data based on the information content of an inquiry to be identified;
a second extraction module, configured to obtain behavioural data of predefined types for the user associated with the text to be identified and to generate behavioural feature data;
a feature combination module, configured to combine data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
an inquiry detection module, configured to process the combined feature data with a spam inquiry identification model built offline in advance and to judge from the processing result whether the inquiry to be identified is a spam inquiry.
A text detection apparatus, comprising a processor and a memory for storing instructions executable by the processor, wherein the processor, when executing the instructions, implements:
generating text feature data based on the information content of a text to be identified;
obtaining behavioural data of predefined types for the user associated with the text to be identified, and generating behavioural feature data;
combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature identification model, and determining the detection result for the text to be identified from the processing result.
A computer-readable storage medium storing computer instructions that, when executed, implement the following steps:
generating text feature data based on the information content of a text to be identified;
obtaining behavioural data of predefined types for the user associated with the text to be identified, and generating behavioural feature data;
combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
processing the combined feature data with a pre-built combined feature identification model, and determining the detection result for the text to be identified from the processing result.
A spam text detection system, comprising the text detection apparatus described in any one of the embodiments of this specification.
The text detection method, apparatus, and system provided by the present application use both text content and behavioural data when detecting a text. The feature data of the two types are combined, so a single machine learning model can be used for training and prediction. The application can detect text along the content dimension using the feature data of the text's information content. In addition, the behavioural data of a text can be obtained as soon as the text is sent (which may include a period before and after sending); the feature data of this behavioural data is then combined with the feature data of the text content to form new text detection data. Compared with the lag of existing approaches such as spam account identification, the embodiments of the application greatly improve the timeliness of spam text identification and detection. The application can combine text content and behavioural feature data offline using historical data, train on them with the same machine learning model, and then identify and detect texts online. This also avoids the manual experience that some existing implementations rely on, and the uncontrollable, unstable identification results that it brings. Therefore, with the embodiments of the application, the behavioural data associated with a text can be obtained while the text is being sent and combined as feature data with the data of the text content, which can greatly improve the accuracy and timeliness of spam text detection and the security of text information. Moreover, the combined feature data is evaluated by an identification model built in advance from the combined feature data of historical texts, which makes the text detection results more stable and reliable.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in the application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the text detection method described in the present application;
Fig. 2 is a schematic diagram of one way of training and building the combined feature identification model in the method described in the present application;
Fig. 3 is a schematic diagram of a process for offline training and building of the combined feature identification model in the method described in the present application;
Fig. 4 is a schematic diagram of one embodiment of combining text feature data and behavioural feature data provided by the present application;
Fig. 5 is a schematic diagram of another embodiment of combining text feature data and behavioural feature data provided by the present application;
Fig. 6 is a schematic diagram of the process of identifying spam inquiries in one embodiment of the present application;
Fig. 7 is a schematic flowchart of a method for identifying spam inquiries provided by the present application;
Fig. 8 is a schematic diagram of the module structure of an embodiment of a text detection apparatus provided by the present application;
Fig. 9 is a schematic diagram of the module structure of an embodiment of an inquiry detection apparatus provided by the present application;
Fig. 10 is a schematic diagram of the module structure of an embodiment of a text detection apparatus provided by the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the application without creative effort fall within the scope of protection of the application.
Fig. 1 is a schematic flowchart of an embodiment of the text detection method described in the present application. Although the application provides method steps or apparatus structures as shown in the following embodiments or drawings, the method or apparatus may include more steps or module units, or fewer after some are merged, on the basis of routine work or without creative effort. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to the execution order or module structure shown in the embodiments or drawings of the application. When the method or module structure is applied in an actual apparatus, server, or end product, it may be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in an environment with parallel processors or multi-threaded processing, or even in a distributed processing or server cluster environment).
Existing spam inquiries mainly take the form of spam inquiry texts, phishing inquiries, advertising inquiries, and the like sent to a large number of sellers as web pages. This "cast a wide net" approach seriously degrades user experience and information security. The embodiments provided by the application take into account that fraud and phishing inquiries behave differently from normal inquiries during the sending process, so behavioural data can be used as soon as the inquiry is sent and combined with the feature data of the text content to detect the text. The embodiments are described below using the identification of web page spam inquiries as the application scenario. Those skilled in the art will understand, however, that the substance of this solution can be applied to other text detection scenarios, such as spam email identification, malicious comments or malicious messages on pages or in instant messaging tools, and RFQ business scenarios. Such alternatives are not described further below, and their applicability to other scenarios is not repeated here.
In one implementation of the text detection method provided by the application, historical inquiry data can be fed offline, in advance, into a selected machine learning model for training, building a text identification model based on the combined features of text content and behavioural data. The model can then be used to identify and detect a text to be identified and output the detection result. Specifically, as shown in Fig. 1, one implementation of the text detection method may include:
S1: generating text feature data based on the information content of the text to be identified;
S2: obtaining behavioural data of predefined types for the user associated with the text to be identified, and generating behavioural feature data;
S3: combining data that includes the text feature data and the behavioural feature data in a predetermined manner to generate combined feature data;
S4: processing the combined feature data with the pre-built combined feature identification model, and determining the detection result for the text to be identified from the processing result.
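Steps S1 to S4 can be sketched roughly as below. This is only an illustration under stated assumptions: the function names, the two toy text features, the behaviour field names `dwell_seconds` and `pages_viewed`, and the hand-written `toy_model` rule are all hypothetical and stand in for the learned features and trained model that the application describes.

```python
from typing import Callable, Dict, List

def text_features(content: str) -> List[float]:
    """S1: toy text features (word count, link count); a stand-in for a learned embedding."""
    words = content.split()
    links = sum(1 for w in words if w.startswith("http"))
    return [float(len(words)), float(links)]

def behaviour_features(behaviour: Dict[str, float]) -> List[float]:
    """S2: behaviour values read out in a fixed, predefined order (hypothetical fields)."""
    return [float(behaviour["dwell_seconds"]), float(behaviour["pages_viewed"])]

def combine(text_vec: List[float], behaviour_vec: List[float]) -> List[float]:
    """S3: combine by concatenation, text features first (one possible predetermined manner)."""
    return list(text_vec) + list(behaviour_vec)

def detect(content: str, behaviour: Dict[str, float],
           model: Callable[[List[float]], float], threshold: float = 0.75) -> bool:
    """S4: score the combined vector and turn the score into a detection result."""
    combined = combine(text_features(content), behaviour_features(behaviour))
    return model(combined) >= threshold

def toy_model(features: List[float]) -> float:
    """Hypothetical scoring rule, NOT a trained model: a link and a very short dwell time."""
    _, links, dwell, _ = features
    score = 0.0
    if links > 0:
        score += 0.5
    if dwell < 3:
        score += 0.5
    return score

suspicious = detect("please verify your account at http://example.test now",
                    {"dwell_seconds": 1.0, "pages_viewed": 1.0}, toy_model)
normal = detect("what is the unit price for 500 pieces",
                {"dwell_seconds": 40.0, "pages_viewed": 3.0}, toy_model)
```

With the threshold chosen here, neither the text signal nor the behaviour signal alone is enough to flag an inquiry, which mirrors the idea of judging on the combined features.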
In general, a text may be a piece of data that records and stores textual information, and it may carry its own text label to distinguish it from other text data sets. In the application scenario of this embodiment, the text may include information content such as comments, messages, and SMS messages. The text can be transmitted by email, instant messaging tools, resource packages, and similar means to exchange information between sender and recipient; for example, in one implementation scenario an inquiry in page format can be sent to the recipient as an on-site message. Note that the term "text" as used here does not restrict the content to data in the form of words and characters. In other application scenarios, the text may also include, without limitation, data in formats such as images, sound, and video; for example, the information content of an inquiry may include an advertisement in image form, a Flash animation that plays for 5 seconds on a timer, or a link to a merchant page.
In addition, spam inquiry identification in this embodiment's application scenario is not limited to the conventional case in which a buyer sends an inquiry to a seller. An inquiry in this embodiment may refer to the business act in which a party preparing to buy or sell certain goods asks a potential supplier or buyer about the possibility or terms of a transaction in those goods. Therefore, the texts identified and detected may also include inquiries that a seller sends to a seller.
In this embodiment, a machine learning model can first be trained on historical text data; training yields a combined feature identification model that detects texts to be identified online. A machine learning model usually refers to the set of processing logic by which a computer applies machine learning algorithms to existing data to train a model and then uses the model to make predictions on new data and output results. This embodiment can be implemented with various models such as random forests, logistic regression models, and convolutional neural networks. In general, in machine learning the process of running existing data through a machine learning algorithm is called training; the result of that process, which can be used to make predictions on new data, is usually called the model; and applying it to new data is called prediction.
The historical data used for model training may include data on the specific information content of the texts and the behavioural data of the users corresponding to the texts. Features can then be extracted from both kinds of data and spliced together to produce sample data for training the machine learning model. In one specific embodiment, the combined feature identification model may be built as follows:
S30: obtaining historical data of identified texts, extracting the text feature data of the identified texts, and obtaining the behavioural feature data of predefined types for the users associated with the identified texts;
S31: splicing the text feature data and the behavioural feature data in the predetermined manner to produce sample feature data;
S32: training the selected machine learning model on the sample feature data to obtain the combined feature identification model.
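As a rough sketch of S30 to S32, the block below splices toy text and behaviour vectors and trains a small logistic regression written in plain Python. Logistic regression is one of the model families the application mentions; this is not the random forest used in the described embodiment. The `history` records, the feature values scaled to [0, 1], and all function names are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple

def splice(text_vec: Sequence[float], behaviour_vec: Sequence[float]) -> List[float]:
    """S31: splice the two feature vectors into one sample vector (text first)."""
    return list(text_vec) + list(behaviour_vec)

def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(samples: List[List[float]], labels: List[int],
                   epochs: int = 500, lr: float = 0.1) -> Tuple[List[float], float]:
    """S32: fit weights and bias with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(model: Tuple[List[float], float], x: Sequence[float]) -> float:
    """Return the estimated probability that a sample is spam."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# S30: hypothetical historical records (text vector, behaviour vector, label 1 = spam).
history = [
    ([0.9, 0.8], [0.1], 1),
    ([0.8, 0.9], [0.2], 1),
    ([0.1, 0.2], [0.9], 0),
    ([0.2, 0.1], [0.8], 0),
]
samples = [splice(t, b) for t, b, _ in history]
labels = [y for _, _, y in history]
model = train_logistic(samples, labels)
```

A new sample is scored by splicing its two vectors in the same order and calling `predict(model, splice(text_vec, behaviour_vec))`; the same splicing order has to be used for training and for prediction.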
Fig. 2 is a schematic diagram of one way of training and building the combined feature identification model in the method described in the application. A combined feature identification model built in this way from historical data containing both text content feature data and behavioural data can be used to detect a text to be identified and determine its category. The historical data may be data collected over a period of time, and it may also include data collected and stored in real time; for example, each day's inquiries can be obtained in real time and stored in a database to update the historical data, so that the combined feature identification model can be trained further and its parameters tuned to improve classification accuracy. The text feature data can be extracted from the historical data. In one implementation scenario the behavioural feature data can also be obtained from the historical data; in other embodiments the behavioural feature data of the predefined types can be obtained from another database, for example user operation data collected through event-tracking points.
In the specific implementation scenario of the application, a model for identifying spam inquiries can be built offline before spam inquiries are detected online. The model can be a machine learning model trained on historical inquiry data; online, it makes predictions on newly received inquiries to be identified and outputs their detection results. As noted above, a machine learning model is the processing logic obtained by training on existing data with a machine learning algorithm and then used to make predictions on new data. In this embodiment a random forest can be used as the model trained on historical inquiry data. In other implementation scenarios, other machine learning models can be chosen according to processing needs or the actual deployment, such as a logistic regression model (LR, Logistic Regression), a neural network, or a statistical model such as a scorecard model.
In this embodiment's application scenario of web page spam inquiries, historical inquiry data for a period of time can first be extracted from a database. This data may include the text content of each inquiry and the behavioural data from before and after the inquiry was sent. The behavioural data may include page operation data, or other data of specified types, captured before and after the inquiry is sent. (For ease of description, "before and after sending" here covers any of the periods before the inquiry is sent, while it is being sent, and after it is sent; the same applies in other scenarios, such as before and after a text, a text to be identified, or a message is sent.) In this embodiment the behavioural data of the predefined types may include data on preset behaviour types generated by the sender of the identified text before and after the text is sent. Examples include the total time spent on the page before and after sending, mouse movement speed, mouse route (the track of the mouse), account login path, the time interval between consecutive inquiries sent by the account, and browsing path.
Among the predefined types of behavioural data above, one or more types may include only data generated before the text (for example, an inquiry) is sent, or only data generated after it is sent, while some types may include data generated across the periods before and after sending together. In the scenario of sending an inquiry, for example, one predefined type of behavioural data may be the total page dwell time from entering the inquiry editing page until leaving the page after the inquiry is sent. In another example, the behavioural data may include IP address change information, since a sender may deliberately change IP address at fixed intervals or after sending a fixed number of inquiries. The types of behavioural data to be obtained can be set in advance, and the behavioural data of those types can then be extracted once the historical inquiry data has been obtained.
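Two of the behaviour types mentioned above, the total page dwell time and the interval between consecutive inquiries, could be derived from raw event records along the following lines. The event names (`enter_edit_page`, `send_inquiry`, `leave_page`) and the record layout of (timestamp in seconds, event name) are assumptions made for this sketch; the application does not specify a log format.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, hypothetical event name)

def page_dwell_seconds(events: List[Event]) -> float:
    """Total time from entering the inquiry editing page to leaving the page after sending."""
    enter = min(t for t, name in events if name == "enter_edit_page")
    leave = max(t for t, name in events if name == "leave_page")
    return leave - enter

def send_intervals(send_times: List[float]) -> List[float]:
    """Time gaps between consecutive inquiries sent by one account."""
    ordered = sorted(send_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

events = [(100.0, "enter_edit_page"), (112.0, "send_inquiry"), (115.0, "leave_page")]
dwell = page_dwell_seconds(events)             # spans the periods before and after sending
gaps = send_intervals([30.0, 10.0, 20.0])      # timestamps may arrive unordered
```

Very regular or very short gaps are the kind of signal that the behaviour feature vector can carry into the combined model.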
In some implementation scenarios, if the historical inquiry data does not contain this behavioural data, the types of behavioural data to collect can be set in advance, and the model can then be trained on the historical data after it has been collected for a period of time.
After the historical inquiry data is obtained, it can be sorted into two classes: behavioural data from the inquiry sending process and inquiry text content, as shown in Fig. 3, which is a schematic diagram of a process for offline training and building of the combined feature identification model in the method described in the application. Feature data can then be extracted separately from the inquiry text content data and from the behavioural data. For the inquiry text content data, text feature data can be generated from the specific information content of the inquiry text. In one embodiment provided by the application, the inquiry content can be mapped to a high-dimensional space, for example with word2vec or doc2vec, and then represented by a vector of length n, such as [w1, w2, w3, ..., wn]. Specifically, in one embodiment of the method, the text feature data includes:
S101: a vector of length n, n ≥ 1, generated by mapping the information content of the identified text to a high-dimensional space.
Doc2vec is a method developed on the basis of word2vec. It maps a sentence or an article to a high-dimensional space and represents a piece of text as a real-valued vector, where each value of the vector represents the coordinate in that dimension. The approach can use deep learning: through training, the processing of text content is reduced to vector operations in an n-dimensional vector space, and similarity in the vector space can be used to represent semantic similarity between texts. The embodiments of the application can therefore use word2vec or doc2vec to extract text feature data from the inquiry text content. Such text feature data reflects the information content of the inquiry to some extent and can be used to identify spam inquiries based on text content. Each inquiry yields its own text feature data. The embodiments of the application choose doc2vec to map text to a multi-dimensional space; other embodiments may choose other mapping or text feature extraction methods, such as a vector space model or statistical feature extraction.
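To make the idea of a fixed-length text vector [w1, w2, ..., wn] concrete, the sketch below uses feature hashing over whitespace tokens, which is a simple vector space model and not doc2vec: it captures token overlap rather than learned semantics. The function names and the choice of n = 16 are assumptions for illustration only.

```python
import hashlib
import math
from typing import List, Sequence

def text_to_vector(content: str, n: int = 16) -> List[float]:
    """Map text to a length-n vector by hashing each token to one of n dimensions."""
    vec = [0.0] * n
    for token in content.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n  # stable across runs, unlike hash()
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; similarity in the vector space stands in for text similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

inquiry_vec = text_to_vector("What is the unit price for 500 pieces")
same_vec = text_to_vector("what is the unit price for 500 pieces")
```

Because tokens are lower-cased before hashing, the two inquiries above map to the same vector. A trained doc2vec model would additionally place differently worded but semantically similar inquiries close together, which hashing cannot do.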
For behavioral data, the behavioural characteristic data of certain data format for setting can be arranged.Specific row The format for being characterized data can be configured according to data processing needs and/or real time environment.If such as in the present embodiment It needs the behavioral data of m kind predefined type to participate in calculating, to identify rubbish inquiry, then can set behavioural characteristic data to Length is the feature vector of m dimension, such as [a1, a2, a3 ..., am].Wherein each dimension can represent the row of a predefined type For data, specific value can be obtained according to conversion, the calculating etc. of behavioral data.Therefore, one embodiment of the application In, the behavioral data may include:
S201: taking each predefined type as a vector dimension and the value of the behavioral data of that predefined type as the coordinate value on the corresponding vector dimension, generating a vector of length m, m >= 1.
In a specific example, if the obtained behavioral data shows that the page dwell time before an inquiry was sent is 5 seconds, the number of pages browsed on the website is 3, and the mouse movement speed is 1 meter per second, the generated behavioral feature data can be a feature vector of length 3, [5, 3, 1]. Each vector dimension in the behavioral feature data can be one predefined type. For example, in [a1, a2, a3, ..., am], dimension a1 can represent the page dwell time before the inquiry was sent, a2 the total number of pages opened in the current browser, and so on. In one embodiment, the behavioral data extracted from historical data has a mapping relationship with the behavioral feature data: the value of each vector dimension can be taken directly from the value of the corresponding behavioral data, or the behavioral feature data of the text can be generated from the behavioral data through deformation, transformation, calculation, and the like. For example, if a piece of behavioral data records the sender's account login mode, it can be recorded as feature vector dimension a4: if the login mode is "account password login", the value of a4 is 0; if the login mode is "QR code scan login", the value of a4 is 1. By analogy, behavioral data of any given type can be converted into corresponding behavioral feature data.
In addition, the values of the predefined types of behavioral data can be arranged in a preset order within the behavioral feature data. As shown in Fig. 4, the order can be [page dwell time, total pages browsed on the website, mouse movement speed, account login mode], or, according to design requirements, [account login mode, total pages browsed on the website, mouse movement speed, page dwell time]. In this embodiment, different orderings of the vector dimensions in the feature vector produce different behavioral feature data. Correspondingly, different combined feature vectors are produced in the subsequent feature combination. In this way, multiple sets of combined feature data can be derived from the same text for model training or online text detection.
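The conversion of behavioral data into an m-dimensional vector, including the numeric coding of the login mode and the configurable ordering of dimensions, can be sketched as follows. The dictionary keys, the `LOGIN_MODE_CODES` table, and the function name `behavior_to_vector` are hypothetical names chosen for this sketch; only the example values (5 seconds, 3 pages, 1 meter per second, login mode coded as 0 or 1) come from the description.

```python
# Hypothetical mapping from login mode to the value of dimension a4.
LOGIN_MODE_CODES = {"account_password": 0, "qr_code_scan": 1}

# Two possible dimension orders; either one is a design choice.
ORDER_DEFAULT = ["dwell_seconds", "pages_browsed", "mouse_speed", "login_mode"]
ORDER_ALTERNATE = ["login_mode", "pages_browsed", "mouse_speed", "dwell_seconds"]


def behavior_to_vector(behavior: dict, order: list[str]) -> list[float]:
    """Build an m-dimensional vector with one dimension per predefined type."""
    vector = []
    for key in order:
        value = behavior[key]
        if key == "login_mode":
            # Categorical behavior is converted to a number before use.
            value = LOGIN_MODE_CODES[value]
        vector.append(float(value))
    return vector


behavior = {
    "dwell_seconds": 5,
    "pages_browsed": 3,
    "mouse_speed": 1,
    "login_mode": "qr_code_scan",
}

print(behavior_to_vector(behavior, ORDER_DEFAULT))    # [5.0, 3.0, 1.0, 1.0]
print(behavior_to_vector(behavior, ORDER_ALTERNATE))  # [1.0, 3.0, 1.0, 5.0]
```

Whichever order is chosen, the same order would need to be used for both training and online detection so that each dimension keeps its meaning.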
After the above text feature and behavioral feature extraction, each historical inquiry corresponds to text feature data in the form of an n-dimensional feature vector and behavioral feature data in the form of an m-dimensional feature vector. The text feature data and the behavioral feature data can then be spliced together to generate the sample feature data of the corresponding inquiry.
The embodiments of the present application can use a combination of text feature data and behavioral feature data. Of course, the application does not exclude adding, in other embodiments, other data information used for identifying and detecting text, such as the account of the inquiry sender, the credit data of the inquiry sender, the login IP, or the city-level login address. These data can be combined with the text feature data and the behavioral feature data as one or more new feature vectors. For example, if the credit score is 69, the credit feature data T can be [69]. Alternatively, for a login address feature A, if the login address is consistent with the text sender's registered address or usual login address, the value of A is 0; the greater the difference between the two in physical or logical address, the larger the value of A. The values of A and/or T then participate in the combined feature data and are used together with the text feature data and behavioral feature data for model training or text identification.
The text feature data and the behavioral feature data can be combined in different ways depending on data processing needs, the parameter settings of the machine learning model, and so on: for example, adding values in corresponding dimensions, combining after weighting part of the data, or combining after converting to another data form and normalizing. In one embodiment provided by the present application, combining the text feature data and the behavioral feature data of the identified text in a predetermined manner may include:
S301: splicing the text feature data with the behavioral feature data to generate combined feature data of length (n+m).
For example, in the spam inquiry identification scenario of this embodiment, after the text feature data and the behavioral feature data are spliced, the newly formed combined feature data is an (n+m)-dimensional feature vector, which can be [w1, w2, w3, ..., wn, a1, a2, a3, ..., am]. Of course, other data, such as credit feature data, can also be added to the combined feature data in other embodiments, in which case the combined feature data can be [w1, w2, w3, ..., wn, a1, a2, a3, ..., am, t]. The order of the different types of data within the combined feature data can be customized; for example, the credit data can be placed first, followed by the behavioral feature data and then the text feature data. Fig. 4 is a schematic diagram of an embodiment of combining text feature data and behavioral feature data provided by the present application.
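The splicing of step S301 can be sketched in a few lines. The function name `splice_features` and all numeric values except the credit score 69 are placeholders invented for illustration.

```python
def splice_features(text_vec, behavior_vec, extra_vec=()):
    """Concatenate feature vectors: length n + m, plus any optional extras."""
    return list(text_vec) + list(behavior_vec) + list(extra_vec)


text_vec = [0.12, -0.40, 0.88]   # n = 3, hypothetical text feature values
behavior_vec = [5.0, 3.0, 1.0]   # m = 3, behavioral feature values
credit_vec = [69.0]              # optional credit feature t

combined = splice_features(text_vec, behavior_vec)
combined_with_credit = splice_features(text_vec, behavior_vec, credit_vec)

print(len(combined))              # 6
print(combined_with_credit[-1])   # 69.0
```

Changing the argument order, for example placing the credit vector first, gives the customized orderings described above.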
The combination can also include other embodiments, such as performing an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions, for example addition, multiplication, or another preset algorithm, to synthesize a single value in each corresponding dimension and thereby obtain the combined feature data in that dimension. For example, adding the values in corresponding dimensions yields [w1+a1, w2+a2, w3+a3, ..., wn+am]; if n and m are not equal, the vacant positions can be filled with 0 or another preset value. Fig. 5 is a schematic diagram of another embodiment of combining text feature data and behavioral feature data provided by the present application: if n is larger than m by 2, the combined feature data contains n values, the last two of which are w(n-1) and wn.
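The per-dimension combination with padding of vacant positions can be illustrated as below. This is a minimal sketch; the function name `combine_by_dimension` and the sample values are assumptions, and addition with a fill value of 0 is used because that is the case described above.

```python
import operator
from itertools import zip_longest


def combine_by_dimension(text_vec, behavior_vec, op=operator.add, fill=0.0):
    """Combine two vectors dimension by dimension, padding the shorter one."""
    return [
        op(w, a)
        for w, a in zip_longest(text_vec, behavior_vec, fillvalue=fill)
    ]


text_vec = [1.0, 2.0, 3.0, 4.0, 5.0]   # n = 5
behavior_vec = [10.0, 20.0, 30.0]      # m = 3, so n is larger than m by 2

print(combine_by_dimension(text_vec, behavior_vec))
# [11.0, 22.0, 33.0, 4.0, 5.0]
```

With addition and a fill value of 0, the last two entries are simply the unmatched text values. If multiplication were chosen instead, a fill value of 0 would zero those positions out, so a different preset value such as 1 might be preferable.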
After the sample feature data is formed, it can be input into the random forest model used in this embodiment for training. Once training on the sample feature data generated from historical inquiry data reaches a certain prediction metric, the model can be provided for online use to detect and classify inquiries to be identified and judge whether they are spam inquiries. In machine learning, a random forest is a classifier containing multiple decision trees. In one implementation, the main process may include:
First, sampling with replacement is performed on the sample data set to construct sub-datasets, each with the same amount of data as the original data set. Elements may repeat across different sub-datasets, and elements within the same sub-dataset may also repeat.
Second, sub-decision trees are constructed using the sub-datasets: the data is fed into each sub-decision tree, and each sub-decision tree outputs a result. Finally, when new data needs to be classified by the random forest, the output of the random forest is obtained by voting on the judgments of the sub-decision trees.
For example, suppose the random forest contains 3 sub-decision trees. After the text to be identified is processed, if 2 subtrees classify it as class A (spam inquiry) and 1 subtree classifies it as class B (normal inquiry), the classification result of the random forest is class A, spam inquiry.
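The two ingredients described above, sampling with replacement and majority voting, can be sketched with the Python standard library. The construction of the decision trees themselves is omitted; the function names, the sample identifiers, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Sample with replacement; the sub-dataset has the same size as the original."""
    return rng.choices(dataset, k=len(dataset))


def forest_vote(tree_outputs):
    """Return the class predicted by the largest number of sub-decision trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
dataset = ["s1", "s2", "s3", "s4", "s5"]

# One sub-dataset per sub-decision tree; elements may repeat within and across them.
sub_datasets = [bootstrap_sample(dataset, rng) for _ in range(3)]

# Two trees vote class A (spam inquiry), one votes class B (normal inquiry).
result = forest_vote(["A", "A", "B"])
print(result)  # A
```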
After the combined feature identification model has been constructed in the above manner, it can be used by the online system for text identification and detection. When a text needs to be detected, text feature data can be generated based on the information content of the text to be identified, while the behavioral data corresponding to the text to be identified is obtained and used to generate behavioral feature data. Then, after the data including at least these two types of features is combined, it is processed using the combined feature identification model generated by the above training to obtain the identification and detection result of the text to be identified. A specific implementation scenario is shown in Fig. 6, which is a schematic diagram of the process of identifying spam inquiries in one embodiment of the present application. It may include a database storing data on the state of the user's inquiry process and a database storing the specific text content data, from which behavioral feature data and text feature data are respectively extracted for training. Obtaining the behavioral feature data of the text to be identified may include obtaining data on the preset behavior types generated by the user before and after the text to be identified is sent, which can greatly improve the timeliness of text detection. Moreover, because the text content information and behavioral data are extracted automatically, and the feature data formed after combination under uniform rules is used for training and identification, subjective human intervention is reduced and the accuracy and reliability of text identification are also improved.
It should be noted that, in both the process of constructing the combined feature identification model and the process of detecting a text to be identified in real time, the present application does not limit the order in which the text feature data and the behavioral feature data are obtained; in some embodiments the two kinds of data can be obtained simultaneously. In addition, the historical data can be data containing the information content of the text together with behavioral data associated with the text, such as attribute information, operation behavior, and login behavior of the associated account. In some implementation scenarios, if only the historical data of the text itself is obtained and the behavioral data is acquired by other means or from another database (such as a data service provided by a third party), those skilled in the art will understand that this likewise falls within the scope of the historical data described in the present application.
The embodiments above describe the scheme using spam inquiry identification as the implementation scenario, but the underlying inventive idea of the present application can also be used for text identification in other business scenarios, such as identifying, detecting, and classifying spam email, malicious comments, flooding messages, instant messaging logs, email attachments, attachments sent by instant messaging, and other texts. Therefore, in another embodiment of the method, the text to be identified may include at least one of the following text types:
Inquiry information, e-mail messages, comment, message, RFQ information, instant messaging chat record, attachment.
The text detection method provided by the present application uses both text content and behavioral data, combining the feature data of the two types so that a unified machine learning model can be used for training and prediction. Using the feature data of the text information content, the application can perform text detection in the dimension of text content; at the same time, it can obtain the behavioral data of the text as soon as the text information is sent, and then splice the feature data of this behavioral data with the feature data of the text information content to form new data for text detection. Compared with the lag of existing approaches such as spam account identification, the embodiments of the present application greatly improve the timeliness of spam text identification and detection. The application can use historical data to combine text content with behavioral feature data offline and train a unified machine learning model, which can then identify and detect text online. This also avoids the intervention of human experience found in some existing implementations, along with the uncontrollable and unstable effects it brings. Therefore, with the embodiments of the present application, the behavioral data associated with a text during sending can be obtained as soon as the text is sent and combined with the data of the text content, which can greatly improve the accuracy and timeliness of spam text detection and improve the security of text information. Moreover, detecting the combined feature data with an identification model built in advance from the combined feature data of historical texts makes the results of text detection and identification more stable and reliable.
Based on the description of the above embodiment scenarios, the present application also provides another embodiment of the method. In this embodiment, during the feature combination processing, the information content of the text and the behavioral data of the user associated with the text can each be converted into a corresponding feature vector, and the feature vectors are combined to generate the combined feature data. Specifically, in another embodiment of the text detection method provided by the present application, the method may include:
Generating a text feature vector based on the information content of the text to be identified;
Obtaining the behavioral data of the user associated with the text to be identified, and generating a behavioral feature vector;
Combining the data including the text feature vector and the behavioral feature vector in a predetermined manner to generate a combined feature vector;
Processing the combined feature vector using a combined feature identification model constructed in advance, and determining the detection result of the text to be identified according to the processing result.
In this way, using the vector data format for the extraction, conversion, combination, and training of text feature data and behavioral feature data further simplifies data processing and increases computer data processing speed, thereby improving the efficiency of spam text detection and identification.
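The four steps of this vector-based embodiment can be chained as in the sketch below. Everything here is a stand-in: the toy feature extractors and the rule-based `toy_model` are hypothetical placeholders for doc2vec features and a trained combined feature identification model, and are shown only to make the data flow concrete.

```python
from typing import Callable

Vector = list[float]


def detect_text(
    text: str,
    behavior: dict,
    text_features: Callable[[str], Vector],
    behavior_features: Callable[[dict], Vector],
    combine: Callable[[Vector, Vector], Vector],
    model: Callable[[Vector], str],
) -> str:
    """Run the four steps: text vector, behavior vector, combination, model."""
    text_vec = text_features(text)
    behavior_vec = behavior_features(behavior)
    combined = combine(text_vec, behavior_vec)
    return model(combined)


def toy_text_features(text: str) -> Vector:
    # Placeholder features: word count and number of links in the text.
    return [float(len(text.split())), float(text.count("http"))]


def toy_behavior_features(behavior: dict) -> Vector:
    return [float(behavior["dwell_seconds"]), float(behavior["pages_browsed"])]


def toy_combine(text_vec: Vector, behavior_vec: Vector) -> Vector:
    return text_vec + behavior_vec  # splicing, length n + m


def toy_model(combined: Vector) -> str:
    # Hypothetical rule standing in for a trained model:
    # a link in the text plus a dwell time under 2 seconds is treated as spam.
    return "spam" if combined[1] > 0 and combined[2] < 2 else "normal"


verdict = detect_text(
    "cheap deals http://x.example",
    {"dwell_seconds": 1, "pages_browsed": 1},
    toy_text_features, toy_behavior_features, toy_combine, toy_model,
)
print(verdict)  # spam
```

Passing the extractors, the combination function, and the model as parameters keeps the pipeline independent of any particular feature method or classifier.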
The method described above identifies spam inquiries well: it can effectively improve the accuracy and recall of spam inquiry identification, reduce the frequency with which users receive spam inquiries, and reduce the risk of users entering the phishing pages they contain. Therefore, the present application also provides a method that can be used for spam inquiry identification. As shown in Fig. 7, which is a schematic flowchart of a method for identifying spam inquiries provided by the present application, the method may specifically include:
S40: generating text feature data based on the information content of the inquiry to be identified;
S41: obtaining the behavioral data of the user associated with the inquiry to be identified, and generating behavioral feature data;
S42: combining the data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
S43: processing the combined feature data using a spam inquiry identification model constructed offline in advance, and judging whether the inquiry to be identified is a spam inquiry according to the processing result.
Of course, the embodiment can also be applied in RFQ scenarios. In this way, with the embodiments of the present application, identified spam inquiries and spam RFQs can be intercepted and tracked on some business websites, improving the customer experience and reducing risks such as fraud and phishing.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the text detection method described above, the present application also provides a text detection device. The device may include a system (including a distributed system), software (an application), a module, a component, a server, a client, and the like that uses the method described in the present application, combined with the hardware necessary for implementation. Based on the same inventive idea, the device in one embodiment provided by the present application is described in the following embodiments. Because the way the device solves the problem is similar to the method, the implementation of the specific device of the present application may refer to the implementation of the foregoing method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware or in a combination of software and hardware is also possible and contemplated. Specifically, Fig. 8 is a schematic diagram of the module structure of an embodiment of a text detection device provided by the present application. As shown in Fig. 8, the device may include:
Text feature extraction module 101, which can be used to generate text feature data based on the information content of the text to be identified;
Behavioral feature extraction module 102, which can be used to obtain behavioral data of predefined types of the user associated with the text to be identified, and to generate behavioral feature data;
Feature combination module 103, which can be used to combine the data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
Detection module 104, which can be used to process the combined feature data using a combined feature identification model constructed in advance, and to determine the detection result of the text to be identified according to the processing result.
The combined feature identification model can be generated in advance by training on the historical data of texts. A machine learning model can be selected for training, and the historical data used in training may include the feature data of the information content of the text and the behavioral data. In a specific embodiment, the detection module 104 may include:
Model training module 1041, which can be used to obtain the historical data of the identified text, extract the text feature data of the identified text, and obtain the behavioral feature data of predefined types of the user associated with the identified text; can also be used to combine the text feature data and the behavioral feature data in the predetermined manner to generate sample feature data; and can be used to train the selected machine learning model on the sample feature data to obtain the combined feature identification model.
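The sample-building part of such a training module might look like the sketch below, which turns labelled historical records into sample feature vectors and labels. The record layout (`text`, `behavior`, `label`), the toy extractors, and the function name `build_training_set` are assumptions made for this sketch; the description does not specify how labels are stored.

```python
def build_training_set(history, text_features, behavior_features):
    """Turn historical records into (sample feature vectors, labels)."""
    samples, labels = [], []
    for record in history:
        text_vec = text_features(record["text"])
        behavior_vec = behavior_features(record["behavior"])
        samples.append(text_vec + behavior_vec)  # spliced, length n + m
        labels.append(record["label"])
    return samples, labels


def toy_text_features(text):
    # Placeholder for a real text feature extractor such as doc2vec.
    return [float(len(text.split())), float(text.count("http"))]


def toy_behavior_features(behavior):
    return [float(behavior["dwell_seconds"]), float(behavior["pages_browsed"])]


# Hypothetical historical inquiries with known labels.
history = [
    {"text": "cheap deals http://x.example",
     "behavior": {"dwell_seconds": 1, "pages_browsed": 1},
     "label": "spam"},
    {"text": "please quote 200 units of steel pipe",
     "behavior": {"dwell_seconds": 30, "pages_browsed": 4},
     "label": "normal"},
]

X, y = build_training_set(history, toy_text_features, toy_behavior_features)
print(X)  # [[3.0, 1.0, 1.0, 1.0], [7.0, 0.0, 30.0, 4.0]]
print(y)  # ['spam', 'normal']
```

The resulting `X` and `y` are in the shape most classifier libraries expect; for instance, they could be passed to the `fit(X, y)` method of a random forest implementation such as scikit-learn's `RandomForestClassifier`, if that library is used.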
During training of the combined feature identification model, the behavioral data of the predefined types may include data on the preset behavior types generated by the sender of the identified text before and after the identified text was sent. Correspondingly, during online identification and detection of text, obtaining the behavioral data of predefined types for the text to be identified includes obtaining data on the preset behavior types generated by the sender of the text to be identified before and after the text to be identified is sent.
As described in the foregoing method embodiments, there can be multiple ways of extracting and converting text into the corresponding text feature data and behavioral feature data. In one embodiment of the device described in the present application, the text feature extraction module 101 obtaining the text feature data may specifically include:
Mapping the information content of the identified text into a high-dimensional space and generating a vector of length n, n >= 1;
And the behavioral feature extraction module 102 obtaining the behavioral feature data may specifically include:
Taking each predefined type as a vector dimension and the value of the behavioral data of that predefined type as the coordinate value on the corresponding vector dimension, and generating a vector of length m, m >= 1.
The way text feature data and behavioral feature data are combined can be set according to different data processing needs, using different combination methods. In one embodiment of the device provided by the present application, the feature combination module 103 may include at least one of the following:
Feature splicing module 1031, which can be used to splice the n-dimensional text feature data with the m-dimensional behavioral feature data to generate combined feature data of length (n+m);
Dimensional feature calculation module 1032, which can be used to perform an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions to obtain the combined feature data in those dimensions.
The feature combination module 103 can use either of the above modules to combine the text feature data and the behavioral feature data.
The device can be used in application scenarios for identifying spam inquiries, in which case the text to be identified may include the inquiry information of spam inquiries. It can also be applied in other implementation scenarios, such as spam email detection, malicious comments, and RFQ business scenarios. In each implementation scenario, a machine learning model can be trained on the combined feature data obtained by combining the text feature data and behavioral feature data of the historical data, and then used to identify and detect the corresponding text. Therefore, the text to be identified may include at least one of the following text types:
Inquiry information, e-mail messages, comment, message, RFQ information, instant messaging chat record, attachment.
Of course, other embodiments are also possible; for example, the combined feature data can also include other information data used for identifying and detecting text, such as account information data, credit data, and login address.
The present application also provides a device suitable for identifying spam inquiries. As shown in Fig. 9, which is a schematic diagram of the module structure of an embodiment of an inquiry detection device provided by the present application, the device may include:
First extraction module 201, which can be used to generate text feature data based on the information content of the inquiry to be identified;
Second extraction module 202, which can be used to obtain the behavioral data of the user associated with the inquiry to be identified, and to generate behavioral feature data;
Feature combination module 203, which can be used to combine the data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
Inquiry detection module 204, which can be used to process the combined feature data using a spam inquiry identification model constructed offline in advance, and to judge whether the inquiry to be identified is a spam inquiry according to the processing result.
Of course, the spam inquiry identification model can be constructed from historical inquiry data extracted over a period of time, including the inquiry text and the behavioral data before and after the inquiry was sent: text feature extraction and behavioral feature extraction are performed respectively, and the results are used to train a machine learning model such as a random forest.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the module-based device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
The method or device described above can be implemented by a computer program in combination with the necessary hardware and can be deployed in terminal devices such as mobile terminals, servers, and distributed systems. By combining text content and behavioral data, it can identify and detect texts such as spam inquiries, spam email, flooding messages, and malicious comments more accurately, reliably, and promptly. Therefore, the present application also provides a text detection device, which may include a processor and a memory for storing processor-executable instructions; when executing the instructions, the processor may implement:
Generating text feature data based on the information content of the text to be identified;
Obtaining behavioral data of predefined types of the user associated with the text to be identified, and generating behavioral feature data;
Combining the data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
Processing the combined feature data using a combined feature identification model constructed in advance, and determining the detection result of the text to be identified according to the processing result.
The method or device described in the above embodiments of the present application can implement the business logic through a computer program recorded on a storage medium; the storage medium can be read and executed by a computer to achieve the effects of the schemes described in the embodiments of the present application. Therefore, the present application also provides a computer-readable storage medium storing computer instructions that, when executed, may implement the following steps:
Generating text feature data based on the information content of the text to be identified;
Obtaining behavioral data of predefined types of the user associated with the text to be identified, and generating behavioral feature data;
Combining the data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
Processing the combined feature data using a combined feature identification model constructed in advance, and determining the detection result of the text to be identified according to the processing result.
The computer-readable storage medium may include a physical device for storing information, which is usually digitized and then stored in media that use electrical, magnetic, optical, or other means. The computer-readable storage medium described in this embodiment may include: devices that store information using electrical energy, such as various kinds of memory, for example RAM and ROM; devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tapes, magnetic core memory, bubble memory, and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, there are also other kinds of readable storage media, such as quantum memory and graphene memory.
It should be noted that, in accordance with the description of the related method embodiments, the device or electronic equipment described above in this specification can also include other embodiments; for the specific implementation, refer to the description of the method embodiments, which is not repeated here. The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the hardware-plus-program and storage-medium-plus-program embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
The device or method described above can be used in a variety of business systems. It can effectively improve the accuracy of spam text identification and detect spam texts such as spam inquiries, spam email, and malicious comments promptly and effectively, with good accuracy, stability, and timeliness, improving the security of the text information environment in the business system and the user experience. Therefore, the present application also provides a spam text detection system, which can be a standalone spam text detection system or can be applied within various types of text service systems, such as an inquiry business system or a mail system. The text system may include the text detection device described in any one of the above embodiments.
The system can be a single server, a system cluster composed of multiple application servers, or a service in a distributed system. Fig. 10 is a schematic structural diagram of an embodiment of a spam text detection system provided by the present application.
It should be noted that, in accordance with the description of the method embodiments, the system described above can also include other embodiments; for the specific implementation, refer to the description of the related method embodiments, which is not repeated here.
The text detection method, device, and system provided by the present application use both text content and behavioral data when detecting text, combining the feature data of the two types so that a unified machine learning model can be used for training and prediction. Using the feature data of the text information content, the application can perform text detection in the dimension of text content; at the same time, it can obtain the behavioral data of the text as soon as the text information is sent, and then splice the feature data of this behavioral data with the feature data of the text information content to form new data for text detection. Compared with the lag of existing approaches such as spam account identification, the embodiments of the present application greatly improve the timeliness of spam text identification and detection. The application can use historical data to combine text content with behavioral feature data offline and train a single machine learning model, which can then identify and detect text online. This also avoids the intervention of human experience found in some existing implementations, along with the uncontrollable and unstable effects it brings. Therefore, with the embodiments of the present application, the behavioral data associated with a text during sending can be obtained as soon as the text is sent and combined with the data of the text content, which can greatly improve the accuracy and timeliness of spam text detection and improve the security of text information. Moreover, detecting the combined feature data with an identification model built in advance from the combined feature data of historical texts makes the results of text detection and identification more stable and reliable.
Although the content of the application mentions descriptions of data definition, acquisition, interaction, calculation, judgment, and embodiment implementation, such as mapping text into a high-dimensional space with doc2vec, building the training model with a random forest, and behavioral data, text features, and behavior types or modes such as page dwell time, mouse movement speed, and account login count, the application is not limited to cases that must conform to industry programming language standards, standard data models or algorithms, general computer processing and storage rules, or the situations described in the embodiments of the application. Implementations slightly modified on the basis of certain industry standards, or of implementations described in a customized manner or in the embodiments, can also achieve effects that are the same as, equivalent to, or similar to those of the above embodiments, or that are predictable after variation. Embodiments that apply such modified or varied ways of data acquisition, storage, judgment, and processing may still fall within the scope of the optional embodiments of the application.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development. The source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can readily be obtained merely by lightly logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller can be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. The means for implementing various functions can even be regarded both as software modules implementing a method and as structures within a hardware component.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer. Specifically, the computer may be, for example, a personal computer, laptop computer, in-vehicle human-machine interaction device, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
Although the application provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive means. The order of steps listed in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual apparatus or end product executes the method, the steps may be executed sequentially or in parallel according to the embodiments or the methods shown in the drawings (for example, in a parallel-processor or multi-threaded environment, or even a distributed data processing environment). The terms "include", "comprise", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, the presence of additional identical or equivalent elements in the process, method, product, or device that includes the element is not excluded.
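As an illustration of the sequential-or-parallel point above, the sketch below runs the text feature step and the behavioral feature step either one after the other or concurrently in two threads, and combines the results in a fixed order. It is only a sketch under assumptions: the extractor bodies are placeholders (character and word counts instead of a real text embedding), and the names `extract_text_features`, `extract_behavior_features`, `page_dwell_seconds`, and `login_count` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text_features(text):
    """Placeholder text features: character count and word count."""
    return [float(len(text)), float(len(text.split()))]


def extract_behavior_features(behavior):
    """Placeholder behavioral features read from a dict of behavior data."""
    return [float(behavior["page_dwell_seconds"]), float(behavior["login_count"])]


def extract_sequential(text, behavior):
    """Run the two extraction steps one after the other, then splice."""
    return extract_text_features(text) + extract_behavior_features(behavior)


def extract_parallel(text, behavior):
    """Run the two independent extraction steps in separate threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text_features, text)
        behavior_future = pool.submit(extract_behavior_features, behavior)
        # Results are spliced in a fixed order regardless of completion order.
        return text_future.result() + behavior_future.result()


sample_text = "buy now cheap"
sample_behavior = {"page_dwell_seconds": 2, "login_count": 1}
print(extract_sequential(sample_text, sample_behavior))
print(extract_parallel(sample_text, sample_behavior))
```

The two steps do not depend on each other, so either execution order yields the same combined vector; only the combination step has to wait for both.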
For convenience of description, the above apparatus is described with its functions divided into various modules. Of course, when implementing the application, the functions of the modules may be implemented in one or more pieces of software and/or hardware, and a module implementing a given function may also be realized as a combination of multiple submodules or subunits. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; other divisions are possible in actual implementation, for instance multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Those skilled in the art also know that, besides implementing a controller purely as computer-readable program code, the method steps can be logic-programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. The means for implementing various functions can even be regarded both as software modules implementing a method and as structures within a hardware component.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Those skilled in the art will understand that embodiments of the application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the method embodiments. In this specification, descriptions referring to "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the application. In this specification, such schematic statements do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and their features.
The above are merely embodiments of the application and are not intended to limit it. For those skilled in the art, various modifications and variations of the application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall fall within the scope of its claims.

Claims (20)

1. A text detection method, characterized in that the method comprises:
generating text feature data based on the information content of a text to be identified;
obtaining behavioral data of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined feature identification model, and determining a detection result of the text to be identified according to the processing result.
2. The text detection method according to claim 1, characterized in that the combined feature identification model is built in the following manner:
obtaining historical data of identified texts, extracting text feature data of the identified texts, and obtaining behavioral feature data of a predetermined type of the users associated with the identified texts;
combining the text feature data and the behavioral feature data in the predetermined manner to generate sample feature data;
training the sample feature data in a selected machine learning training model to obtain the combined feature identification model.
3. The text detection method according to claim 2, characterized in that the behavioral data of the predetermined type includes data information of preset behavior types generated by the sender of an identified text before and after the identified text is sent.
4. The text detection method according to claim 2, characterized in that the combined feature data further includes at least one of the following:
account information data, credit data, entry address.
5. The text detection method according to claim 2, characterized in that the text feature data includes a vector of length n dimensions generated by mapping the information content of the identified text into a high-dimensional space, n ≥ 1;
the behavioral feature data includes a vector of length m dimensions generated by taking the predetermined types as the vector dimensions and the values of the behavioral data of the predetermined types as the coordinate values on the corresponding vector dimensions, m ≥ 1.
6. The text detection method according to claim 5, characterized in that combining the text feature data and the behavioral feature data of the identified text in the predetermined manner comprises:
splicing the text feature data with the behavioral feature data to generate combined feature data of length (n+m) dimensions.
7. The text detection method according to claim 5, characterized in that combining the text feature data and the behavioral feature data of the identified text in the predetermined manner comprises:
performing an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions to obtain the combined feature data in the corresponding dimensions.
8. The text detection method according to any one of claims 1-7, characterized in that the text to be identified includes at least one of the following text types:
inquiry information, email information, comments, messages, RFQ information, instant messaging chat records, attachments.
9. A spam inquiry detection method, characterized by comprising:
generating text feature data based on the information content of an inquiry to be identified;
obtaining behavioral data of a user associated with the inquiry to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a spam inquiry identification model built offline in advance, and judging according to the processing result whether the inquiry to be identified is a spam inquiry.
10. A text detection apparatus, characterized in that the apparatus comprises:
a text feature extraction module, configured to generate text feature data based on the information content of a text to be identified;
a behavioral feature extraction module, configured to obtain behavioral data of a predetermined type of a user associated with the text to be identified, and to generate behavioral feature data;
a feature combination module, configured to combine data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
a detection module, configured to process the combined feature data using a pre-built combined feature identification model, and to determine a detection result of the text to be identified according to the processing result.
11. The text detection apparatus according to claim 10, characterized in that the detection module comprises:
a model training module, configured to obtain historical data of identified texts and to obtain behavioral feature data of a predetermined type of the users associated with the identified texts; further configured to combine the text feature data and the behavioral feature data in the predetermined manner to generate sample feature data; and to train the sample feature data in a selected machine learning training model to obtain the combined feature identification model.
12. The text detection apparatus according to claim 11, characterized in that the behavioral data of the predetermined type includes data information of preset behavior types generated by the sender of an identified text before and after the identified text is sent.
13. The text detection apparatus according to claim 12, characterized in that the text feature extraction module obtaining text feature data comprises:
mapping the information content of the identified text into a high-dimensional space to generate a vector of length n dimensions, n ≥ 1;
and the behavioral feature extraction module obtaining behavioral feature data comprises:
generating a vector of length m dimensions by taking the predetermined types as the vector dimensions and the values of the behavioral data of the predetermined types as the coordinate values on the corresponding vector dimensions, m ≥ 1.
14. The text detection apparatus according to claim 13, characterized in that the feature combination module comprises at least one of the following:
a feature splicing module, configured to splice the n-dimensional text feature data with the m-dimensional behavioral feature data to generate combined feature data of length (n+m) dimensions;
a dimension feature calculation module, configured to perform an operation on the values of the text feature data and the behavioral feature data of the identified text in corresponding dimensions to obtain the combined feature data in the corresponding dimensions.
15. The text detection apparatus according to claim 11, characterized in that the text to be identified includes at least one of the following text types:
inquiry information, email information, comments, messages, RFQ information, instant messaging chat records, attachments.
16. A text detection apparatus, characterized by comprising:
a first extraction module, configured to generate text feature data based on the information content of an inquiry to be identified;
a second extraction module, configured to obtain behavioral data of a user associated with the inquiry to be identified, and to generate behavioral feature data;
a feature combination module, configured to combine data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
an inquiry detection module, configured to process the combined feature data using a spam inquiry identification model built offline in advance, and to judge according to the processing result whether the inquiry to be identified is a spam inquiry.
17. A text detection apparatus, characterized by comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements:
generating text feature data based on the information content of a text to be identified;
obtaining behavioral data of a predetermined type of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined feature identification model, and determining a detection result of the text to be identified according to the processing result.
18. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed, implement the following steps:
generating text feature data based on the information content of a text to be identified;
obtaining behavioral data of a predetermined type of a user associated with the text to be identified, and generating behavioral feature data;
combining data including the text feature data and the behavioral feature data in a predetermined manner to generate combined feature data;
processing the combined feature data using a pre-built combined feature identification model, and determining a detection result of the text to be identified according to the processing result.
19. A spam text detection system, characterized by comprising the text detection apparatus according to any one of claims 10 to 17.
20. A text detection method, characterized in that the method comprises:
generating a text feature vector based on the information content of a text to be identified;
obtaining behavioral data of a user associated with the text to be identified, and generating a behavioral feature vector;
combining data including the text feature vector and the behavioral feature vector in a predetermined manner to generate a combined feature vector;
processing the combined feature vector using a pre-built combined feature identification model, and determining a detection result of the text to be identified according to the processing result.
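The vector construction in claim 5 and the two combination modes in claims 6 and 7 can be illustrated with a short sketch. It is a hedged example, not the claimed implementation: the behavior type names in `BEHAVIOR_TYPES` are hypothetical labels for the page dwell time, mouse movement speed, and account login count mentioned in the description, and element-wise multiplication is only an assumed choice of "operation", since the claims do not name one.

```python
# Hypothetical predetermined behavior types; each one is a vector dimension.
BEHAVIOR_TYPES = ("page_dwell_seconds", "mouse_speed", "login_count")


def behavior_vector(behavior, types=BEHAVIOR_TYPES):
    """Build an m-dim vector: one coordinate per predetermined behavior type.

    Missing types default to 0.0 (an assumption made for this sketch).
    """
    return [float(behavior.get(t, 0.0)) for t in types]


def splice(text_vec, behavior_vec):
    """Claim 6 style: concatenate n-dim and m-dim vectors into (n+m) dims."""
    return list(text_vec) + list(behavior_vec)


def combine_by_dimension(text_vec, behavior_vec, op=lambda a, b: a * b):
    """Claim 7 style: apply an operation to values in corresponding dimensions.

    Requires equal lengths; multiplication is an assumed default operation.
    """
    if len(text_vec) != len(behavior_vec):
        raise ValueError("vectors must have the same number of dimensions")
    return [op(a, b) for a, b in zip(text_vec, behavior_vec)]


text_vec = [0.5, 0.25, 0.125]  # stand-in for an n=3 text embedding
bvec = behavior_vector({"page_dwell_seconds": 12, "login_count": 3})

spliced = splice(text_vec, bvec)               # length n + m = 6
pairwise = combine_by_dimension(text_vec, bvec)  # length 3
print(spliced)
print(pairwise)
```

Splicing keeps every original coordinate and works for any n and m, while the dimension-wise mode keeps the vector short but needs the two vectors to share a dimensionality.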
CN201710549655.8A 2017-07-07 2017-07-07 A kind of Method for text detection, apparatus and system Pending CN109213859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710549655.8A CN109213859A (en) 2017-07-07 2017-07-07 A kind of Method for text detection, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710549655.8A CN109213859A (en) 2017-07-07 2017-07-07 A kind of Method for text detection, apparatus and system

Publications (1)

Publication Number Publication Date
CN109213859A true CN109213859A (en) 2019-01-15

Family

ID=64991074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710549655.8A Pending CN109213859A (en) 2017-07-07 2017-07-07 A kind of Method for text detection, apparatus and system

Country Status (1)

Country Link
CN (1) CN109213859A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347797A (en) * 2019-07-10 2019-10-18 广州市百果园信息技术有限公司 Method for detecting, system, equipment and the storage medium of text information
CN110502614A (en) * 2019-08-16 2019-11-26 阿里巴巴集团控股有限公司 Text hold-up interception method, device, system and equipment
CN110705250A (en) * 2019-09-23 2020-01-17 义语智能科技(广州)有限公司 Method and system for identifying target content in chat records
CN111259140A (en) * 2020-01-13 2020-06-09 长沙理工大学 False comment detection method based on LSTM multi-entity feature fusion
CN111400714A (en) * 2020-04-16 2020-07-10 Oppo广东移动通信有限公司 Virus detection method, device, equipment and storage medium
CN111416812A (en) * 2020-03-16 2020-07-14 深信服科技股份有限公司 Malicious script detection method, equipment and storage medium
CN112784007A (en) * 2020-07-16 2021-05-11 上海芯翌智能科技有限公司 Text matching method and device, storage medium and computer equipment
CN112990172A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device
CN113971400A (en) * 2020-07-24 2022-01-25 北京字节跳动网络技术有限公司 Text detection method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103246655A (en) * 2012-02-03 2013-08-14 腾讯科技(深圳)有限公司 Text categorizing method, device and system
CN103853841A (en) * 2014-03-19 2014-06-11 北京邮电大学 Method for analyzing abnormal behavior of user in social networking site
CN104951542A (en) * 2015-06-19 2015-09-30 百度在线网络技术(北京)有限公司 Method and device for recognizing class of social contact short texts and method and device for training classification models
US20150317078A1 (en) * 2008-01-09 2015-11-05 Apple Inc. Method, device, and graphical user interface providing word recommendations for text input
CN105389379A (en) * 2015-11-20 2016-03-09 重庆邮电大学 Rubbish article classification method based on distributed feature representation of text
US20160124941A1 (en) * 2014-11-04 2016-05-05 Fujitsu Limited Translation device, translation method, and non-transitory computer readable recording medium having therein translation program

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150317078A1 (en) * 2008-01-09 2015-11-05 Apple Inc. Method, device, and graphical user interface providing word recommendations for text input
CN103246655A (en) * 2012-02-03 2013-08-14 腾讯科技(深圳)有限公司 Text categorizing method, device and system
CN103853841A (en) * 2014-03-19 2014-06-11 北京邮电大学 Method for analyzing abnormal behavior of user in social networking site
US20160124941A1 (en) * 2014-11-04 2016-05-05 Fujitsu Limited Translation device, translation method, and non-transitory computer readable recording medium having therein translation program
CN104951542A (en) * 2015-06-19 2015-09-30 百度在线网络技术(北京)有限公司 Method and device for recognizing class of social contact short texts and method and device for training classification models
CN105389379A (en) * 2015-11-20 2016-03-09 重庆邮电大学 Rubbish article classification method based on distributed feature representation of text

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347797A (en) * 2019-07-10 2019-10-18 广州市百果园信息技术有限公司 Method for detecting, system, equipment and the storage medium of text information
CN110502614A (en) * 2019-08-16 2019-11-26 阿里巴巴集团控股有限公司 Text hold-up interception method, device, system and equipment
CN110502614B (en) * 2019-08-16 2023-05-09 创新先进技术有限公司 Text interception method, device, system and equipment
CN110705250A (en) * 2019-09-23 2020-01-17 义语智能科技(广州)有限公司 Method and system for identifying target content in chat records
CN112990172B (en) * 2019-12-02 2023-12-22 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device
CN112990172A (en) * 2019-12-02 2021-06-18 阿里巴巴集团控股有限公司 Text recognition method, character recognition method and device
CN111259140A (en) * 2020-01-13 2020-06-09 长沙理工大学 False comment detection method based on LSTM multi-entity feature fusion
CN111259140B (en) * 2020-01-13 2023-07-28 长沙理工大学 False comment detection method based on LSTM multi-entity feature fusion
CN111416812B (en) * 2020-03-16 2022-06-21 深信服科技股份有限公司 Malicious script detection method, equipment and storage medium
CN111416812A (en) * 2020-03-16 2020-07-14 深信服科技股份有限公司 Malicious script detection method, equipment and storage medium
CN111400714A (en) * 2020-04-16 2020-07-10 Oppo广东移动通信有限公司 Virus detection method, device, equipment and storage medium
CN111400714B (en) * 2020-04-16 2023-06-02 Oppo广东移动通信有限公司 Virus detection method, device, equipment and storage medium
CN112784007B (en) * 2020-07-16 2023-02-21 上海芯翌智能科技有限公司 Text matching method and device, storage medium and computer equipment
CN112784007A (en) * 2020-07-16 2021-05-11 上海芯翌智能科技有限公司 Text matching method and device, storage medium and computer equipment
WO2022017299A1 (en) * 2020-07-24 2022-01-27 北京字节跳动网络技术有限公司 Text inspection method and apparatus, electronic device, and storage medium
CN113971400B (en) * 2020-07-24 2023-07-25 抖音视界有限公司 Text detection method and device, electronic equipment and storage medium
CN113971400A (en) * 2020-07-24 2022-01-25 北京字节跳动网络技术有限公司 Text detection method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109213859A (en) A kind of Method for text detection, apparatus and system
Salminen et al. Creating and detecting fake reviews of online products
Wu et al. AI-generated content (AIGC): A survey
Wu et al. OpinionSeer: interactive visualization of hotel customer feedback
CN103377262B (en) Method and apparatus for grouping users
CN107153641A (en) Comment information determination method, device, server and storage medium
CN108701118A (en) Semantic category classification
CN107977415A (en) Automatic question-answering method and device
CN108345692A (en) Automatic question-answering method and system
CN110019812A (en) User-generated content detection algorithm and system
CN106776936A (en) Intelligent interaction method and system
CN108416028A (en) Method, apparatus and server for searching content resources
CN106886518A (en) Microblog account classification method
CN107657056A (en) Method and apparatus for displaying comment information based on artificial intelligence
CN102930048B (en) Data enrichment using automatically discovered semantic and visual reference data
CN109783539A (en) Usage mining and its model building method, device and computer equipment
CN108256537A (en) User gender prediction method and system
KR20200143991A (en) Answer recommendation system and method based on text content and emotion analysis
CN110197389A (en) User identification method and device
CN110362663A (en) Adaptive multi-perceptual similarity detection and resolution
CN108304374A (en) Information processing method and related product
Radovanović et al. Review spam detection using machine learning
Ishikawa et al. Audio-visual hybrid approach for filling mass estimation
KR20210058525A (en) Method and device for classifying unstructured item data automatically for goods or services
CN116861258A (en) Model processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190115
