CN110019763A - Text filtering method, system, equipment and computer readable storage medium - Google Patents

Publication number
CN110019763A
CN110019763A CN201711449882.XA CN201711449882A CN110019763A CN 110019763 A CN110019763 A CN 110019763A CN 201711449882 A CN201711449882 A CN 201711449882A CN 110019763 A CN110019763 A CN 110019763A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711449882.XA
Other languages
Chinese (zh)
Other versions
CN110019763B (en)
Inventor
陆韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Abstract

The invention discloses a text data filtering method, system, equipment and computer-readable storage medium. The method includes: creating a rubbish text information bank that stores at least one piece of rubbish text data; performing feature extraction on the rubbish text data to generate a rubbish text feature vector, and training a rubbish text prediction model in combination with the weight of each feature; performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data; and judging, according to the probability, whether the target text data is rubbish text data. The invention remedies a shortcoming of the prior art, in which content published on forums, communities, message boards and the like is managed by manual review, which relies too heavily on administrators and consumes considerable resources. It intelligently filters out target text data that is rubbish text data and improves identification efficiency.

Description

Text filtering method, system, equipment and computer readable storage medium
Technical field
The present invention relates to the field of text processing, and in particular to a text filtering method, system, equipment and computer-readable storage medium.
Background art
Today's networks host many forums, communities, message boards and similar websites or channels on which people can publish their own views or comments. While giving people a space for free speech, such websites or channels may also attract meaningless spam comments or improper remarks on sensitive topics, so appropriate supervision of them is necessary.
At present, supervision usually relies on website administrators, who use preset keywords to manually screen and filter forum content, community article content, posts or comments, and delete meaningless junk information or sensitive information.
This approach depends heavily on manual review. Administrators have to browse forums, communities, message boards and the like in real time. For popular content, the number of visitors and the volume of information are so large that administrators can hardly filter items one by one and easily make mistakes. The approach relies too heavily on administrators and consumes considerable resources.
Summary of the invention
The technical problem to be solved by the present invention is to overcome a defect of the prior art, in which content published on forums, communities, message boards and the like is managed by manual review, which relies too heavily on administrators and consumes considerable resources. To this end, the invention provides a text filtering method, system, equipment and computer-readable storage medium capable of automatically filtering rubbish text.
The present invention solves the above technical problem through the following technical solutions:
The present invention provides a text data filtering method, characterized in that the text data filtering method comprises:
creating a rubbish text information bank, the rubbish text information bank storing at least one piece of rubbish text data;
performing feature extraction on the rubbish text data to generate a rubbish text feature vector, and training a rubbish text prediction model in combination with the weight of each feature;
performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data;
judging, according to the probability, whether the target text data is rubbish text data.
Preferably, the rubbish text data includes rubbish text content, and the target text data includes target text content;
performing feature extraction on the rubbish text data comprises: converting the rubbish text content into a numerical representation;
performing feature extraction on the target text data comprises: converting the target text content into a numerical representation.
Preferably, converting the rubbish text content into a numerical representation comprises:
extracting keywords from the rubbish text content;
counting the number of times each keyword occurs in the rubbish text content;
listing the number of occurrences of each keyword in the index order of the keywords to form a first space vector, the first space vector serving, when the rubbish text feature vector is generated, either as the rubbish text feature vector or as the values of some of the dimensions of the rubbish text feature vector;
converting the target text content into a numerical representation comprises:
extracting keywords from the target text content;
counting the number of times each keyword occurs in the target text content;
listing the number of occurrences of each keyword in the index order of the keywords to form a second space vector, the second space vector serving, when the target text feature vector is generated, either as the target text feature vector or as the values of some of the dimensions of the target text feature vector.
Preferably, the rubbish text data includes a rubbish text publication time, and the target text data includes a target text publication time;
performing feature extraction on the rubbish text data further comprises: converting the rubbish text publication time into a numerical representation;
performing feature extraction on the target text data further comprises: converting the target text publication time into a numerical representation.
Preferably, converting the rubbish text publication time into a numerical representation comprises:
dividing time into a number of time periods, and setting a numeric value for each time period;
determining the first time period to which the rubbish text publication time belongs, and determining the numeric value corresponding to the first time period, the numeric value corresponding to the first time period serving, when the rubbish text feature vector is generated, as the value of one dimension of the rubbish text feature vector, which is merged with the first space vector to form the rubbish text feature vector;
converting the target text publication time into a numerical representation comprises:
determining, according to the time periods already divided, the second time period to which the target text publication time belongs, and determining the numeric value corresponding to the second time period, the numeric value corresponding to the second time period serving, when the target text feature vector is generated, as the value of one dimension of the target text feature vector, which is merged with the second space vector to form the target text feature vector.
Preferably, the weight of each feature is calculated by the ReliefF algorithm, and the rubbish text prediction model is trained on the basis of the ReliefF algorithm.
Preferably, the text data filtering method further comprises:
manually verifying the target text data determined to be rubbish text data;
and/or storing, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or the target text data confirmed to be rubbish text data by manual verification.
The present invention also provides a text data filtering system, characterized in that the text data filtering system comprises: a data unit, a model unit and a judging unit;
the data unit is used to create a rubbish text information bank, the rubbish text information bank storing at least one piece of rubbish text data;
the model unit comprises:
a first feature extraction module, used to perform feature extraction on the rubbish text data;
a first feature vector module, used to generate a rubbish text feature vector;
a model training module, used to train a rubbish text prediction model in combination with the weight of each feature;
the judging unit comprises:
a second feature extraction module, used to perform feature extraction on target text data;
a second feature vector module, used to generate a target text feature vector;
a probability calculation module, used to input the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data, and to judge, according to the probability, whether the target text data is rubbish text data.
Preferably, the rubbish text data includes rubbish text content, and the target text data includes target text content;
the first feature extraction module is used to convert the rubbish text content into a numerical representation;
the second feature extraction module is used to convert the target text content into a numerical representation.
Preferably, converting the rubbish text content into a numerical representation comprises:
extracting keywords from the rubbish text content;
counting the number of times each keyword occurs in the rubbish text content;
listing the number of occurrences of each keyword in the index order of the keywords to form a first space vector, the first space vector serving, when the rubbish text feature vector is generated, either as the rubbish text feature vector or as the values of some of the dimensions of the rubbish text feature vector;
converting the target text content into a numerical representation comprises:
extracting keywords from the target text content;
counting the number of times each keyword occurs in the target text content;
listing the number of occurrences of each keyword in the index order of the keywords to form a second space vector, the second space vector serving, when the target text feature vector is generated, either as the target text feature vector or as the values of some of the dimensions of the target text feature vector.
Preferably, the rubbish text data includes a rubbish text publication time, and the target text data includes a target text publication time;
the first feature extraction module is also used to convert the rubbish text publication time into a numerical representation;
the second feature extraction module is also used to convert the target text publication time into a numerical representation.
Preferably, converting the rubbish text publication time into a numerical representation comprises:
dividing time into a number of time periods, and setting a numeric value for each time period;
determining the first time period to which the rubbish text publication time belongs, and determining the numeric value corresponding to the first time period, the numeric value corresponding to the first time period serving, when the rubbish text feature vector is generated, as the value of one dimension of the rubbish text feature vector, which is merged with the first space vector to form the rubbish text feature vector;
converting the target text publication time into a numerical representation comprises:
determining, according to the time periods already divided, the second time period to which the target text publication time belongs, and determining the numeric value corresponding to the second time period, the numeric value corresponding to the second time period serving, when the target text feature vector is generated, as the value of one dimension of the target text feature vector, which is merged with the second space vector to form the target text feature vector.
Preferably, the weight of each feature is calculated by the ReliefF algorithm, and the rubbish text prediction model in the model training module is trained on the basis of the ReliefF algorithm.
Preferably, the text data filtering system further comprises:
a verification unit, used for manual verification of the target text data determined to be rubbish text data;
and/or a storage unit, used to store, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or the target text data confirmed to be rubbish text data by manual verification.
The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the text data filtering method of any combination of the above preferred conditions.
The present invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the text data filtering method of any combination of the above preferred conditions.
On the basis of common knowledge in the art, the above preferred conditions can be combined in any way to obtain the preferred embodiments of the present invention.
The positive effect of the present invention is as follows: the invention can train a rubbish text prediction model from the rubbish text data in the rubbish text information bank and use the model to intelligently filter out target text data that is rubbish text data, which reduces reliance on administrators, reduces resource consumption and improves identification efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the text data filtering method of preferred embodiment 1 of the present invention.
Fig. 2 is a flowchart of step 102 in the text data filtering method of preferred embodiment 1 of the present invention.
Fig. 3 is a flowchart of step 103 in the text data filtering method of preferred embodiment 1 of the present invention.
Fig. 4 is a schematic block diagram of the text data filtering system of preferred embodiment 2 of the present invention.
Fig. 5 is a schematic diagram of the hardware structure of the electronic device of preferred embodiment 3 of the present invention.
Detailed description of the embodiments
The present invention is further illustrated below by way of embodiments, but is not thereby limited to the scope of the described embodiments.
Embodiment 1
Fig. 1 shows a flowchart of the text data filtering method of this embodiment. The method is mainly used to judge whether target text data is rubbish text data, so that published rubbish text data can be filtered out. In general, rubbish text data refers to any form of text, such as a comment, post or article, whose content is meaningless or touches on sensitive topics and is therefore unsuitable for publication in a public setting.
The text filtering method comprises the following steps:
Step 101, creating a rubbish text information bank, the rubbish text information bank storing at least one piece of rubbish text data. The rubbish text information bank is formed by collecting historical rubbish text data and may be built as a database.
Step 102, performing feature extraction on the rubbish text data to generate a rubbish text feature vector, and training a rubbish text prediction model in combination with the weight of each feature. The rubbish text prediction model is used to predict whether text data is rubbish text data.
Step 103, performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data. The target text data may be any form of text published on a forum, community, message board or other website, such as a comment, post or article, or may be other text.
Step 104, judging, according to the probability, whether the target text data is rubbish text data.
In this embodiment the rubbish text data includes a rubbish text publication time and rubbish text content, but the present invention is not limited to this; the data may also include other related information, such as the account or IP address that published the rubbish text. The following table gives one specific format that can be used to store the data:
Taking rubbish text data that includes rubbish text content and a rubbish text publication time as an example, step 102 is described further below. As shown in Fig. 2, step 102 specifically comprises the following steps:
Step 1021, converting the rubbish text content into a numerical representation and converting the rubbish text publication time into a numerical representation. This implements the feature extraction on the rubbish text data.
The specific process of converting the rubbish text content into a numerical representation is:
extracting keywords from the rubbish text content, where the keywords are usually indecent words, words touching on sensitive topics, or words that occur frequently in rubbish text content; the keywords are preset and have a fixed, unique index order;
counting the number of times each keyword occurs in the rubbish text content;
listing the number of occurrences of each keyword in the index order of the keywords to form the first space vector.
In a specific implementation, keywords may be extracted from the rubbish text content by means of a Word2vec model, or in other ways according to actual needs. Word2vec is an efficient tool that represents words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space: the words in a sentence are converted into low-dimensional continuous values, words with similar meanings are mapped to nearby positions in the vector space, and similarity in the vector space can be used to represent semantic similarity between texts. The basic assumption is that a text can be treated simply as a set of words, ignoring word order, grammar and syntax, with each word in the text independent of the others. An advantage of the Word2vec model is that it removes the hidden layer of the neural network, which reduces the amount of computation.
Suppose there are two simple texts:
John likes to watch movies. Mary likes too.
John also likes to watch football games.
From the words that occur in these two texts, the following dictionary is constructed:
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
The dictionary above contains 10 words, each with a unique index, so each text can be represented by a 10-dimensional vector, as follows:
[1,2,1,1,1,0,0,0,1,1]
[1,1,1,1,0,1,1,1,0,0]
The resulting vector is unrelated to the order in which the words appear in the original text; each entry is the number of times the corresponding word occurs in that text.
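The counting step in this example can be sketched in Python as follows. This is a minimal illustration of plain keyword counting (a bag-of-words count vector) rather than a trained Word2vec model; the function name `count_vector` and the simple regular-expression tokenizer are assumptions made for the sketch and are not part of the source.

```python
import re
from collections import Counter

# Keyword index copied from the example dictionary (1-based indices).
KEYWORD_INDEX = {
    "John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
    "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10,
}

def count_vector(text, keyword_index=KEYWORD_INDEX):
    """Return the occurrence count of each keyword, listed in index order."""
    # Hypothetical tokenizer: runs of ASCII letters, case-sensitive.
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter(tokens)
    ordered_keywords = sorted(keyword_index, key=keyword_index.get)
    # Counter returns 0 for keywords that never occur in the text.
    return [counts[word] for word in ordered_keywords]

first = count_vector("John likes to watch movies. Mary likes too.")
second = count_vector("John also likes to watch football games.")
```

With the two example texts, `first` and `second` reproduce the two 10-dimensional vectors listed above. Chinese text would need a word segmenter in place of the regular expression, which is outside the scope of this sketch.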
This embodiment treats each keyword as a feature, obtains the representation of each keyword in the vector space, and finally converts the rubbish text content into the first space vector.
The specific process of converting the rubbish text publication time into a numerical representation is:
dividing time into a number of time periods, and setting a numeric value for each time period;
determining the first time period to which the rubbish text publication time belongs, and determining the numeric value corresponding to the first time period.
The time periods may be divided freely, or according to the intervals in which rubbish text has tended to be published in past experience, and the numeric value for each time period may also be set freely. In this embodiment a day is divided into 4 time periods, where:
0:00 to 10:00 is the early-morning period, with the corresponding numeric value set to 0;
10:00 to 14:00 is the midday period, with the corresponding numeric value set to 1;
14:00 to 19:00 is the afternoon period, with the corresponding numeric value set to 2;
19:00 to 24:00 is the evening period, with the corresponding numeric value set to 3.
If the rubbish text publication time of a piece of rubbish text data is 11:00, the publication time falls in the period 10:00 to 14:00 and the corresponding numeric value is 1.
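The period lookup described here can be sketched as below. The four periods and their values come from this embodiment; the function name `time_period_value` is hypothetical, and assigning a boundary time such as 10:00 to the later period is an assumption, since the source does not say which period a boundary belongs to.

```python
# (start_hour, end_hour, value) for the four periods of this embodiment.
TIME_PERIODS = [(0, 10, 0), (10, 14, 1), (14, 19, 2), (19, 24, 3)]

def time_period_value(hour, minute=0):
    """Map a publication time of day to the numeric value of its period."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("time of day out of range")
    time_in_hours = hour + minute / 60
    for start, end, value in TIME_PERIODS:
        # Assumption: each period is half-open, [start, end).
        if start <= time_in_hours < end:
            return value
    raise ValueError("no period covers this time")  # not reachable for valid input

value_at_11 = time_period_value(11)
```

Here `value_at_11` is 1, matching the 11:00 example in the text.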
Step 1022, taking the numeric value corresponding to the first time period as the value of one dimension of the rubbish text feature vector and merging it with the first space vector to form the rubbish text feature vector. This implements the generation of the rubbish text feature vector.
For example, for a piece of rubbish text data published at 7:00, the corresponding feature vector is:
[0,1,2,3,2,1,0,4 ...], where the first number, 0, represents the rubbish text publication time, and the following numbers represent the first space vector obtained by converting the rubbish text content with Word2vec.
Of course, if the rubbish text data includes only rubbish text content and no rubbish text publication time, the first space vector can be used directly as the rubbish text feature vector; if the rubbish text data also includes other related information, that information can likewise be converted into numeric values and used as the values of some of the dimensions of the rubbish text feature vector, so that it takes part in the computation of the rubbish text prediction model.
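The merging step can be sketched as a simple concatenation. The function name `build_feature_vector` and the `extra_values` parameter are hypothetical; placing the time value first follows the example vector in the text, while the position of any other numeric information is an assumption.

```python
def build_feature_vector(keyword_counts, time_value=None, extra_values=()):
    """Merge the optional time-period value with the keyword count vector."""
    vector = []
    if time_value is not None:
        # The example in the text places the time value in the first dimension.
        vector.append(time_value)
    vector.extend(keyword_counts)
    # Assumption: other numeric features (e.g. a quantized account or IP
    # attribute) are appended after the keyword counts.
    vector.extend(extra_values)
    return vector

example = build_feature_vector([1, 2, 3, 2, 1, 0, 4], time_value=0)
content_only = build_feature_vector([1, 2, 3, 2, 1, 0, 4])
```

When no publication time is available, `content_only` is just the keyword count vector, as the paragraph above describes.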
Step 1023, training the rubbish text prediction model in combination with the weight of each feature. In this embodiment the weight of each feature is calculated by the ReliefF algorithm and saved in the model of the ReliefF algorithm, and the rubbish text prediction model is trained on the basis of the ReliefF algorithm. Other algorithms may of course be used to calculate the weight of each feature and to train the corresponding algorithm model.
In the ReliefF algorithm, the relevance of a feature to the class is measured by the feature's ability to distinguish between samples that lie close to each other. The algorithm randomly selects a sample R from the training set D, then finds the nearest neighbor sample H among the samples of the same class as R, called the Near Hit, and the nearest neighbor sample M among the samples of a different class from R, called the Near Miss. It then updates the weight of each feature according to the following rule: if the distance between R and the Near Hit on a feature is less than the distance between R and the Near Miss, the feature helps to distinguish nearest neighbors of the same class from those of a different class, and the weight of the feature is increased; conversely, if the distance between R and the Near Hit on a feature is greater than the distance between R and the Near Miss, the feature has a negative effect on distinguishing nearest neighbors of the same class from those of a different class, and the weight of the feature is decreased. This process is repeated m times, finally yielding the average weight of each feature. The larger the weight of a feature, the stronger its classification ability; the smaller the weight, the weaker its classification ability. For multi-class problems, each iteration randomly selects a sample R from the sample set, finds the k nearest neighbor samples (near hits) of R among the samples of the same class, finds k nearest neighbor samples (near misses) among the samples of each class different from that of R, and then updates the weight of each feature.
The running time of the ReliefF algorithm increases linearly with the number of sampling iterations m and the number of original features N, so the algorithm runs efficiently.
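The weight update rule can be illustrated with a minimal two-class sketch using a single nearest hit and a single nearest miss (that is, the basic Relief case with k = 1, not the full multi-class ReliefF). Several points are assumptions made for the sketch: the function name `relief_weights`, the Manhattan distance on range-scaled features, visiting every sample once instead of drawing m random samples (so that the result is deterministic), and a labeled training set containing both rubbish and non-rubbish examples, which the source does not describe.

```python
def relief_weights(samples, labels):
    """Two-class Relief with k = 1; each class needs at least two samples."""
    n_features = len(samples[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        column = [s[j] for s in samples]
        spans.append((max(column) - min(column)) or 1)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(n_features))

    weights = [0.0] * n_features
    m = len(samples)  # every sample is visited once
    for i, r in enumerate(samples):
        hits = [s for k, s in enumerate(samples)
                if labels[k] == labels[i] and k != i]
        misses = [s for k, s in enumerate(samples) if labels[k] != labels[i]]
        near_hit = min(hits, key=lambda s: distance(r, s))
        near_miss = min(misses, key=lambda s: distance(r, s))
        for j in range(n_features):
            # Reward features that differ on the miss, penalize those that
            # differ on the hit; dividing by m averages over iterations.
            weights[j] += (diff(r, near_miss, j) - diff(r, near_hit, j)) / m
    return weights

# Toy data: feature 0 separates the classes, feature 1 does not.
demo_samples = [[0, 5], [0, 1], [1, 5], [1, 1]]
demo_labels = [0, 0, 1, 1]
demo_weights = relief_weights(demo_samples, demo_labels)
```

On the toy data the separating feature receives a higher weight than the uninformative one. How the resulting weights are turned into a probability is not specified in the source, so the sketch stops at the weights.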
In this embodiment the target text data includes a target text publication time and target text content, but the present invention is not limited to this; the data may also include other related information, such as the account or IP address that published the target text.
Taking target text data that includes target text content and a target text publication time as an example, step 103 is described further below. As shown in Fig. 3, step 103 specifically comprises the following steps:
Step 1031, converting the target text content into a numerical representation and converting the target text publication time into a numerical representation. This implements the feature extraction on the target text data.
The specific process of converting the target text content into a numerical representation is:
extracting keywords from the target text content, the keywords being the same as those set in step 1021 and having the same index order;
counting the number of times each keyword occurs in the target text content;
listing the number of occurrences of each keyword in the index order of the keywords to form the second space vector.
In a specific implementation, keywords may likewise be extracted from the target text content by means of Word2vec, or in other ways according to actual needs. The process of forming the second space vector follows the process of forming the first space vector and is not repeated here.
The specific process of converting the target text publication time into a numerical representation is:
determining, according to the time periods already divided, the second time period to which the target text publication time belongs, and determining the numeric value corresponding to the second time period.
Step 1032, taking the numeric value corresponding to the second time period as the value of one dimension of the target text feature vector and merging it with the second space vector to form the target text feature vector. This implements the generation of the target text feature vector.
For example, for a piece of target text data published at 18:00, the corresponding feature vector is:
[2,1,3,0,1,2,0,4 ...], where the first number, 2, represents the target text publication time, and the following numbers represent the second space vector obtained by converting the target text content with Word2vec.
Of course, if the target text data includes only target text content and no target text publication time, the second space vector can be used directly as the target text feature vector; if the target text data also includes other related information, that information can likewise be converted into numeric values and used as the values of some of the dimensions of the target text feature vector.
Step 1033, inputting the target text feature vector into the rubbish text prediction model and calculating the model output, which represents the probability that the target text data is rubbish text data.
In addition, step 104 may specifically be set as follows: judging whether the probability is greater than a probability threshold; if so, determining that the target text data is rubbish text data; if not, determining that the target text data is non-rubbish text data. The probability threshold can be set as needed: the larger the threshold, the stricter the requirement for determining that text is rubbish text data; the smaller the threshold, the looser the requirement.
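The threshold comparison of step 104 can be sketched as follows. The function name `is_rubbish` and the default threshold of 0.5 are hypothetical; the source leaves the threshold to be set as needed and states only that the probability must be greater than it.

```python
def is_rubbish(probability, threshold=0.5):
    """Return True when the predicted probability exceeds the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Strictly greater than, as in step 104; a probability equal to the
    # threshold is treated as non-rubbish.
    return probability > threshold

strict = is_rubbish(0.8, threshold=0.9)
loose = is_rubbish(0.8, threshold=0.7)
```

The same probability of 0.8 is rejected as rubbish under the looser threshold of 0.7 but passes under the stricter threshold of 0.9, which is the trade-off the paragraph above describes.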
Target text data determined to be rubbish text data can be deleted automatically or handled by an administrator.
To further confirm whether the judgment in step 104 is correct, the text data filtering method may further include, after step 104:
manually verifying the target text data determined to be rubbish text data. For target text data that manual verification finds to have been wrongly judged as rubbish text data, the judgment is corrected and the cause of the misjudgment is traced, so that the rubbish text prediction model can be further corrected and the accuracy of the judgment improved.
To collect more rubbish text data and expand the rubbish text information bank, the text data filtering method may further include, after step 104:
storing, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or the target text data confirmed to be rubbish text data by manual verification.
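One possible way to combine the model judgment with the optional manual verification when expanding the information bank is sketched below. The function name `update_information_bank`, the use of a plain list as the bank, and the rule that a manual rejection overrides the model are assumptions for illustration; the source says only that the bank may be built as a database.

```python
def update_information_bank(bank, text, judged_rubbish, manual_verdict=None):
    """Store text in the bank if it is (still) considered rubbish.

    manual_verdict is None when no manual verification took place,
    True when a reviewer confirmed the text as rubbish, and
    False when a reviewer found the model's judgment to be wrong.
    """
    if not judged_rubbish:
        return False  # only texts judged as rubbish are candidates
    if manual_verdict is False:
        return False  # misjudgment corrected by manual verification
    if text not in bank:
        bank.append(text)
    return True

bank = []
stored_auto = update_information_bank(bank, "text A", judged_rubbish=True)
stored_rejected = update_information_bank(bank, "text B", True, manual_verdict=False)
```

After these two calls the bank contains only "text A", since the second text was rejected during manual verification.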
Embodiment 2
Fig. 4 shows a schematic block diagram of the text data filtering system of this embodiment. The text data filtering system is mainly used to judge whether target text data is rubbish text data, so as to filter published rubbish text data.
The text data filtering system includes a data unit 201, a model unit 202 and a judging unit 203.
The data unit 201 is used to create a rubbish text information bank, which stores at least one piece of rubbish text data. The rubbish text information bank is formed by collecting historical rubbish text data and may specifically be established in the form of a database. In this embodiment the rubbish text data includes a rubbish text issuing time and rubbish text content, but the present invention is not limited thereto; the data may also include other related information, such as the account or IP address that published the rubbish text.
The model unit 202 includes a first feature extraction module 2021, a first feature vector module 2022 and a model training module 2023.
The first feature extraction module 2021 is used to perform feature extraction on the rubbish text data.
The first feature vector module 2022 is used to generate a rubbish text feature vector.
The model training module 2023 is used to train a rubbish text prediction model in combination with the weight of each feature. The rubbish text prediction model is used to predict whether text data is rubbish text data.
The judging unit 203 includes a second feature extraction module 2031, a second feature vector module 2032 and a probability calculation module 2033.
The second feature extraction module 2031 is used to perform feature extraction on target text data. The target text data may be text in any form, such as a comment, post or article published on a forum, community, discussion board or other website, or other text. In this embodiment the target text data includes a target text issuing time and target text content, but the present invention is not limited thereto; the data may also include other related information, such as the account or IP address that published the target text.
The second feature vector module 2032 is used to generate a target text feature vector.
The probability calculation module 2033 is used to input the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data, and to judge according to the probability whether the target text data is rubbish text data.
The first feature extraction module 2021, the first feature vector module 2022 and the model training module 2023 are described further below:
The first feature extraction module 2021 converts the rubbish text content into a numerical representation and converts the rubbish text issuing time into a numerical representation, thereby realizing feature extraction on the rubbish text data.
Converting the rubbish text content into a numerical representation includes:
Extracting keywords from the rubbish text content, where the keywords are preset and assigned a fixed, unique index order;
Counting the number of times each keyword occurs in the rubbish text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a first space vector.
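The three content-conversion steps above amount to a keyword count vector. The sketch below assumes a hypothetical preset keyword list and simple whitespace tokenization; real Chinese text would need a word segmenter, and the text does not prescribe either choice.

```python
# Hypothetical preset keywords; the list position is the fixed, unique index.
KEYWORDS = ["discount", "click", "free"]


def keyword_count_vector(content, keywords=KEYWORDS):
    """Count each preset keyword and list the counts in index order."""
    tokens = content.lower().split()   # assumed tokenization
    return [tokens.count(keyword) for keyword in keywords]


print(keyword_count_vector("Free gift click now click here free free"))  # [0, 2, 3]
```

Because the index order is fixed, the same function can be reused for target text so that both vectors share the same dimensions.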
Converting the rubbish text issuing time into a numerical representation includes:
Dividing time into several time periods and setting a numerical value for each time period;
Determining the first time period to which the rubbish text issuing time belongs, and determining the numerical value corresponding to the first time period.
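The time conversion can be pictured as a lookup from issuing time to a per-period value. The division of the day into four six-hour periods and the values 1 to 4 below are hypothetical; the text leaves both the division and the values open.

```python
from datetime import datetime

# Hypothetical division: (start hour inclusive, end hour exclusive, numerical value).
PERIODS = [(0, 6, 1), (6, 12, 2), (12, 18, 3), (18, 24, 4)]


def period_value(issued_at, periods=PERIODS):
    """Return the numerical value of the period that contains the issuing time."""
    for start, end, value in periods:
        if start <= issued_at.hour < end:
            return value
    raise ValueError("periods do not cover hour %d" % issued_at.hour)


print(period_value(datetime(2017, 12, 27, 3, 30)))  # 1
```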
The first feature vector module 2022 takes the numerical value corresponding to the first time period as the value of one dimension of the rubbish text feature vector and merges it with the first space vector to form the rubbish text feature vector. This completes feature extraction for the rubbish text data. Of course, if the rubbish text data includes only the rubbish text content and no rubbish text issuing time, the first space vector can be used directly as the rubbish text feature vector. If the rubbish text data also includes other related information, that information can be converted to numerical values and used as the values of some dimensions of the rubbish text feature vector, participating in the calculation of the rubbish text prediction model.
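A minimal sketch of this merging step follows. Appending the time-period value after the keyword counts, and appending any other quantified information after that, is an assumed dimension order; the text only requires that each piece occupy some dimension of the feature vector.

```python
def build_feature_vector(space_vector, period_value=None, extra=()):
    """Merge keyword counts, an optional time-period value and optional extra values."""
    vector = list(space_vector)
    if period_value is not None:
        vector.append(period_value)   # one dimension for the issuing-time period
    vector.extend(extra)              # other quantified information, if any
    return vector


print(build_feature_vector([0, 2, 3], period_value=1))  # [0, 2, 3, 1]
print(build_feature_vector([0, 2, 3]))                  # [0, 2, 3]
```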
The model training module 2023 calculates the weight of each feature with the ReliefF algorithm; the weights are saved in the ReliefF model, and the rubbish text prediction model is trained on the basis of the ReliefF algorithm. Of course, other algorithms can also be used to calculate the weight of each feature and to train the corresponding algorithm model.
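To give a feel for Relief-style feature weighting, the sketch below implements a simplified variant, not full ReliefF: it uses a single nearest hit and a single nearest miss per sample, handles two classes only, and visits every sample instead of sampling at random. It also assumes that labelled non-rubbish samples are available alongside the rubbish samples, which the text does not state. All names and data are hypothetical.

```python
def relief_weights(samples, labels):
    """Simplified Relief: reward features that differ on the nearest miss
    and penalise features that differ on the nearest hit."""
    n_features = len(samples[0])
    # Per-feature value range, used to normalise differences to [0, 1].
    spans = []
    for j in range(n_features):
        column = [s[j] for s in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(n_features))

    weights = [0.0] * n_features
    for i, x in enumerate(samples):
        hits = [s for k, s in enumerate(samples) if k != i and labels[k] == labels[i]]
        misses = [s for k, s in enumerate(samples) if labels[k] != labels[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda s: distance(x, s))
        miss = min(misses, key=lambda s: distance(x, s))
        for j in range(n_features):
            weights[j] += (diff(x, miss, j) - diff(x, hit, j)) / len(samples)
    return weights


# Feature 0 separates the classes; feature 1 does not.
samples = [[3, 1], [4, 2], [0, 1], [1, 2]]
labels = [1, 1, 0, 0]   # 1 = rubbish, 0 = non-rubbish (assumed labels)
print(relief_weights(samples, labels))  # [0.5, -1.0]
```

The discriminative feature receives the larger weight, which is the property a weight-based prediction model would rely on.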
The second feature extraction module 2031, the second feature vector module 2032 and the probability calculation module 2033 are described further below:
The second feature extraction module 2031 converts the target text content into a numerical representation and converts the target text issuing time into a numerical representation, thereby realizing feature extraction on the target text data. Converting the target text content into a numerical representation includes:
Extracting keywords from the target text content;
Counting the number of times each keyword occurs in the target text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a second space vector.
In a specific implementation, extracting keywords from the target text content can likewise be realized with Word2vec, or in other ways according to actual needs. The detailed process of forming the second space vector can refer to the process of forming the first space vector and is not repeated here.
Converting the target text issuing time into a numerical representation includes:
According to the time periods already divided, determining the second time period to which the target text issuing time belongs, and determining the numerical value corresponding to the second time period.
The second feature vector module 2032 takes the numerical value corresponding to the second time period as the value of one dimension of the target text feature vector and merges it with the second space vector to form the target text feature vector. This completes feature extraction for the target text data. Of course, if the target text data includes only the target text content and no target text issuing time, the second space vector can be used directly as the target text feature vector. If the target text data also includes other related information, that information can be converted to numerical values and used as the values of some dimensions of the target text feature vector, which together form the final target text feature vector.
The probability calculation module 2033 is used to input the target text feature vector into the rubbish text prediction model and compute the model output, which represents the probability that the target text data is rubbish text data. If the probability is greater than a probability threshold, the target text data is determined to be rubbish text data; if the probability is not greater than the probability threshold, the target text data is determined to be non-rubbish text data. The probability threshold can be set as needed: the larger the threshold, the stricter the requirement for determining that data is rubbish text data; conversely, the smaller the threshold, the looser the requirement.
Target text data determined to be rubbish text data can be deleted automatically or handled by an administrator.
To further confirm whether the judgment result of the judging unit 203 is correct, the text data filtering system further includes:
A verification unit 204, used for manual verification of the target text data determined to be rubbish text data. For target text data that manual verification finds was misjudged as rubbish text data, the judgment result is corrected and the cause of the misjudgment is traced, so that the rubbish text prediction model can be further corrected and the accuracy of judgment improved.
To collect more rubbish text data and expand the rubbish text information bank, the text data filtering system further includes:
A storage unit 205, used to store, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or confirmed to be rubbish text data by manual verification.
Embodiment 3
Fig. 5 is a schematic structural diagram of an electronic device provided by Embodiment 3 of the present invention. The electronic device includes a memory, a processor, and a computer program stored on the memory and runnable on the processor; the processor implements the text data filtering method of Embodiment 1 when executing the program. The electronic device 30 shown in Fig. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the electronic device 30 may take the form of a general-purpose computing device, for example a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting different system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
The memory 32 may also include a program/utility 325 having a set of (at least one) program modules 324. Such program modules 324 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the text data filtering method provided by Embodiment 1 of the present invention, by running the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (such as a keyboard or a pointing device). Such communication can take place through an input/output (I/O) interface 35. The electronic device 30 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 36. As shown in the figure, the network adapter 36 communicates with the other modules of the electronic device 30 through the bus 33. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the detailed description above, this division is only exemplary and not mandatory. In fact, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
Embodiment 4
This embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the text data filtering method provided by Embodiment 1 are implemented.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the above.
In a possible implementation, the present invention can also be implemented in the form of a program product that includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps of the text data filtering method described in Embodiment 1.
The program code for carrying out the present invention can be written in any combination of one or more programming languages. The program code can execute entirely on a user device, partly on the user device, as a standalone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
Although specific embodiments of the present invention have been described above, those skilled in the art will appreciate that these are merely illustrative; the protection scope of the present invention is defined by the appended claims. Those skilled in the art can make various changes or modifications to these embodiments without departing from the principle and substance of the present invention, and all such changes and modifications fall within the protection scope of the present invention.

Claims (16)

1. A text data filtering method, characterized in that the text data filtering method includes:
Creating a rubbish text information bank, the rubbish text information bank storing at least one piece of rubbish text data;
Performing feature extraction on the rubbish text data, generating a rubbish text feature vector, and training a rubbish text prediction model in combination with the weight of each feature;
Performing feature extraction on target text data, generating a target text feature vector, and inputting the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data;
Judging according to the probability whether the target text data is rubbish text data.
2. The text data filtering method according to claim 1, characterized in that the rubbish text data includes rubbish text content and the target text data includes target text content;
Performing feature extraction on the rubbish text data includes: converting the rubbish text content into a numerical representation;
Performing feature extraction on the target text data includes: converting the target text content into a numerical representation.
3. The text data filtering method according to claim 2, characterized in that converting the rubbish text content into a numerical representation includes:
Extracting keywords from the rubbish text content;
Counting the number of times each keyword occurs in the rubbish text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a first space vector, the first space vector being used, when the rubbish text feature vector is generated, as the rubbish text feature vector or as the values of some dimensions of the rubbish text feature vector;
Converting the target text content into a numerical representation includes:
Extracting keywords from the target text content;
Counting the number of times each keyword occurs in the target text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a second space vector, the second space vector being used, when the target text feature vector is generated, as the target text feature vector or as the values of some dimensions of the target text feature vector.
4. The text data filtering method according to claim 3, characterized in that the rubbish text data includes a rubbish text issuing time and the target text data includes a target text issuing time;
Performing feature extraction on the rubbish text data further includes: converting the rubbish text issuing time into a numerical representation;
Performing feature extraction on the target text data further includes: converting the target text issuing time into a numerical representation.
5. The text data filtering method according to claim 4, characterized in that converting the rubbish text issuing time into a numerical representation includes:
Dividing time into several time periods and setting a numerical value for each time period;
Determining the first time period to which the rubbish text issuing time belongs and determining the numerical value corresponding to the first time period, the numerical value corresponding to the first time period being used, when the rubbish text feature vector is generated, as the value of one dimension of the rubbish text feature vector and merged with the first space vector to form the rubbish text feature vector;
Converting the target text issuing time into a numerical representation includes:
According to the time periods already divided, determining the second time period to which the target text issuing time belongs and determining the numerical value corresponding to the second time period, the numerical value corresponding to the second time period being used, when the target text feature vector is generated, as the value of one dimension of the target text feature vector and merged with the second space vector to form the target text feature vector.
6. The text data filtering method according to claim 1, characterized in that the weight of each feature is calculated by the ReliefF algorithm, and the rubbish text prediction model is trained on the basis of the ReliefF algorithm.
7. The text data filtering method according to claim 1, characterized in that the text data filtering method further includes:
Manually verifying the target text data determined to be rubbish text data;
And/or storing, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or confirmed to be rubbish text data by manual verification.
8. A text data filtering system, characterized in that the text data filtering system includes a data unit, a model unit and a judging unit;
The data unit is used to create a rubbish text information bank, the rubbish text information bank storing at least one piece of rubbish text data;
The model unit includes:
A first feature extraction module, used to perform feature extraction on the rubbish text data;
A first feature vector module, used to generate a rubbish text feature vector;
A model training module, used to train a rubbish text prediction model in combination with the weight of each feature;
The judging unit includes:
A second feature extraction module, used to perform feature extraction on target text data;
A second feature vector module, used to generate a target text feature vector;
A probability calculation module, used to input the target text feature vector into the rubbish text prediction model to calculate the probability that the target text data is rubbish text data, and to judge according to the probability whether the target text data is rubbish text data.
9. The text data filtering system according to claim 8, characterized in that the rubbish text data includes rubbish text content and the target text data includes target text content;
The first feature extraction module is used to convert the rubbish text content into a numerical representation;
The second feature extraction module is used to convert the target text content into a numerical representation.
10. The text data filtering system according to claim 9, characterized in that converting the rubbish text content into a numerical representation includes:
Extracting keywords from the rubbish text content;
Counting the number of times each keyword occurs in the rubbish text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a first space vector, the first space vector being used, when the rubbish text feature vector is generated, as the rubbish text feature vector or as the values of some dimensions of the rubbish text feature vector;
Converting the target text content into a numerical representation includes:
Extracting keywords from the target text content;
Counting the number of times each keyword occurs in the target text content;
Listing the number of occurrences of each keyword according to the index order of the keywords to form a second space vector, the second space vector being used, when the target text feature vector is generated, as the target text feature vector or as the values of some dimensions of the target text feature vector.
11. The text data filtering system according to claim 10, characterized in that the rubbish text data includes a rubbish text issuing time and the target text data includes a target text issuing time;
The first feature extraction module is also used to convert the rubbish text issuing time into a numerical representation;
The second feature extraction module is also used to convert the target text issuing time into a numerical representation.
12. The text data filtering system according to claim 11, characterized in that converting the rubbish text issuing time into a numerical representation includes:
Dividing time into several time periods and setting a numerical value for each time period;
Determining the first time period to which the rubbish text issuing time belongs and determining the numerical value corresponding to the first time period, the numerical value corresponding to the first time period being used, when the rubbish text feature vector is generated, as the value of one dimension of the rubbish text feature vector and merged with the first space vector to form the rubbish text feature vector;
Converting the target text issuing time into a numerical representation includes:
According to the time periods already divided, determining the second time period to which the target text issuing time belongs and determining the numerical value corresponding to the second time period, the numerical value corresponding to the second time period being used, when the target text feature vector is generated, as the value of one dimension of the target text feature vector and merged with the second space vector to form the target text feature vector.
13. The text data filtering system according to claim 8, characterized in that, in the model training module, the weight of each feature is calculated by the ReliefF algorithm, and the rubbish text prediction model is trained on the basis of the ReliefF algorithm.
14. The text data filtering system according to claim 8, characterized in that the text data filtering system further includes:
A verification unit, used for manual verification of the target text data determined to be rubbish text data;
And/or a storage unit, used to store, in the rubbish text information bank, the target text data determined to be rubbish text data by the rubbish text prediction model or confirmed to be rubbish text data by manual verification.
15. An electronic device including a memory, a processor, and a computer program stored on the memory and runnable on the processor, characterized in that the processor implements the text data filtering method according to any one of claims 1 to 7 when executing the program.
16. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the text data filtering method according to any one of claims 1 to 7 are implemented.
CN201711449882.XA 2017-12-27 2017-12-27 Text filtering method, system, equipment and computer readable storage medium Active CN110019763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711449882.XA CN110019763B (en) 2017-12-27 2017-12-27 Text filtering method, system, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110019763A true CN110019763A (en) 2019-07-16
CN110019763B CN110019763B (en) 2022-04-12

Family

ID=67187050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711449882.XA Active CN110019763B (en) 2017-12-27 2017-12-27 Text filtering method, system, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110019763B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101106539A (en) * 2007-08-03 2008-01-16 浙江大学 Filtering method for spam based on supporting vector machine
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression
JP2011048488A (en) * 2009-08-25 2011-03-10 Nippon Telegr & Teleph Corp <Ntt> Apparatus, system, method and program for analysis of data flow
CN103186845A (en) * 2011-12-29 2013-07-03 盈世信息科技(北京)有限公司 Junk mail filtering method
CN103473369A (en) * 2013-09-27 2013-12-25 清华大学 Semantic-based information acquisition method and semantic-based information acquisition system
CN104111925A (en) * 2013-04-16 2014-10-22 中国移动通信集团公司 Item recommendation method and device
CN107256245A (en) * 2017-06-02 2017-10-17 河海大学 Improved and system of selection towards the off-line model that refuse messages are classified

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516066A (en) * 2019-07-23 2019-11-29 同盾控股有限公司 A kind of content of text safety protecting method and device
CN110516066B (en) * 2019-07-23 2022-04-15 同盾控股有限公司 Text content safety protection method and device
CN110442875A (en) * 2019-08-12 2019-11-12 北京思维造物信息科技股份有限公司 A kind of text checking method, apparatus and system
CN113538002A (en) * 2020-04-14 2021-10-22 北京沃东天骏信息技术有限公司 Method and device for auditing texts

Also Published As

Publication number Publication date
CN110019763B (en) 2022-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant