CN109657134A - Data filtering method and device - Google Patents
- Publication number: CN109657134A (CN201811313297.1A)
- Authority: CN (China)
- Prior art keywords: data, sensitive keywords, tested, text title information, word
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data filtering method and device. The method comprises: obtaining text title information of data to be detected, and judging whether the text title information includes a sensitive keyword in a preset keyword library; if the text title information includes a sensitive keyword in the preset keyword library, obtaining the number of the sensitive keywords; obtaining the network click count of the data to be detected; and filtering the data to be detected based on its network click count and the number of sensitive keywords included in the text title information. The scheme provided by the invention can not only quickly filter out junk data such as violent or vulgar content, but can also promptly identify data that is more deeply hidden and needs filtering, improving filtering efficiency and the network environment.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a data filtering method and device.
Background art
With the continuous development of network technology, more and more people publish, transmit and obtain various kinds of information and data over the network. Because the network's coverage is very broad, the types and forms of data propagated on it are also numerous, such as text, images, sound and video. Among the data spread on the network, besides various news, entertainment and encyclopedia data, there is also a considerable amount of bad data such as vulgar or violent content. Suppressing and filtering such data is therefore particularly important.
Summary of the invention
The present invention provides a data filtering method and device to overcome the above problem or at least partially solve it.
According to one aspect of the invention, a data filtering method is provided, comprising:
obtaining text title information of data to be detected, and judging whether the text title information includes a sensitive keyword in a preset keyword library;
if the text title information includes a sensitive keyword in the preset keyword library, obtaining the number of the sensitive keywords;
obtaining the network click count of the data to be detected;
filtering the data to be detected based on its network click count and the number of sensitive keywords included in the text title information.
Optionally, filtering the data to be detected based on its network click count and the number of sensitive keywords included in the text title information comprises:
if the network click count of the data to be detected exceeds a first preset click count, and the number of sensitive keywords included in the text title information exceeds a first preset value, filtering the data to be detected; and/or
if the network click count of the data to be detected is below a second preset click count, and the number of sensitive keywords included in the text title information exceeds a second preset value, filtering the data to be detected.
Optionally, obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
obtaining the network click count of each piece of data in a preset database, sorting the data by network click count, and generating a hot data library from the data whose network click count falls within a preset range;
selecting any piece of data in the hot data library as the data to be detected, and obtaining its text title information;
judging whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, before judging whether the text title information includes a sensitive keyword in the preset keyword library, the method further comprises:
obtaining sensitive keywords that have passed manual review and/or sensitive keywords extracted from the text title information of already filtered data;
building the preset keyword library from the sensitive keywords.
Optionally, judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
segmenting the text title information into words to obtain at least one word that it includes;
matching the word against the sensitive keywords in the preset keyword library;
if the word matches a sensitive keyword in the preset keyword library, judging that the text title information includes a sensitive keyword in the preset keyword library;
if the word matches no sensitive keyword in the preset keyword library, judging that the text title information does not include a sensitive keyword in the preset keyword library.
Optionally, the data to be detected includes internet video data, and obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library comprises:
obtaining the text title information of video data stored on a video server and/or of video data live-streamed by a streamer, and judging whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, obtaining the text title information of the data to be detected and judging whether the text title information includes a sensitive keyword in the preset keyword library further comprises:
obtaining the text title information of the video data a user is currently watching, and judging whether the text title information includes a sensitive keyword in the preset keyword library.
According to another aspect of the invention, a data filtering device is also provided, comprising:
a judgment module, configured to obtain text title information of data to be detected and judge whether the text title information includes a sensitive keyword in a preset keyword library;
a first obtaining module, configured to obtain the number of the sensitive keywords if the text title information includes a sensitive keyword in the preset keyword library;
a second obtaining module, configured to obtain the network click count of the data to be detected;
a filtering module, configured to filter the data to be detected based on its network click count and the number of sensitive keywords included in the text title information.
Optionally, the filtering module includes:
a first filtering unit, configured to filter the data to be detected when its network click count exceeds a first preset click count and the number of sensitive keywords included in the text title information exceeds a first preset value; and/or
a second filtering unit, configured to filter the data to be detected when its network click count is below a second preset click count and the number of sensitive keywords included in the text title information exceeds a second preset value.
Optionally, the judgment module is further configured to:
obtain the network click count of each piece of data in a preset database, sort the data by network click count, and generate a hot data library from the data whose network click count falls within a preset range;
select any piece of data in the hot data library as the data to be detected, and obtain its text title information;
judge whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, the judgment module is further configured to:
before judging whether the text title information includes a sensitive keyword in the preset keyword library, obtain sensitive keywords that have passed manual review and/or sensitive keywords extracted from the text title information of already filtered data;
build the preset keyword library from the sensitive keywords.
Optionally, the judgment module is further configured to:
segment the text title information into words to obtain at least one word that it includes;
match the word against the sensitive keywords in the preset keyword library;
when the word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library;
when the word matches no sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
Optionally, the data to be detected includes internet video data;
the judgment module is further configured to obtain the text title information of video data stored on a video server and/or of video data live-streamed by a streamer, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
Optionally, the judgment module is further configured to obtain the text title information of the video data a user is currently watching, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
According to another aspect of the invention, a computer storage medium is also provided. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the data filtering method described in any of the above.
According to another aspect of the invention, a computing device is also provided, comprising:
a processor;
a memory storing computer program code;
when the computer program code is run by the processor, the computing device is caused to execute the data filtering method described in any of the above.
The present invention provides a more efficient data filtering method and device. In the data filtering method provided by the invention, it is judged whether the text title information of the data to be detected includes sensitive keywords and, if so, their number is obtained. The network click count of the data to be detected is also obtained to gauge how widely it has spread, and the data is then filtered based on both the number of sensitive keywords in its text title information and its network click count. By combining sensitive keywords with popularity when detecting and filtering the data to be detected, the method can not only directly filter out junk data such as violent or vulgar content, but can also promptly identify data that is more deeply hidden and needs filtering, thereby improving the filtering efficiency for bad data and junk data and improving the network environment.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the invention.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a schematic flowchart of a data filtering method according to an embodiment of the invention;
Fig. 2 is a schematic flowchart of a data filtering method according to a preferred embodiment of the invention;
Fig. 3 is a schematic structural diagram of a data filtering device according to an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a data filtering device according to a preferred embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 is a schematic flowchart of a data filtering method according to an embodiment of the invention. As shown in Fig. 1, the data filtering method according to an embodiment of the invention may include:
Step S102: obtain the text title information of the data to be detected, and judge whether the text title information includes a sensitive keyword in the preset keyword library;
Step S104: if the text title information includes a sensitive keyword in the preset keyword library, obtain the number of the sensitive keywords;
Step S106: obtain the network click count of the data to be detected;
Step S108: filter the data to be detected based on its network click count and the number of sensitive keywords included in the text title information.
Embodiments of the invention provide an efficient data filtering method. It is judged whether the text title information of the data to be detected includes sensitive keywords and, if so, their number is obtained. The network click count of the data to be detected is also obtained to gauge how widely it has spread, and the data is then filtered based on both the number of sensitive keywords in its text title information and its network click count. Because some bad data, such as vulgar or violent content, may spread quickly and become very popular, the data filtering method of the embodiments detects and filters the data to be detected by combining sensitive keywords with popularity. This not only directly filters out junk data such as violent or vulgar content, but also promptly identifies data that is more deeply hidden and needs filtering, thereby improving the filtering efficiency for bad data and junk data and improving the network environment.
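The flow of steps S102 to S108 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `Item` structure, the function names, and the whitespace-based word splitting are all assumptions made for the example, and the concrete filtering rule is passed in as a parameter because it is only specified later in the description.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass
class Item:
    """A piece of data to be detected (hypothetical structure)."""
    title: str   # text title information
    clicks: int  # network click count


def count_sensitive_keywords(title: str, library: Set[str]) -> int:
    """Steps S102/S104: count title words found in the preset keyword library.

    Whitespace splitting is only a stand-in for real word segmentation.
    """
    return sum(1 for word in title.lower().split() if word in library)


def filter_items(
    items: Iterable[Item],
    library: Set[str],
    should_filter: Callable[[int, int], bool],
) -> List[Item]:
    """Return the items that survive filtering (steps S102 to S108)."""
    kept = []
    for item in items:
        hits = count_sensitive_keywords(item.title, library)  # S102, S104
        # S106/S108: combine click count and keyword count in the rule.
        if hits > 0 and should_filter(item.clicks, hits):
            continue  # this item is filtered out
        kept.append(item)
    return kept
```

Keeping the rule as a callable means the same loop can be reused with whatever thresholds a deployment chooses.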
Optionally, the data to be detected in embodiments of the invention may be internet picture data, video data and the like. When the data to be detected is internet video data, it may be video data stored on a video server and/or video data live-streamed by a streamer. That is, step S102 may further include: obtaining the text title information of video data stored on a video server and/or of video data live-streamed by a streamer, and judging whether the text title information includes a sensitive keyword in the preset keyword library. With the rise of short video, more and more users upload casually shot videos over the network to share with other network users. Video data uploaded by each network user is stored on the video server and is released to other network users only after detection and filtering, which safeguards the online safety of other users and prevents bad data from spreading.
In addition, live streaming is currently a popular way of propagating data, but its data volume is very large and its propagation speed is hard to control. Therefore, besides detecting video data stored on the video server, embodiments of the invention can also detect video data live-streamed by a streamer, so that bad data is filtered and suppressed in time and the network environment is kept clean.
In an optional embodiment of the invention, step S102 may further include obtaining the text title information of the video data a user is currently watching, and judging whether that text title information includes a sensitive keyword in the preset keyword library. For some video data, other users may request it as soon as a user has uploaded it. The scheme provided by embodiments of the invention can therefore detect the video data a user is currently watching, that is, perform data detection while providing the video service to the user. Alternatively, the video data may be detected before the video service is provided to the user, to further improve data processing efficiency.
As mentioned for step S108, the data to be detected can be filtered based on its network click count and the number of sensitive keywords included in its text title information. In a preferred embodiment of the invention, a filtering rule can be preset, and the data to be detected is filtered when its network click count and the number of sensitive keywords in its text title information satisfy the rule. Taking video data as an example, some video data may be vulgar even though the sensitive keywords in its text title information appear only occasionally, as an attention-grabbing element; note, however, that the click count of such video data may be very large. Video data that contains some sensitive keywords and has an especially high click count is therefore likely to be vulgar and should be filtered. Data whose text title information itself includes a large number of sensitive keywords can be filtered directly even if its click count is not high. That is, when filtering the data to be detected, step S108 may proceed as follows:
Step S108-1: if the network click count of the data to be detected exceeds the first preset click count, and the number of sensitive keywords included in the text title information exceeds the first preset value, filter the data to be detected; and/or
Step S108-2: if the network click count of the data to be detected is below the second preset click count, and the number of sensitive keywords included in the text title information exceeds the second preset value, filter the data to be detected.
That is, embodiments of the invention can preset parameters related to data popularity, such as the network click count, namely the first preset click count and the second preset click count, as well as parameters related to the sensitive keywords in the text title information, namely the first preset value and the second preset value. Suppose that in a practical application the first preset click count is set to 1000, the second preset click count is also 1000, the first preset value is 2, and the second preset value is 5. For a piece of data to be detected, its network click count is obtained first, and then the number of sensitive keywords in its text title information. If the network click count is greater than 1000 and the text title information includes 2 or more sensitive keywords, the data to be detected is suppressed and filtered. If the network click count is less than 1000 but the text title information includes 6 sensitive keywords, the data can likewise be filtered. In practice, the number of sensitive keywords in the text title information may also be obtained first and the network click count afterwards. The first preset click count, the second preset click count, the first preset value and the second preset value can also be set according to different scenarios and filtering needs; the invention does not limit this.
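The two-threshold rule of steps S108-1 and S108-2 can be written down directly using the example values above (1000, 1000, 2, 5). The sketch below is illustrative only: the class and field names are hypothetical, and it uses strict comparisons because the steps say "exceeds" and "below". The worked example's phrase "2 or more" could also be read as inclusive, in which case `>` on the keyword count would become `>=`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterRule:
    """Preset parameters for steps S108-1 and S108-2 (names are hypothetical)."""
    first_click_threshold: int = 1000    # first preset click count
    second_click_threshold: int = 1000   # second preset click count
    first_keyword_threshold: int = 2     # first preset value
    second_keyword_threshold: int = 5    # second preset value

    def should_filter(self, clicks: int, keyword_count: int) -> bool:
        # S108-1: popular data that contains some sensitive keywords.
        hot_with_some = (
            clicks > self.first_click_threshold
            and keyword_count > self.first_keyword_threshold
        )
        # S108-2: less popular data whose title is full of sensitive keywords.
        cold_with_many = (
            clicks < self.second_click_threshold
            and keyword_count > self.second_keyword_threshold
        )
        return hot_with_some or cold_with_many
```

With the default values, an item with 1500 clicks and 3 sensitive keywords is filtered, as is one with 500 clicks and 6 sensitive keywords, while an item with 500 clicks and 3 sensitive keywords is kept.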
Because the volume of network data is very large, detecting every piece of data one by one during filtering may involve a heavy workload and be inefficient. In a preferred embodiment, therefore, when selecting the data to be detected, the network click count of each piece of data in a preset database can be obtained first, the data can be sorted by network click count, and a hot data library can be generated from the data whose network click count falls within a preset range. Any piece of data in the hot data library is then selected as the data to be detected, its text title information is obtained, and it is judged whether the text title information includes a sensitive keyword in the preset keyword library.
As mentioned above, filtering of the data to be detected is based on its network click count and on the number of sensitive keywords included in its text title information. In the preferred embodiment, the data in the preset database can therefore first be sorted by network click count, the more popular data can be selected to build the hot data library, and the data in the hot data library can then be detected and filtered as the data to be detected. Because the network click count is itself one of the filtering criteria, and the network click count of each piece of data is already known once the hot data library has been built, the filtering process only needs to select data to be detected from the library and judge the number of sensitive keywords in its text title information. This further improves data filtering efficiency and saves detection time.
For example, the data in the preset database can be divided into tiers: the data whose network click count ranks in the Top 10 is obtained to build the hot data library. For hot data in the hot data library, data whose click count on the day is greater than 1000 is high-heat data; if its text title information also includes several sensitive keywords, the data to be detected can be judged as likely vulgar and will be filtered. Data whose network click count is below 1000 is low-heat data; if it nevertheless includes many sensitive keywords, it can likewise be judged as likely vulgar and also needs to be filtered.
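Building the hot data library by ranking can be sketched as below, assuming click counts are available as a mapping from a data identifier to its count. The function names, the identifier type, and treating a click count of exactly 1000 as low-heat are assumptions for illustration; the description only distinguishes "greater than 1000" from "below 1000".

```python
from typing import Dict, List, Tuple


def build_hot_data_library(
    click_counts: Dict[str, int], top_n: int = 10
) -> List[Tuple[str, int]]:
    """Sort data by network click count and keep the top_n entries.

    Each entry keeps its click count, so the later filtering step does not
    need to look it up again.
    """
    ranked = sorted(click_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def heat_level(clicks: int, threshold: int = 1000) -> str:
    """Classify an entry as high-heat or low-heat data.

    The boundary case (clicks == threshold) is treated as low-heat here,
    which is a choice made for this sketch.
    """
    return "high" if clicks > threshold else "low"
```

A call such as `build_hot_data_library(counts, top_n=10)` corresponds to the Top 10 example; a different preset range would change only the slice.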
As described above, after the text title information of the data to be detected is obtained, it is judged whether the text title information includes a sensitive keyword in the preset keyword library. The preset keyword library may be a keyword library built in advance. In a preferred embodiment of the invention, as shown in Fig. 2, the following steps may also be included before step S102:
Step S110: obtain sensitive keywords that have passed manual review and/or sensitive keywords extracted from the text title information of already filtered data;
Step S112: build the preset keyword library from these sensitive keywords.
The junk data that needs to be filtered from network data comes in many types. In embodiments of the invention, junk data of different types, such as vulgar or violent content, can be obtained separately, and different types of sensitive keywords can then be extracted from the text title information of data that has passed manual review. Words commonly used in junk information can also be obtained as sensitive keywords. Once these sensitive keywords have been obtained, the preset keyword library can be built. When stored, the sensitive keywords in the preset keyword library can be organized by type, or by how frequently each sensitive keyword is used; the invention does not limit this.
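One way to picture steps S110 and S112 is to merge the two keyword sources into a library grouped by type, with an optional frequency ordering. The sketch assumes the keywords arrive as `(type, keyword)` pairs; how keywords are extracted from titles of filtered data is not specified in the description, so that step is left outside the example. All names are hypothetical.

```python
from collections import Counter
from itertools import chain
from typing import Dict, Iterable, List, Set, Tuple

KeywordPair = Tuple[str, str]  # (type, keyword), e.g. ("violence", "brawl")


def build_keyword_library(
    reviewed: Iterable[KeywordPair],
    extracted: Iterable[KeywordPair],
) -> Dict[str, Set[str]]:
    """Steps S110/S112: merge manually reviewed and extracted keywords by type."""
    library: Dict[str, Set[str]] = {}
    for category, keyword in chain(reviewed, extracted):
        library.setdefault(category, set()).add(keyword)
    return library


def order_by_frequency(observed_keywords: Iterable[str]) -> List[str]:
    """Alternative storage order: most frequently used keywords first."""
    return [kw for kw, _ in Counter(observed_keywords).most_common()]
```

Grouping by type keeps the categories inspectable, while a frequency ordering could let a matcher try the most common keywords first; which layout is preferable depends on the deployment.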
Optionally, matching the text title information of the data to be detected against the sensitive keywords in the preset keyword library may include the following steps:
S1: segment the text title information into words to obtain at least one word that it includes;
S2: match each word against the sensitive keywords in the preset keyword library. If a word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library; if no word matches a sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
Word segmentation means cutting a sequence of Chinese characters into individual words. In embodiments of the invention, after the text title information of the data to be detected has been segmented, stop words can first be removed so that only words with actual meaning are retained. The remaining words are then matched one by one against the sensitive keywords in the preset keyword library. As long as one word matches a sensitive keyword in the preset keyword library, it can be judged that the text title information of the data to be detected includes a sensitive keyword; if none of the words match, it is judged that the text title information does not include a sensitive keyword.
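Steps S1 and S2, including stop-word removal, might look like the following sketch. Whitespace splitting stands in for real word segmentation here (Chinese titles would need a dedicated segmenter), and the stop-word list and function names are assumptions for the example.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; a real deployment would use a fuller one.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "and"}


def segment(title: str) -> List[str]:
    """S1: split the title into words (placeholder for a real segmenter)."""
    return title.lower().split()


def remove_stop_words(words: Iterable[str]) -> List[str]:
    """Keep only the words that carry actual meaning."""
    return [word for word in words if word not in STOP_WORDS]


def matched_keywords(title: str, library: Set[str]) -> List[str]:
    """S2: return every title word that is in the preset keyword library."""
    return [word for word in remove_stop_words(segment(title)) if word in library]


def contains_sensitive_keyword(title: str, library: Set[str]) -> bool:
    """True as soon as one word matches; False if no word matches."""
    return any(word in library for word in remove_stop_words(segment(title)))
```

The length of the list returned by `matched_keywords` gives the keyword count needed in step S104, while `contains_sensitive_keyword` answers the yes/no question of step S102.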
Based on the same inventive concept, embodiments of the invention also provide a data filtering device. As shown in Fig. 3, the data filtering device provided by embodiments of the invention may include:
a judgment module 310, configured to obtain the text title information of the data to be detected and judge whether the text title information includes a sensitive keyword in the preset keyword library;
a first obtaining module 320, configured to obtain the number of the sensitive keywords if the text title information includes a sensitive keyword in the preset keyword library;
a second obtaining module 330, configured to obtain the network click count of the data to be detected;
a filtering module 340, configured to filter the data to be detected based on its network click count and the number of sensitive keywords included in the text title information.
In a preferred embodiment of the invention, as shown in Fig. 4, the filtering module 340 may include:
a first filtering unit 341, configured to filter the data to be detected when its network click count exceeds the first preset click count and the number of sensitive keywords included in the text title information exceeds the first preset value; and/or
a second filtering unit 342, configured to filter the data to be detected when its network click count is below the second preset click count and the number of sensitive keywords included in the text title information exceeds the second preset value.
In a preferred embodiment of the invention, the judgment module 310 may be further configured to:
obtain the network click count of each piece of data in the preset database, sort the data by network click count, and generate a hot data library from the data whose network click count falls within a preset range;
select any piece of data in the hot data library as the data to be detected, and obtain its text title information;
judge whether the text title information includes a sensitive keyword in the preset keyword library.
In a preferred embodiment of the invention, the judgment module 310 may be further configured to:
before judging whether the text title information includes a sensitive keyword in the preset keyword library, obtain sensitive keywords that have passed manual review and/or sensitive keywords extracted from the text title information of already filtered data;
build the preset keyword library from the sensitive keywords.
In a preferred embodiment of the invention, the judgment module 310 may be further configured to:
segment the text title information into words to obtain at least one word that it includes;
match the word against the sensitive keywords in the preset keyword library;
when the word matches a sensitive keyword in the preset keyword library, judge that the text title information includes a sensitive keyword in the preset keyword library;
when the word matches no sensitive keyword in the preset keyword library, judge that the text title information does not include a sensitive keyword in the preset keyword library.
In a preferred embodiment of the invention, the data to be detected includes internet video data;
the judgment module 310 may be further configured to obtain the text title information of video data stored on a video server and/or of video data live-streamed by a streamer, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
In a preferred embodiment of the invention, the judgment module 310 may be further configured to obtain the text title information of the video data a user is currently watching, and judge whether the text title information includes a sensitive keyword in the preset keyword library.
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the data filtering method of any of the above embodiments.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, it causes the computing device to execute the data filtering method of any of the above embodiments.
The embodiments of the present invention provide a more efficient data filtering method and device that detect and filter data to be tested based on a combination of sensitive keywords and popularity. This not only directly filters out junk data such as violent or vulgar content, but also identifies in time data that is more deeply hidden and needs to be filtered, thereby improving the filtering efficiency for bad data and junk data and cleaning up the Internet environment. In addition, in the scheme provided by the embodiments of the present invention, hot data can be obtained based on the network click volume of the data in a preset database and the data to be tested can then be selected from it, and the predetermined keyword library can be constructed in advance from data or words that have passed manual review, which further saves detection time while improving the efficiency of data filtering.
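The combination of sensitive keywords and popularity can be illustrated with a minimal sketch of the two threshold conditions of the method: data whose network click volume exceeds a first preset click volume is filtered when its sensitive keyword quantity exceeds a first preset value, and data whose click volume is below a second preset click volume is filtered when its quantity exceeds a second preset value. The `Thresholds` class, the function name and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical preset values; the source gives no concrete numbers."""
    first_click_volume: int = 100_000
    first_keyword_count: int = 1
    second_click_volume: int = 1_000
    second_keyword_count: int = 3


def should_filter(clicks: int, keyword_count: int, t: Thresholds = Thresholds()) -> bool:
    """Apply the two threshold rules; either one is enough to filter the item."""
    hot_rule = clicks > t.first_click_volume and keyword_count > t.first_keyword_count
    cold_rule = clicks < t.second_click_volume and keyword_count > t.second_keyword_count
    return hot_rule or cold_rule
```

With these illustrative values, an item with 200,000 clicks and two sensitive keywords is filtered, while an item with 500 clicks is filtered only when it contains more than three.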
It will be apparent to those skilled in the art that, for the specific working processes of the systems, devices and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; for brevity, they are not repeated here.
In addition, the functional units in the embodiments of the present invention may be physically independent of one another, two or more functional units may be integrated together, or all functional units may be integrated in a single processing unit. The integrated functional units may be implemented in hardware, or in software or firmware.
Those of ordinary skill in the art will understand that, if the integrated functional unit is implemented in software and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that, when run, cause a computing device (such as a personal computer, a server or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
Alternatively, all or part of the steps of the foregoing method embodiments may be completed by hardware under the control of program instructions (for example, a computing device such as a personal computer, a server or a network device). The program instructions may be stored in a computer-readable storage medium, and when they are executed by the processor of the computing device, the computing device executes all or part of the steps of the methods of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that, within the spirit and principle of the present invention, the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the corresponding technical solutions to depart from the protection scope of the present invention.
According to one aspect of the embodiments of the present invention, A1. a data filtering method is provided, comprising:
Obtaining the caption information of data to be tested, and judging whether the caption information includes a sensitive keyword in a predetermined keyword library;
If the caption information includes a sensitive keyword in the predetermined keyword library, obtaining the quantity of the sensitive keywords;
Obtaining the network click volume of the data to be tested;
Filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
A2. The method according to A1, wherein the filtering of the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information comprises:
If the network click volume of the data to be tested exceeds a first preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a first preset value, filtering the data to be tested; and/or
If the network click volume of the data to be tested is lower than a second preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a second preset value, filtering the data to be tested.
A3. The method according to A1, wherein the obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprise:
Obtaining the network click volume of each data item in a preset database, sorting the data items by network click volume, and generating a hot data library from the data items whose network click volume falls within a preset range;
Selecting any data item in the hot data library as the data to be tested, and obtaining the caption information of the data to be tested;
Judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
A4. The method according to A3, wherein, before the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library, the method further comprises:
Obtaining sensitive keywords that have passed manual review and/or sensitive keywords extracted from the article title information of already-filtered data;
Constructing the predetermined keyword library based on the sensitive keywords.
A5. The method according to A3, wherein the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Segmenting the caption information to obtain at least one word included in the caption information;
Matching the word against the sensitive keywords in the predetermined keyword library;
If the word successfully matches a sensitive keyword in the predetermined keyword library, judging that the caption information includes a sensitive keyword in the predetermined keyword library;
If the word fails to match any sensitive keyword in the predetermined keyword library, judging that the caption information does not include a sensitive keyword in the predetermined keyword library.
A6. The method according to any one of A1-A5, wherein the data to be tested includes internet video data; the obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprise:
Obtaining the caption information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
A7. The method according to A6, wherein the obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library further comprise:
Obtaining the caption information of the video data that a user is currently watching, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
According to another aspect of the embodiments of the present invention, B8. a data filtering device is further provided, comprising:
A judgment module, configured to obtain the caption information of data to be tested and judge whether the caption information includes a sensitive keyword in a predetermined keyword library;
A first obtaining module, configured to obtain the quantity of the sensitive keywords if the caption information includes a sensitive keyword in the predetermined keyword library;
A second obtaining module, configured to obtain the network click volume of the data to be tested;
A filtering module, configured to filter the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
B9. The device according to B8, wherein the filtering module comprises:
A first filtering unit, configured to filter the data to be tested when the network click volume of the data to be tested exceeds a first preset click volume and the quantity of sensitive keywords included in the caption information exceeds a first preset value; and/or
A second filtering unit, configured to filter the data to be tested when the network click volume of the data to be tested is lower than a second preset click volume and the quantity of sensitive keywords included in the caption information exceeds a second preset value.
B10. The device according to B8, wherein the judgment module is further configured to:
Obtain the network click volume of each data item in a preset database, sort the data items by network click volume, and generate a hot data library from the data items whose network click volume falls within a preset range;
Select any data item in the hot data library as the data to be tested, and obtain the caption information of the data to be tested;
Judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
B11. The device according to B10, wherein the judgment module is further configured to:
Before judging whether the caption information includes a sensitive keyword in the predetermined keyword library, obtain sensitive keywords that have passed manual review and/or sensitive keywords extracted from the article title information of already-filtered data;
Construct the predetermined keyword library based on the sensitive keywords.
B12. The device according to B10, wherein the judgment module is further configured to:
Segment the caption information to obtain at least one word included in the caption information;
Match the word against the sensitive keywords in the predetermined keyword library;
When the word successfully matches a sensitive keyword in the predetermined keyword library, judge that the caption information includes a sensitive keyword in the predetermined keyword library;
When the word fails to match any sensitive keyword in the predetermined keyword library, judge that the caption information does not include a sensitive keyword in the predetermined keyword library.
B13. The device according to any one of B8-B12, wherein the data to be tested includes internet video data;
The judgment module is further configured to obtain the caption information of video data stored on a video server and/or of video data being live-streamed by a streamer, and to judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
B14. The device according to any one of B8-B12, wherein the judgment module is further configured to obtain the caption information of the video data that a user is currently watching, and to judge whether the caption information includes a sensitive keyword in the predetermined keyword library.
According to another aspect of the embodiments of the present invention, C15. a computer storage medium is further provided. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute the data filtering method of any one of A1-A7.
According to another aspect of the embodiments of the present invention, D16. a computing device is further provided, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, it causes the computing device to execute the data filtering method of any one of A1-A7.
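The cooperation of the judgment module, the first obtaining module, the second obtaining module and the filtering module of device B8 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class `FilterDevice`, its method names, the whitespace-based word splitting and the threshold values are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class FilterDevice:
    keyword_library: Set[str]
    click_lookup: Callable[[str], int]  # hypothetical click-volume source keyed by item id
    first_click_volume: int
    first_keyword_count: int
    second_click_volume: int
    second_keyword_count: int

    def judge(self, caption: str) -> List[str]:
        """Judgment module: words of the caption that are in the keyword library."""
        return [w for w in caption.lower().split() if w in self.keyword_library]

    def keyword_quantity(self, hits: List[str]) -> int:
        """First obtaining module: quantity of sensitive keywords found."""
        return len(hits)

    def click_volume(self, item_id: str) -> int:
        """Second obtaining module: network click volume of the item."""
        return self.click_lookup(item_id)

    def should_filter(self, item_id: str, caption: str) -> bool:
        """Filtering module: combine click volume and keyword quantity."""
        hits = self.judge(caption)
        if not hits:  # no sensitive keyword, nothing to filter
            return False
        count = self.keyword_quantity(hits)
        clicks = self.click_volume(item_id)
        hot = clicks > self.first_click_volume and count > self.first_keyword_count
        cold = clicks < self.second_click_volume and count > self.second_keyword_count
        return hot or cold


# Illustrative data and thresholds only.
clicks_by_id = {"v1": 500_000, "v2": 10}
device = FilterDevice(
    keyword_library={"violence", "gore"},
    click_lookup=lambda item_id: clicks_by_id[item_id],
    first_click_volume=100_000,
    first_keyword_count=0,
    second_click_volume=1_000,
    second_keyword_count=1,
)
```

Passing the click source in as a callable keeps the sketch independent of any particular database or statistics service.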
Claims (10)
1. A data filtering method, comprising:
Obtaining the caption information of data to be tested, and judging whether the caption information includes a sensitive keyword in a predetermined keyword library;
If the caption information includes a sensitive keyword in the predetermined keyword library, obtaining the quantity of the sensitive keywords;
Obtaining the network click volume of the data to be tested;
Filtering the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
2. The method according to claim 1, wherein the filtering of the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information comprises:
If the network click volume of the data to be tested exceeds a first preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a first preset value, filtering the data to be tested; and/or
If the network click volume of the data to be tested is lower than a second preset click volume, and the quantity of sensitive keywords included in the caption information exceeds a second preset value, filtering the data to be tested.
3. The method according to claim 1, wherein the obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprise:
Obtaining the network click volume of each data item in a preset database, sorting the data items by network click volume, and generating a hot data library from the data items whose network click volume falls within a preset range;
Selecting any data item in the hot data library as the data to be tested, and obtaining the caption information of the data to be tested;
Judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
4. The method according to claim 3, wherein, before the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library, the method further comprises:
Obtaining sensitive keywords that have passed manual review and/or sensitive keywords extracted from the article title information of already-filtered data;
Constructing the predetermined keyword library based on the sensitive keywords.
5. The method according to claim 3, wherein the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprises:
Segmenting the caption information to obtain at least one word included in the caption information;
Matching the word against the sensitive keywords in the predetermined keyword library;
If the word successfully matches a sensitive keyword in the predetermined keyword library, judging that the caption information includes a sensitive keyword in the predetermined keyword library;
If the word fails to match any sensitive keyword in the predetermined keyword library, judging that the caption information does not include a sensitive keyword in the predetermined keyword library.
6. The method according to any one of claims 1-5, wherein the data to be tested includes internet video data;
The obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library comprise:
Obtaining the caption information of video data stored on a video server and/or of video data being live-streamed by a streamer, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
7. The method according to claim 6, wherein the obtaining of the caption information of the data to be tested and the judging of whether the caption information includes a sensitive keyword in the predetermined keyword library further comprise:
Obtaining the caption information of the video data that a user is currently watching, and judging whether the caption information includes a sensitive keyword in the predetermined keyword library.
8. A data filtering device, comprising:
A judgment module, configured to obtain the caption information of data to be tested and judge whether the caption information includes a sensitive keyword in a predetermined keyword library;
A first obtaining module, configured to obtain the quantity of the sensitive keywords if the caption information includes a sensitive keyword in the predetermined keyword library;
A second obtaining module, configured to obtain the network click volume of the data to be tested;
A filtering module, configured to filter the data to be tested based on the network click volume of the data to be tested and the quantity of sensitive keywords included in the caption information.
9. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the data filtering method according to any one of claims 1-7.
10. A computing device, comprising:
A processor;
A memory storing computer program code;
When the computer program code is run by the processor, it causes the computing device to execute the data filtering method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811313297.1A CN109657134A (en) | 2018-11-06 | 2018-11-06 | A kind of data filtering method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811313297.1A CN109657134A (en) | 2018-11-06 | 2018-11-06 | A kind of data filtering method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109657134A true CN109657134A (en) | 2019-04-19 |
Family
ID=66110134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811313297.1A Pending CN109657134A (en) | 2018-11-06 | 2018-11-06 | A kind of data filtering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109657134A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160350282A1 (en) * | 2014-02-25 | 2016-12-01 | Tencent Technology (Shenzhen) Company Limited | Sensitive text detecting method and apparatus |
CN106445998A (en) * | 2016-05-26 | 2017-02-22 | 达而观信息科技(上海)有限公司 | Text content auditing method and system based on sensitive word |
CN106776946A (en) * | 2016-12-02 | 2017-05-31 | 重庆大学 | A kind of detection method of fraudulent website |
CN108153723A (en) * | 2017-12-27 | 2018-06-12 | 北京百度网讯科技有限公司 | Hot spot information comment generation method, device and terminal device |
CN108304379A (en) * | 2018-01-15 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of article recognition methods, device and storage medium |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110767211A (en) * | 2019-09-23 | 2020-02-07 | 浙江从泰网络科技有限公司 | Voice synthesis broadcasting system based on text content data cleaning |
CN110767211B (en) * | 2019-09-23 | 2022-02-18 | 浙江斑智科技有限公司 | Voice synthesis broadcasting system based on text content data cleaning |
CN110971619A (en) * | 2020-01-02 | 2020-04-07 | 惠州学院 | Network technology security system and method with bad information filtering processing |
CN111586421A (en) * | 2020-01-20 | 2020-08-25 | 全息空间(深圳)智能科技有限公司 | Method, system and storage medium for auditing live broadcast platform information |
CN113891120A (en) * | 2021-09-29 | 2022-01-04 | 广东省高峰科技有限公司 | IPTV service terminal access method and system thereof |
CN114840776A (en) * | 2022-07-04 | 2022-08-02 | 北京拓普丰联信息科技股份有限公司 | Method, device, electronic equipment and storage medium for recording data publishing source |
CN114840776B (en) * | 2022-07-04 | 2022-09-20 | 北京拓普丰联信息科技股份有限公司 | Method, device, electronic equipment and storage medium for recording data publishing source |
Similar Documents
Publication | Title |
---|---|
CN109657134A | A kind of data filtering method and device |
CN102279875B | Method and device for identifying fishing website |
Myers et al. | Information diffusion and external influence in networks |
CN110233849B | Method and system for analyzing network security situation |
US9300755B2 | System and method for determining information reliability |
US10032081B2 | Content-based video representation |
CN109684483B | Knowledge graph construction method and device, computer equipment and storage medium |
CN105072089B | A kind of WEB malice scanning behavior method for detecting abnormality and system |
US9600530B2 | Updating a search index used to facilitate application searches |
TWI498752B | Extracting information from unstructured data and mapping the information to a structured schema using the naive bayesian probability model |
Hristakieva et al. | The spread of propaganda by coordinated communities on social media |
CN108334758B | Method, device and equipment for detecting user unauthorized behavior |
US8788925B1 | Authorized syndicated descriptions of linked web content displayed with links in user-generated content |
CN105183781B | Information recommendation method and device |
CN105488023B | A kind of text similarity appraisal procedure and device |
Middleton et al. | Geoparsing and geosemantics for social media: Spatiotemporal grounding of content propagating rumors to support trust and veracity analysis during breaking news |
CN112464036B | Method and device for auditing violation data |
CN109376231A | A kind of media hotspot tracking and system |
CN109190014A | A kind of regular expression generation method, device and electronic equipment |
US8171020B1 | Spam detection for user-generated multimedia items based on appearance in popular queries |
CN104331490B | network data processing method and device |
EP2680210A1 | Method and system for cross-platform content recommendation |
Fletcher et al. | Practical web traffic analysis: standards, privacy, techniques, and results |
WO2020007367A1 | Method for inspecting abnormal web access, device, medium, and equipment |
CN113918435B | Method and device for determining risk level of application program and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190419 |