CN112364216A - Edge node content auditing and filtering system and method - Google Patents

Edge node content auditing and filtering system and method

Info

Publication number
CN112364216A
Authority
CN
China
Prior art keywords
sensitive
data
request
module
auditing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011321876.8A
Other languages
Chinese (zh)
Inventor
肖何
王金高
唐雅琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jingxin Network Technology Co ltd
Original Assignee
Shanghai Jingxin Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jingxin Network Technology Co ltd filed Critical Shanghai Jingxin Network Technology Co ltd
Priority to CN202011321876.8A priority Critical patent/CN112364216A/en
Publication of CN112364216A publication Critical patent/CN112364216A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a system and a method for auditing and filtering content at edge nodes, and belongs to the technical field of the Internet. The edge node content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module. The method for auditing and filtering content at an edge node comprises the following steps: S1, collecting data at the edge node for transmission and storage; S2, screening, checking and sorting the data; S3, auditing sensitive characteristics and outputting the result, with the request being filtered or intercepted; S4, performing health detection of the edge node network environment by using a naive Bayes model according to historical data; and S5, adjusting the sensitive characteristics according to the detection result to purify the network environment. The invention improves auditing efficiency, reduces wasted resources, supports system self-checking, and keeps the user's network environment healthy.

Description

Edge node content auditing and filtering system and method
Technical Field
The invention relates to the technical field of Internet, in particular to a system and a method for auditing and filtering contents of edge nodes.
Background
With the rapid development of Internet technology, the number of Internet users keeps growing and trending topics emerge constantly. While the network brings convenience to users, it is also flooded with sensitive information, such as graphic violence, politically sensitive speech, pornography, gambling, fraud and advertising. Therefore, how to filter the information that users request or upload, so as to create a healthy and safe network environment, has become an important task in current Internet technology.
In a common current scheme, the system filters sensitive information that users upload or receive out of a large amount of user data according to blacklist rules, then identifies the user and imposes a ban. In addition, during content auditing, the large volume of data is not classified and there is no rule for priority auditing, so users cannot receive trending information promptly and experience delay. Meanwhile, the system has no detection mechanism and no clear picture of the current network environment, and its control of sensitive characteristics is ambiguous, which seriously affects the user experience and the normal operation of the network environment.
Therefore, an accurate and efficient content auditing and filtering system is urgently needed, and the system can accurately reflect the current state of the network environment.
Disclosure of Invention
The present invention is directed to a system and a method for auditing and filtering contents of edge nodes, so as to solve the problems in the background art.
In order to solve the technical problems, the invention provides the following technical scheme:
an edge node content auditing and filtering system, characterized by: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;
the output end of the data acquisition module is electrically connected with the input ends of the data analysis module and the database; the output end of the data analysis module is electrically connected with the input ends of the database and the data auditing module; the output end of the data auditing module is electrically connected with the input ends of the detection module and the output module; the output end of the detection module is electrically connected with the input end of the database; the output end of the database is electrically connected with the input end of the updating module; the output end of the updating module is electrically connected with the input end of the data analysis module.
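The connections described above can be sketched as a small directed graph. This is only an illustration: the module identifiers are hypothetical names chosen here, and the sketch assumes that the data auditing module feeds the inputs of the detection module and the output module.

```python
from collections import deque

# Hypothetical identifiers for the seven modules; each key lists the modules
# whose inputs are fed by that module's output, as read from the description.
WIRING = {
    "data_acquisition": ["data_analysis", "database"],
    "data_analysis": ["database", "data_audit"],
    "data_audit": ["detection", "output"],  # assumption: audit feeds these inputs
    "detection": ["database"],
    "database": ["update"],
    "update": ["data_analysis"],
    "output": [],
}


def reachable(start: str) -> set[str]:
    """Return every module that data leaving `start` can eventually reach."""
    seen: set[str] = set()
    queue = deque(WIRING[start])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(WIRING[node])
    return seen


if __name__ == "__main__":
    # The detection module sits on a feedback loop:
    # detection -> database -> update -> data_analysis -> data_audit -> detection
    print(sorted(reachable("detection")))
```

Under this reading, the detection result flows back through the database and the updating module into the data analysis module, which is the adjustment loop the method relies on in steps S4-S5.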
According to the technical scheme, the data acquisition module comprises a historical data acquisition unit and a real-time data acquisition unit;
the historical data acquisition unit acquires data under a normal historical network environment and stores the data in a database, and the real-time data acquisition unit acquires a request sent by a user in real time and transmits the request to the data analysis module.
According to the technical scheme, the data analysis module comprises a scheduling sorting factor unit and an access path recording unit;
the scheduling sorting factor unit uses four factors: upload count, download count, comment count and search count. For the i-th request, the upload count is denoted a_i, the download count b_i, the comment count c_i and the search count d_i, forming the sets a = {a_1, a_2, a_3, …, a_n}, b = {b_1, b_2, b_3, …, b_n}, c = {c_1, c_2, c_3, …, c_n} and d = {d_1, d_2, d_3, …, d_n}, where a_i, b_i, c_i and d_i are all constant terms;
according to the formula:
K_i = a_i + b_i + c_i + d_i - M
where K_i is the hot-event index value and M is the hot-event index threshold;
when K_i > 0, the event is a hot event;
all K_i are sorted from largest to smallest to form the set K = {K_1, K_2, K_3, …, K_n}, where K_1 ≥ K_2 ≥ K_3 ≥ … ≥ K_n, and the corresponding data is transmitted to the data auditing module in that order;
and the access path recording unit associates each access path with the user who sent the request, and records and stores it in the database.
According to the technical scheme, the data auditing module comprises a sensitive text information auditing unit and a sensitive image information auditing unit;
the sensitive text information auditing unit comprises a sensitive vocabulary library and a text detection unit; the sensitive image information auditing unit comprises a sensitive image library and an image detection unit.
According to the technical scheme, the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;
the sensitive vocabulary detection compares the request against the sensitive vocabulary library; if a sensitive word is found, the request is labeled with the sensitive word, and the user request is then filtered or intercepted;
the homophone detection reads the text of the user request aloud, records the resulting sound-wave image, and compares it with the sound-wave images of the sensitive words in the sensitive vocabulary library; when the comparison matches, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word has another meaning in context; if it has no other meaning, it is judged to be a homophone sensitive word, and the user request is filtered or intercepted;
the similar-character detection performs stroke coding on the text of the user request: horizontal, vertical, left-falling, dot and turning strokes are each assigned a code letter, and each character is coded according to the order and number of its strokes; for example, a cross-shaped character, written with one horizontal and one vertical stroke, is coded as AB. The coded text is compared with the codes of the sensitive words, with the similarity threshold set to N; when the similarity value of the two codes is not less than N, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word fits the context; if it does not fit, it is judged to be a similar-shaped sensitive word, and the user request is filtered or intercepted.
According to the technical scheme, the sensitive image library is used for storing and updating sensitive image data; the image detection unit comprises face recognition detection and body privacy part exposure detection;
the face recognition detection compares faces in the request with the sensitive face images in the sensitive image library; for example, when a face matches a person involved in a political event, the user request is filtered or intercepted;
and the body privacy part exposure detection extracts frames from the image for detection; if most of a private body part is found to be exposed, the user request is filtered or intercepted.
According to the technical scheme, the updating module comprises an updating unit and a replacing unit;
when the updating unit receives a system adjustment signal, it retrieves new sensitive vocabulary from the database for updating;
and the replacing unit replaces the old sensitive vocabulary with the new sensitive vocabulary and transmits it to the data analysis module.
According to the technical scheme, the output module comprises a filtering unit and an intercepting unit;
the filtering unit handles GET requests: after the user request has been audited by the data auditing module, it filters the sensitive data out of the request;
the intercepting unit handles POST requests: after the user request has been audited by the data auditing module, it intercepts the sensitive data of the request.
A method for auditing and filtering contents of an edge node, characterized in that the method comprises the following steps:
S1, collecting the real-time requests of users and the historical data of the network environment at the edge node, and storing the data in a database;
S2, retrieving a user's real-time request, screening hot events according to the scheduling sorting factors, and placing them in the audit queue with priority;
S3, performing a sensitive characteristic audit on the real-time request and outputting the result, filtering for a user's GET request and intercepting for a POST request;
S4, performing health detection of the edge node network environment by using a naive Bayes model according to historical data;
and S5, adjusting the network environment by strengthening the sensitive information characteristics according to the detection result, ensuring that the user has a healthy network environment.
According to the technical scheme, in steps S4-S5 the health of the network environment is judged with a naive Bayes model; specifically, the posterior probability is derived from the prior probability based on the accumulated original data, which makes the result more accurate;
according to the formula:
P(h/L) = P(L/h) * P(h) / P(L)
where h is a sensitive request, L is the real-time input request, P(h/L) is the probability that the real-time input request is a sensitive request, P(h) is the historical probability of a sensitive request in a healthy network environment, P(L/h) is the probability that the real-time input request appears among sensitive requests, and P(L) is the probability of the real-time request being input; since the real-time request has already been input in this scheme, P(L) = 1;
the real-time input request L is decomposed into a set of independent conditions Y = {L_1, L_2, L_3, …, L_n};
according to the formula:
P(L/h) = P(L_1/h) * P(L_2/h) * P(L_3/h) * … * P(L_n/h);
where P(L_n/h) is the probability that the independent condition L_n occurs in a sensitive request;
according to the formula:
P(h/L) = P(h) * P(L_1/h) * P(L_2/h) * P(L_3/h) * … * P(L_n/h);
P(h/L) is calculated, and a network environment health threshold X is calculated from the historical healthy network environment data; when P(h/L) > X, the network environment is unhealthy, and a signal is transmitted to the database for system adjustment.
Compared with the prior art, the invention has the following beneficial effects:
1. content auditing and filtering are performed at the edge nodes, which increases auditing speed, lowers hardware requirements, reduces bandwidth demand and meets user needs promptly, a considerable advantage over prior content auditing and filtering systems;
2. during auditing, homophones are audited by sound-wave comparison, which improves auditing accuracy, prevents malicious users from receiving or publishing sensitive information by exploiting characters that sound the same as sensitive words but are written differently, and greatly improves the comprehensiveness of sensitive-information auditing; in addition, stroke coding is used to audit characters of similar shape, which prevents malicious users from expressing sensitive information through look-alike characters and harming the network environment;
3. a naive Bayes model is added to the system: a threshold for a normal, healthy network environment is established from the collected original data, the real-time request is decomposed into independent conditions, the posterior probability is derived from the prior probability, and the probability that the real-time request is a sensitive request is used as the measure of whether the network environment is healthy, so that stable and normal operation of the system is maintained.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of a module structure of an edge node content auditing and filtering system according to the present invention;
FIG. 2 is a schematic diagram illustrating the steps of a method for auditing and filtering the content of an edge node according to the present invention;
FIG. 3 is a flow chart of a system and method for edge node content audit filtering in accordance with the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-3, the present invention provides the following technical solutions:
in this embodiment, an edge node content auditing and filtering system is characterized in that: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;
the output end of the data acquisition module is electrically connected with the input ends of the data analysis module and the database; the output end of the data analysis module is electrically connected with the input ends of the database and the data auditing module; the output end of the data auditing module is electrically connected with the input ends of the detection module and the output module; the output end of the detection module is electrically connected with the input end of the database; the output end of the database is electrically connected with the input end of the updating module; the output end of the updating module is electrically connected with the input end of the data analysis module.
According to the technical scheme, the data acquisition module comprises a historical data acquisition unit and a real-time data acquisition unit;
the historical data acquisition unit acquires data under a normal historical network environment and stores the data in a database, and the real-time data acquisition unit acquires a request sent by a user in real time and transmits the request to the data analysis module.
According to the technical scheme, the data analysis module comprises a scheduling sorting factor unit and an access path recording unit;
the scheduling sorting factor unit uses the upload count, denoted a_i; the download count, denoted b_i; the comment count, denoted c_i; and the search count, denoted d_i.
The data acquisition module acquires 5 real-time requests, with the following information:
request 1: upload count a_1 = 100, download count b_1 = 20, comment count c_1 = 7, search count d_1 = 500;
request 2: upload count a_2 = 60, download count b_2 = 10, comment count c_2 = 1, search count d_2 = 240;
request 3: upload count a_3 = 70, download count b_3 = 20, comment count c_3 = 10, search count d_3 = 160;
request 4: upload count a_4 = 140, download count b_4 = 120, comment count c_4 = 600, search count d_4 = 6500;
request 5: upload count a_5 = 50, download count b_5 = 6, comment count c_5 = 7, search count d_5 = 80;
the hot-event index threshold is set to M = 300;
according to the formula:
K_i = a_i + b_i + c_i + d_i - M
where K_i is the hot-event index value, the K_i value of each request is obtained: K_1 = 327; K_2 = 11; K_3 = -40; K_4 = 7060; K_5 = -157;
requests 1, 2 and 4 are therefore judged to be hot events;
all K_i are sorted from largest to smallest to form the audit sequence K_4, K_1, K_2, K_3, K_5, and the corresponding requests are transmitted to the data auditing module in that order;
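The hot-event index calculation in this embodiment can be reproduced with a few lines of Python. The sketch below uses the five requests and the threshold M = 300 given above; the class and function names are illustrative choices, not taken from the patent.

```python
from dataclasses import dataclass

M = 300  # hot-event index threshold used in the embodiment


@dataclass(frozen=True)
class Request:
    rid: int
    uploads: int    # a_i
    downloads: int  # b_i
    comments: int   # c_i
    searches: int   # d_i

    def hot_index(self, threshold: int = M) -> int:
        """K_i = a_i + b_i + c_i + d_i - M."""
        return self.uploads + self.downloads + self.comments + self.searches - threshold


# The five real-time requests from the embodiment.
REQUESTS = [
    Request(1, 100, 20, 7, 500),
    Request(2, 60, 10, 1, 240),
    Request(3, 70, 20, 10, 160),
    Request(4, 140, 120, 600, 6500),
    Request(5, 50, 6, 7, 80),
]


def hot_events(requests: list[Request]) -> list[int]:
    """Requests whose index is positive are hot events."""
    return [r.rid for r in requests if r.hot_index() > 0]


def audit_order(requests: list[Request]) -> list[Request]:
    """Sort by K_i from largest to smallest to obtain the audit sequence."""
    return sorted(requests, key=lambda r: r.hot_index(), reverse=True)


if __name__ == "__main__":
    for r in audit_order(REQUESTS):
        print(f"request {r.rid}: K = {r.hot_index()}")
```

Because the same threshold M is subtracted from every request, it does not change the ordering; it only decides which requests count as hot events.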
according to the technical scheme, the data auditing module comprises a sensitive text information auditing unit and a sensitive image information auditing unit;
the sensitive text information auditing unit comprises a sensitive vocabulary library and a text detection unit; the sensitive image information auditing unit comprises a sensitive image library and an image detection unit;
the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;
the data acquisition module acquires that 4 of 5 real-time requests are text information and 1 is image information, and the information is as follows:
request 1: "search: is the user of the refrigerator held on his own?";
request 2: "ask: how long can bananas generally be preserved?";
request 3: "upload: TV drama 'West Yong Ji'.";
request 4: "upload: you are a white eat.".
The sensitive vocabulary detection compares the 4 requests against the sensitive vocabulary library and finds no sensitive vocabulary in them;
the homophone detection reads the 4 requests aloud in turn, records their sound-wave images, and compares them with the sound-wave images of the sensitive words in the sensitive vocabulary library; the phrase "white eat" in request 4 is found to match the sensitive word "dementia", so the request is labeled with the sensitive word; judgment with an intelligent input method finds that the phrase is used to express "dementia" in the current context, so it is judged to be a homophone sensitive word, and request 4 is intercepted;
the similar-character detection performs stroke coding on the 4 request texts, with horizontal strokes marked A, vertical strokes B, left-falling strokes C, dots D and turning strokes E, and codes each character according to the order and number of its strokes; the word "user" in request 1 is found to be coded as "DEAC", while the sensitive word "dead" is coded as "EAC"; the coding similarity threshold is set to N = 1, and because the similarity value of the two codes is not less than 1, request 1 is labeled with the sensitive word; judgment with an intelligent input method finds that the sentence reads smoothly and contains no similar-shaped sensitive word, so request 1 is output normally.
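The stroke-code comparison can be illustrated as follows. The text does not define how the similarity value is computed, so this sketch assumes one plausible measure: the length of the longest common subsequence of the two codes divided by the length of the sensitive word's code. Under that assumption the codes "DEAC" and "EAC" from the embodiment score exactly 1, which meets the threshold N = 1. All function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two stroke codes."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def stroke_similarity(candidate: str, sensitive: str) -> float:
    """Assumed measure: share of the sensitive code's strokes found, in order, in the candidate."""
    if not sensitive:
        return 0.0
    return lcs_length(candidate, sensitive) / len(sensitive)


def needs_context_check(candidate: str, sensitive: str, n: float = 1.0) -> bool:
    """Label the request for the follow-up context judgment when similarity >= N."""
    return stroke_similarity(candidate, sensitive) >= n


if __name__ == "__main__":
    # Codes from the embodiment: A horizontal, B vertical, C left-falling, D dot, E turning.
    print(stroke_similarity("DEAC", "EAC"))
    print(needs_context_check("DEAC", "EAC"))
```

A match at this stage only labels the request; as in the embodiment, the final decision still depends on whether the labeled word fits its context.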
According to the technical scheme, the sensitive image library is used for storing and updating sensitive image data; the image detection unit comprises face recognition detection and body privacy part exposure detection;
the data acquisition module finds that 1 of the 5 real-time requests is image information; face recognition detection and body privacy part exposure detection find no sensitive features in it, so request 5 is output normally.
According to the technical scheme, the updating module comprises an updating unit and a replacing unit; when the updating unit receives a system adjustment signal, it retrieves new sensitive vocabulary from the database for updating; the replacing unit replaces the old sensitive vocabulary with the new sensitive vocabulary and transmits it to the data analysis module.
According to the technical scheme, the output module comprises a filtering unit and an intercepting unit;
the filtering unit handles GET requests: after the user request has been audited by the data auditing module, it filters the sensitive data out of the request;
the intercepting unit handles POST requests: after the user request has been audited by the data auditing module, it intercepts the sensitive data of the request.
The invention provides a method for auditing and filtering contents of edge nodes, characterized in that the method comprises the following steps:
S1, collecting the real-time requests of users and the historical data of the network environment at the edge node, and storing the data in a database;
S2, retrieving a user's real-time request, screening hot events according to the scheduling sorting factors, and placing them in the audit queue with priority;
S3, performing a sensitive characteristic audit on the real-time request and outputting the result, filtering for a user's GET request and intercepting for a POST request;
S4, performing health detection of the edge node network environment by using a naive Bayes model according to historical data;
and S5, adjusting the network environment by strengthening the sensitive information characteristics according to the detection result, ensuring that the user has a healthy network environment.
According to the technical scheme, in steps S4-S5 the health of the network environment is judged with a naive Bayes model; specifically, the posterior probability is derived from the prior probability based on the accumulated original data, which makes the result more accurate;
according to the formula:
P(h/L) = P(L/h) * P(h) / P(L)
where h is a sensitive request, L is the real-time input request, P(h/L) is the probability that the real-time input request is a sensitive request, P(h) is the historical probability of a sensitive request in a healthy network environment, P(L/h) is the probability that the real-time input request appears among sensitive requests, and P(L) is the probability of the real-time request being input; since the real-time request has already been input in this scheme, P(L) = 1;
in this embodiment, 100 random data items are extracted from the historical healthy network environment, of which 5 are sensitive data, so P(h) = 5/100 = 0.05;
10000 condition factors are extracted from the sensitive data in the healthy network environment, and the real-time input request L is decomposed into the conditions Y = {L_1, L_2, L_3, L_4, L_5}; condition factor L_1 occurs 9000 times in the sensitive data; condition factor L_2 occurs 8000 times; condition factor L_3 occurs 7000 times; condition factor L_4 occurs 8000 times; condition factor L_5 occurs 8000 times;
according to the formula:
P(L/h) = P(L_1/h) * P(L_2/h) * P(L_3/h) * P(L_4/h) * P(L_5/h) = (9000/10000) * (8000/10000) * (7000/10000) * (8000/10000) * (8000/10000) = 0.32256
according to the formula:
P(h/L) = P(h) * P(L/h) = 0.05 * 0.32256 = 0.016128
the network environment health threshold calculated from the historical healthy network environment is X = 0.002; since 0.016128 > 0.002, the network environment is unhealthy at this moment, and the detection module transmits a signal to the database for system adjustment.
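The arithmetic of this embodiment can be checked with a short script. The sketch below plugs in the figures given above (5 sensitive items out of 100, condition-factor counts out of 10000, threshold X = 0.002) and applies the naive Bayes product with P(L) = 1; the function and variable names are illustrative only.

```python
from math import prod


def sensitive_posterior(prior: float, likelihoods: list[float], evidence: float = 1.0) -> float:
    """P(h/L) = P(h) * P(L_1/h) * ... * P(L_n/h) / P(L), with P(L) = 1 in this scheme."""
    return prior * prod(likelihoods) / evidence


# Figures from the embodiment.
PRIOR = 5 / 100                            # P(h): 5 sensitive items among 100 samples
TOTAL_FACTORS = 10_000                     # condition factors extracted from sensitive data
COUNTS = [9000, 8000, 7000, 8000, 8000]    # occurrences of L_1 .. L_5
LIKELIHOODS = [count / TOTAL_FACTORS for count in COUNTS]
X = 0.002                                  # network environment health threshold

posterior = sensitive_posterior(PRIOR, LIKELIHOODS)
unhealthy = posterior > X

if __name__ == "__main__":
    print(f"P(h/L) = {posterior:.6f}")  # prints P(h/L) = 0.016128
    print("unhealthy" if unhealthy else "healthy")
```

For requests decomposed into many conditions, the product of probabilities becomes very small; summing logarithms instead is a common way to keep the comparison with X numerically stable, although the patent text does not discuss this.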
The working principle of the invention is as follows: content auditing and filtering are performed at the edge nodes, which greatly increases auditing speed, lowers hardware requirements, reduces bandwidth demand and meets user needs promptly. The data acquisition module collects historical data and real-time data, and the database stores the data and serves it on request. The data analysis module sorts the data by priority, the data auditing module audits sensitive characteristics, and the output module filters or intercepts user requests. The detection module verifies the health and stability of the system based on a naive Bayes model, and when the network environment is unhealthy the updating module updates and replaces the data in time, ensuring stable operation of the system;
during auditing, homophones are audited by sound-wave comparison, which improves auditing accuracy, prevents malicious users from receiving or publishing sensitive information by exploiting characters that sound the same as sensitive words but are written differently, and greatly improves the comprehensiveness of sensitive-information auditing; in addition, stroke coding is used to audit characters of similar shape, which prevents malicious users from expressing sensitive information through look-alike Chinese characters and harming the network environment.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An edge node content auditing and filtering system, characterized by: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;
the output end of the data acquisition module is electrically connected with the input ends of the data analysis module and the database; the output end of the data analysis module is electrically connected with the input ends of the database and the data auditing module; the output end of the data auditing module is electrically connected with the input ends of the detection module and the output module; the output end of the detection module is electrically connected with the input end of the database; the output end of the database is electrically connected with the input end of the updating module; the output end of the updating module is electrically connected with the input end of the data analysis module.
2. The edge node content auditing and filtering system of claim 1, characterized by: the data acquisition module comprises a historical data acquisition unit and a real-time data acquisition unit;
the historical data acquisition unit acquires historical data from a normal network environment and stores the data in the database, and the real-time data acquisition unit acquires requests sent by users in real time and transmits them to the data analysis module.
3. The edge node content auditing and filtering system of claim 1, characterized by: the data analysis module comprises a scheduling sorting factor unit and an access path recording unit;
the scheduling sorting factor unit comprises an upload amount, a download amount, an evaluation amount and a search amount, where the upload amount is denoted a_i, the download amount b_i, the evaluation amount c_i and the search amount d_i, forming the sets a = {a_1, a_2, a_3, …, a_n}, b = {b_1, b_2, b_3, …, b_n}, c = {c_1, c_2, c_3, …, c_n} and d = {d_1, d_2, d_3, …, d_n}, in which a_i, b_i, c_i and d_i are all constant terms;
according to the formula:
K_i = a_i + b_i + c_i + d_i - M
wherein K_i is the hotspot event index value and M is the hotspot event index threshold;
when K_i > 0, the event is a hot event;
all K_i are sorted from largest to smallest to form the set K = {K_1, K_2, K_3, …, K_n}, in which K_1 ≥ K_2 ≥ K_3 ≥ … ≥ K_n, and the corresponding data are transmitted to the data auditing module in the order of the set;
and the access path recording unit associates each access path with the user who sent the request, and records and stores the access path in the database.
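As an illustration only, the hotspot index and ordering described in claim 3 might be sketched as follows. The `Event` class, its field names and the function names are hypothetical and do not come from the patent; only the formula for the index (the sum of the four amounts minus the threshold M) and the largest-to-smallest ordering are taken from the claim.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Counters for one piece of requested content (field names are hypothetical)."""
    name: str
    uploads: int      # upload amount a_i
    downloads: int    # download amount b_i
    evaluations: int  # evaluation amount c_i
    searches: int     # search amount d_i


def hotspot_index(event: Event, threshold: int) -> int:
    """Hotspot event index: K_i = a_i + b_i + c_i + d_i - M."""
    return (event.uploads + event.downloads
            + event.evaluations + event.searches - threshold)


def audit_order(events: list[Event], threshold: int) -> list[tuple[Event, int, bool]]:
    """Return (event, K_i, is_hot) tuples sorted by K_i from largest to smallest.

    An event counts as a hot event when K_i > 0.
    """
    scored = [(e, hotspot_index(e, threshold)) for e in events]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(e, k, k > 0) for e, k in scored]


if __name__ == "__main__":
    sample = [Event("a", 1, 2, 3, 4), Event("b", 10, 20, 30, 40)]
    for event, k, hot in audit_order(sample, threshold=10):
        print(event.name, k, "hot" if hot else "not hot")
```

In this sketch every event is forwarded in sorted order, and the hot flag is kept alongside it so that a downstream auditing step could treat hot events differently if desired.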
4. The edge node content auditing and filtering system of claim 1, characterized by: the data auditing module comprises a sensitive text information auditing unit and a sensitive image information auditing unit;
the sensitive text information auditing unit comprises a sensitive vocabulary library and a text detection unit; the sensitive image information auditing unit comprises a sensitive image library and an image detection unit.
5. The edge node content auditing and filtering system of claim 4, characterized by: the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;
the sensitive vocabulary detection compares the text of the user request against the sensitive vocabulary library; if a sensitive word is present, the request is labeled with the sensitive word, and the user request is then filtered or intercepted;
the homophone detection reads the text of the user request aloud, records the resulting sound-wave image, and compares it with the sound-wave images of the sensitive words in the sensitive vocabulary library; where the comparison is consistent, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word has another meaning; if it has no other meaning, it is judged to be a homophone of a sensitive word, and the user request is then filtered or intercepted;
the similar character detection performs stroke coding on the text of the user request: each basic stroke type (horizontal, vertical, left-falling, dot and turning) is assigned a code letter, and each character is coded according to the order and number of its strokes; for example, the cross-shaped character 十 (a horizontal stroke followed by a vertical stroke) is coded AB; the coded text is compared with the codes of the sensitive vocabulary, with a similarity threshold set to N; when the similarity of the two codes is not less than N, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word is appropriate in its context; if it is not appropriate, it is judged to be a visually similar sensitive word, and the user request is then filtered or intercepted.
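A minimal sketch of the stroke-coding comparison in claim 5 is given below, under several assumptions. The claim only shows that a horizontal stroke followed by a vertical stroke yields AB; the letters C, D and E, the tiny hand-written `STROKE_CODES` table and all function names are illustrative. The claim also does not say how the similarity value is computed, so the ratio from Python's `difflib.SequenceMatcher` is swapped in here as one possible measure.

```python
from difflib import SequenceMatcher

# Assumed stroke letters: A = horizontal and B = vertical follow the example in
# the claim; C = left-falling, D = dot and E = turning are illustrative choices.
# The table is a tiny hand-written sample, not a real stroke dictionary.
STROKE_CODES = {
    "十": "AB",   # horizontal, vertical
    "土": "ABA",  # horizontal, vertical, horizontal
    "士": "ABA",  # horizontal, vertical, horizontal
    "干": "AAB",  # horizontal, horizontal, vertical
    "千": "CAB",  # left-falling, horizontal, vertical
}


def stroke_code(text: str) -> str:
    """Concatenate the stroke codes of each character.

    Characters missing from the table are kept as-is, so they can only match
    an identical character.
    """
    return "".join(STROKE_CODES.get(ch, ch) for ch in text)


def similarity(text: str, sensitive_word: str) -> float:
    """Similarity of two stroke codes in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, stroke_code(text), stroke_code(sensitive_word)).ratio()


def flag_similar(text: str, sensitive_words: list[str], threshold: float) -> list[str]:
    """Return the sensitive words whose stroke-code similarity is not less than the threshold N."""
    return [w for w in sensitive_words if similarity(text, w) >= threshold]


if __name__ == "__main__":
    print(stroke_code("十"))
    print(flag_similar("千", ["干"], threshold=0.6))
```

A word flagged this way would still need the context check described in the claim before the request is filtered or intercepted; that step is not modeled here.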
6. The edge node content auditing and filtering system of claim 4, characterized by: the sensitive image library is used for storing and updating sensitive image data; the image detection unit comprises face recognition detection and body privacy part exposure detection;
the face recognition detection compares faces in the image against the sensitive face images in the sensitive image library and, where the comparison is consistent (for example, with a person involved in a political event), filters or intercepts the user request;
and the body privacy part exposure detection performs frame-extraction detection on the image and, if most of a private part is found to be exposed, filters or intercepts the user request.
7. The edge node content auditing and filtering system of claim 1, characterized by: the updating module comprises an updating unit and a replacing unit;
the updating unit, on receiving a system adjustment signal, retrieves new sensitive vocabulary from the database for updating;
and the replacing unit replaces the existing sensitive vocabulary with the new sensitive vocabulary and transmits it to the data analysis module.
8. The edge node content auditing and filtering system of claim 1, characterized by: the output module comprises a filtering unit and an intercepting unit;
for a Get request from the user, the filtering unit filters the sensitive data of the request after the request has been audited by the data auditing module;
for a Post request from the user, the intercepting unit intercepts the sensitive data of the request after the request has been audited by the data auditing module.
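The split between filtering and intercepting in claim 8 could look roughly like the sketch below. It rests on an interpretation that the claim does not spell out: "filtering" is taken to mean masking the sensitive terms in the content of a Get request while still serving it, and "intercepting" is taken to mean rejecting a Post request outright. The `Decision` class, the function name and the handling of other HTTP methods are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Outcome of the output module for one audited request (hypothetical type)."""
    allowed: bool
    body: str


def apply_output_policy(method: str, body: str, sensitive_terms: list[str]) -> Decision:
    """Filter sensitive data for Get requests and intercept Post requests.

    Assumption: filtering masks each sensitive term with asterisks, and
    intercepting drops the request body entirely.
    """
    hits = [term for term in sensitive_terms if term and term in body]
    if not hits:
        # Nothing sensitive was found, so the request passes unchanged.
        return Decision(allowed=True, body=body)
    if method.upper() == "GET":
        for term in hits:
            body = body.replace(term, "*" * len(term))
        return Decision(allowed=True, body=body)
    # Post requests are intercepted; other methods are treated the same way
    # here as a conservative default, which the claim does not specify.
    return Decision(allowed=False, body="")


if __name__ == "__main__":
    print(apply_output_policy("GET", "a bad word", ["bad"]))
    print(apply_output_policy("POST", "a bad word", ["bad"]))
```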
9. An edge node content auditing and filtering method, characterized by comprising the following steps:
S1, collecting real-time user input requests and network environment historical data at the edge node, and storing the collected data in a database;
S2, retrieving the real-time user requests, screening hot events according to the scheduling sorting factors, and placing hot events in the audit queue with priority;
S3, auditing the real-time requests for sensitive characteristics and outputting the result, filtering in the case of a Get request from the user and intercepting in the case of a Post request;
S4, detecting the health of the edge node network environment using a naive Bayes model based on the historical data;
and S5, according to the detection result, adjusting the network environment by strengthening the sensitive information characteristics, so as to ensure a healthy network environment for the user.
10. The edge node content auditing and filtering method of claim 9, characterized by: in steps S4-S5, the health of the network environment is determined using a naive Bayes model, which makes the result more accurate; specifically, the posterior probability is derived from the prior probability on the basis of the accumulated original data;
according to the formula:
P(h/L) = P(h) * P(L/h) / P(L)
wherein h is a sensitive request, L is a real-time input request, P(h/L) is the probability that the real-time input request is a sensitive request, P(h) is the historical probability of sensitive requests in a healthy network environment, P(L/h) is the probability that the real-time input request appears among sensitive requests, and P(L) is the probability of the real-time input request, which is taken as 1;
the real-time input request L is decomposed into a set of independent conditions Y = {L_1, L_2, L_3, …, L_n};
according to the formula:
P(L/h) = P(L_1/h) * P(L_2/h) * P(L_3/h) * … * P(L_n/h);
wherein P(L_n/h) is the probability that the independent condition L_n occurs in a sensitive request;
according to the formula:
P(h/L) = P(h) * P(L_1/h) * P(L_2/h) * P(L_3/h) * … * P(L_n/h);
P(h/L) is calculated, and a network environment health threshold X is calculated from the historical healthy network environment data; when P(h/L) > X, the network environment is an unhealthy network environment, and a signal is transmitted to the database for system adjustment.
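The calculation in claim 10 can be sketched in a few lines, assuming the prior P(h), the per-condition probabilities P(L_i/h) and the threshold X have already been estimated from historical data. How they are estimated is not described in the claim, so the function names and the sample numbers below are purely illustrative. Because the claim takes P(L) as 1, the value computed is the unnormalized product rather than a normalized posterior.

```python
import math


def sensitive_score(prior: float, likelihoods: list[float]) -> float:
    """P(h/L) = P(h) * P(L_1/h) * ... * P(L_n/h), with P(L) taken as 1."""
    return prior * math.prod(likelihoods)


def is_unhealthy(prior: float, likelihoods: list[float], threshold: float) -> bool:
    """The network environment is treated as unhealthy when P(h/L) > X."""
    return sensitive_score(prior, likelihoods) > threshold


if __name__ == "__main__":
    # Illustrative numbers only: P(h) = 0.2 and two independent conditions.
    score = sensitive_score(0.2, [0.5, 0.5])
    print(score, is_unhealthy(0.2, [0.5, 0.5], threshold=0.01))
```

With many conditions the product of probabilities can become very small, so a practical version might sum logarithms instead; that is a common numerical choice rather than something stated in the claim.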
CN202011321876.8A 2020-11-23 2020-11-23 Edge node content auditing and filtering system and method Pending CN112364216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011321876.8A CN112364216A (en) 2020-11-23 2020-11-23 Edge node content auditing and filtering system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011321876.8A CN112364216A (en) 2020-11-23 2020-11-23 Edge node content auditing and filtering system and method

Publications (1)

Publication Number Publication Date
CN112364216A true CN112364216A (en) 2021-02-12

Family

ID=74533180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011321876.8A Pending CN112364216A (en) 2020-11-23 2020-11-23 Edge node content auditing and filtering system and method

Country Status (1)

Country Link
CN (1) CN112364216A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779250A (en) * 2021-09-08 2021-12-10 上海松欣智能科技有限公司 Standardized text data processing system
CN114257828A (en) * 2021-12-20 2022-03-29 上海哔哩哔哩科技有限公司 Live broadcast audit content processing method and system
CN114760490A (en) * 2022-04-15 2022-07-15 上海哔哩哔哩科技有限公司 Video stream processing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138485A1 (en) * 2008-12-03 2010-06-03 William Weiyeh Chow System and method for providing virtual web access
CN207197660U (en) * 2017-09-15 2018-04-06 北京为韵科技有限公司 A kind of distribution type optical fiber sensing equipment for speech recognition and patrol work attendance
CN109948052A (en) * 2019-03-08 2019-06-28 上海七牛信息技术有限公司 A kind of internet information filtering auditing system, method and device
CN110225373A (en) * 2019-06-13 2019-09-10 腾讯科技(深圳)有限公司 A kind of video reviewing method, device and electronic equipment
CN110837615A (en) * 2019-11-05 2020-02-25 福建省趋普物联科技有限公司 Artificial intelligent checking system for advertisement content information filtering
CN111368535A (en) * 2018-12-26 2020-07-03 珠海金山网络游戏科技有限公司 Sensitive word recognition method, device and equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779250A (en) * 2021-09-08 2021-12-10 上海松欣智能科技有限公司 Standardized text data processing system
CN114257828A (en) * 2021-12-20 2022-03-29 上海哔哩哔哩科技有限公司 Live broadcast audit content processing method and system
CN114760490A (en) * 2022-04-15 2022-07-15 上海哔哩哔哩科技有限公司 Video stream processing method and device
CN114760490B (en) * 2022-04-15 2024-03-19 上海哔哩哔哩科技有限公司 Video stream processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination