CN112364216A

CN112364216A - Edge node content auditing and filtering system and method

Info

Publication number: CN112364216A
Application number: CN202011321876.8A
Authority: CN
Inventors: 肖何; 王金高; 唐雅琴
Original assignee: Shanghai Jingxin Network Technology Co ltd
Current assignee: Shanghai Jingxin Network Technology Co ltd
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2021-02-12

Abstract

The invention discloses a system and a method for auditing and filtering content at edge nodes, and belongs to the technical field of the Internet. The edge node content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module. The method for auditing and filtering content at the edge node comprises the following steps: S1, collecting data at the edge node for transmission and storage; S2, screening the data and sorting it by audit priority; S3, auditing sensitive characteristics and outputting the result, with the request filtered or intercepted; S4, detecting the health of the edge node network environment with a naive Bayes model based on historical data; and S5, adjusting the sensitive characteristics according to the detection result to purify the network environment. The invention improves auditing efficiency, reduces resource waste, supports system self-checking, and helps keep the user's network environment healthy.

Description

Edge node content auditing and filtering system and method

Technical Field

The invention relates to the technical field of Internet, in particular to a system and a method for auditing and filtering contents of edge nodes.

Background

With the rapid development of Internet technology, the number of Internet users keeps growing and trending topics emerge continuously. While the network brings convenience to users, it also carries a great deal of sensitive information, such as graphic violence, politically sensitive speech, pornography, gambling, fraud and advertising. How to filter the information that users request or upload so as to create a healthy and safe network environment has therefore become an important task for current Internet technology.

In a common current scheme, the system filters sensitive information uploaded or received by users out of a large volume of user data according to blacklist rules, then identifies the offending user and imposes a ban. During content auditing, however, the large volume of data is not classified and there is no rule for prioritizing audits, so users cannot receive trending information promptly and experience delay. Moreover, such a system has no detection mechanism and no clear picture of the current network environment, so its control of sensitive characteristics is imprecise, which seriously degrades the user experience and the normal operation of the network environment.

Therefore, an accurate and efficient content auditing and filtering system is urgently needed, and the system can accurately reflect the current state of the network environment.

Disclosure of Invention

The present invention is directed to a system and a method for auditing and filtering contents of edge nodes, so as to solve the problems in the background art.

In order to solve the technical problems, the invention provides the following technical scheme:

an edge node content auditing and filtering system, characterized by: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;

the output end of the data acquisition module is electrically connected with the input ends of the data analysis module and the database; the output end of the data analysis module is electrically connected with the input ends of the database and the data auditing module; the output end of the data auditing module is electrically connected with the input ends of the detection module and the output module; the output end of the detection module is electrically connected with the input end of the database; the output end of the database is electrically connected with the input end of the updating module; the output end of the updating module is electrically connected with the input end of the data analysis module.
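The connections listed above can be pictured as a small directed graph. The following is only a sketch under stated assumptions: the short module names are illustrative, and the audit module is read as feeding the inputs of the detection and output modules. The helper `reachable` is hypothetical and simply shows where data leaving one module can end up.

```python
from collections import deque

# Directed connections between modules, following the wiring description.
# Assumption: the audit module feeds the *inputs* of the detection and
# output modules; the short module names are illustrative only.
CONNECTIONS = {
    "acquisition": ["analysis", "database"],
    "analysis": ["database", "audit"],
    "audit": ["detection", "output"],
    "detection": ["database"],
    "database": ["update"],
    "update": ["analysis"],
    "output": [],
}


def reachable(start: str) -> set[str]:
    """Return every other module that data leaving `start` can eventually reach."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in CONNECTIONS[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}
```

Under this reading, the detection module sits on a feedback loop (detection, database, update, analysis, audit), which matches the self-adjusting behaviour the description attributes to the system.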

According to the technical scheme, the data acquisition module comprises a historical data acquisition unit and a real-time data acquisition unit;

the historical data acquisition unit acquires data under a normal historical network environment and stores the data in a database, and the real-time data acquisition unit acquires a request sent by a user in real time and transmits the request to the data analysis module.

According to the technical scheme, the data analysis module comprises a scheduling sorting factor unit and an access path recording unit;

the scheduling sorting factor unit covers the upload amount, the download amount, the comment amount and the search amount; the upload amount is denoted a_i, the download amount b_i, the comment amount c_i and the search amount d_i; these respectively form the sets a＝{a₁, a₂, a₃, …, a_n}, b＝{b₁, b₂, b₃, …, b_n}, c＝{c₁, c₂, c₃, …, c_n} and d＝{d₁, d₂, d₃, …, d_n}, in which a_i, b_i, c_i and d_i are all constant terms;

according to the formula:

K_i＝a_i+b_i+c_i+d_i-M

wherein K_i is the hotspot event index value and M is the hotspot event index threshold;

when K_i > 0, the event is a hot event;

all K_i are sorted from large to small to form the set K＝{K₁, K₂, K₃, …, K_n}, in which K₁≥K₂≥K₃≥…≥K_n, and the corresponding data is transmitted to the data auditing module in the order of the set;

and the access path recording unit is used for correlating the access path to the user sending the request, recording and storing the access path to the database.

According to the technical scheme, the data auditing module comprises a sensitive text information auditing unit and a sensitive image information auditing unit;

the sensitive text information auditing unit comprises a sensitive vocabulary library and a text detection unit; the sensitive image information auditing unit comprises a sensitive image library and an image detection unit.

According to the technical scheme, the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;

the sensitive vocabulary detection is compared according to a sensitive vocabulary library, if the sensitive vocabulary exists, sensitive word labeling is carried out on the request, and then the user request is filtered or intercepted;

the homophone detection reads the text requested by the user aloud, records the resulting sound waveform, and compares it with the waveforms of the sensitive words in the sensitive vocabulary library; when the waveforms match, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word has another meaning; if it does not, the word is judged to be a homophone of a sensitive word, and the user request is filtered or intercepted;

the similar-character detection stroke-codes the text requested by the user: horizontal, vertical, left-falling, dot and turning strokes are each assigned a mark, and every character is encoded according to the order and number of its strokes; for example, a cross-shaped character is coded AB; the resulting code is compared with the codes of the sensitive words, with the similarity threshold set to N; when the similarity value of the two codes is not less than N, the request is labeled with the sensitive word, and an intelligent input method is used to judge whether the labeled word is appropriate in its context; if it is not, the word is judged to be a similar-shaped sensitive word, and the user request is filtered or intercepted.
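The first of the three checks, direct comparison against the sensitive vocabulary library, can be sketched as a plain substring search. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the lexicon is passed in as a set of strings, and homophone and similar-character detection are not modeled here.

```python
def label_sensitive_words(text: str, lexicon: set[str]) -> list[tuple[int, str]]:
    """Return (position, word) for every lexicon entry found in `text`."""
    hits = []
    for word in lexicon:
        start = text.find(word)
        while start != -1:
            hits.append((start, word))
            # Continue searching after the current match position.
            start = text.find(word, start + 1)
    return sorted(hits)


def needs_action(text: str, lexicon: set[str]) -> bool:
    """True when at least one sensitive word is labeled, so that the
    request should be filtered or intercepted."""
    return bool(label_sensitive_words(text, lexicon))
```

A production system would more likely use a multi-pattern matcher for large lexicons; the linear scan above is only meant to make the labeling step concrete.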

According to the technical scheme, the sensitive image library is used for storing and updating sensitive image data; the image detection unit comprises face recognition detection and body privacy part exposure detection;

the face recognition detection compares the image against the sensitive face images in the sensitive image library; for example, when the image matches a person involved in a political event, the user request is filtered or intercepted;

the private body part exposure detection samples frames from the image and, if most of a private body part is found to be exposed, filters or intercepts the user request.

According to the technical scheme, the updating module comprises an updating unit and a replacing unit;

when the updating unit receives a signal for adjusting the system, a new sensitive vocabulary is called from the database for updating;

and the replacing unit replaces the new sensitive vocabulary and transmits the new sensitive vocabulary to the data analysis module.

According to the technical scheme, the output module comprises a filtering unit and an intercepting unit;

the filtering unit is used for filtering the sensitive data of the user request after the user request is audited by the data auditing module aiming at the Get request of the user;

the intercepting unit is used for intercepting the sensitive data of the user request after the user request is audited by the data auditing module aiming at the Post request of the user.
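The split between the two output units can be illustrated as a dispatch on the HTTP method. This is a sketch under assumptions: the `Action` enum and `decide` function are hypothetical names, and the comments on what filtering and intercepting do in practice are one plausible reading, since the description only states that Get requests are filtered and Post requests are intercepted.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"            # audit found nothing sensitive
    FILTER = "filter"        # Get request: sensitive data is filtered out
    INTERCEPT = "intercept"  # Post request: sensitive data is intercepted


def decide(method: str, is_sensitive: bool) -> Action:
    """Map an audited request to the output unit that should handle it."""
    if not is_sensitive:
        return Action.PASS
    method = method.upper()
    if method == "GET":
        return Action.FILTER
    if method == "POST":
        return Action.INTERCEPT
    # Other methods are not covered by the description.
    raise ValueError(f"unsupported method: {method}")
```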

A method for auditing and filtering contents of an edge node is characterized by comprising the following steps: the method comprises the following steps:

s1, collecting the real-time request of the user and the historical data of the network environment at the edge node, and storing the data in a database;

s2, calling a user real-time request, screening hot events according to a scheduling sorting factor, and preferentially entering an audit queue;

s3, performing sensitive characteristic audit on the real-time request, outputting, filtering under a Get request of a user, and intercepting under a Post request;

s4, performing health detection of the edge node network environment by using a naive Bayes model according to historical data;

and S5, according to the detection result, the network environment is adjusted by enhancing the sensitive information characteristics, and the user is ensured to have a healthy network environment.

According to the technical scheme, in steps S4-S5 the health of the network environment is judged with a naive Bayes model: the posterior probability is inferred from the prior probability accumulated from the original data, which makes the result more accurate;

according to the formula:

P(h/L)＝P(h)*P(L/h)/P(L)

wherein h is a sensitive request, L is a real-time input request, P(h/L) is the probability that the real-time input request is a sensitive request, P(h) is the historical probability of a sensitive request under a healthy network environment, P(L/h) is the probability that the real-time input request appears among sensitive requests, and P(L) is the probability that the real-time request is input; because the request has in fact been input in this scheme, P(L)＝1;

decomposing the real-time input request L to form the set of independent conditions Y＝{L₁, L₂, L₃, …, L_n};

According to the formula:

P(L/h)＝P(L₁/h)*P(L₂/h)*P(L₃/h)*…*P(L_n/h)；

wherein P(L_n/h) is the probability that the independent condition L_n occurs in a sensitive request;

according to the formula:

P(h/L)＝P(h)*P(L₁/h)*P(L₂/h)*P(L₃/h)*…*P(L_n/h)；

P(h/L) is then calculated, and a network environment health threshold X is derived from the historical healthy network environment data; when P(h/L) is greater than X, the network environment is judged to be unhealthy, and a signal is transmitted to the database for system adjustment.

Compared with the prior art, the invention has the following beneficial effects:

1. the edge nodes are used for content auditing and filtering, so that the auditing speed is increased, the requirement on hardware is reduced, the required amount of bandwidth is reduced, the requirement of a user is met in time, and the method has great advantages compared with the prior content auditing and filtering system;

2. during auditing, homophones are audited by sound waveform comparison, which improves auditing accuracy, prevents malicious users from receiving or publishing sensitive information by substituting homophones for sensitive words, and makes the audit of sensitive information considerably more comprehensive; in addition, characters of similar shape are audited by stroke coding, which prevents malicious users from expressing sensitive information with look-alike characters and thereby harming the network environment;

3. a naive Bayes model is added in the system, a normal healthy network environment threshold value is established by utilizing collected original data, the real-time request is decomposed, the posterior probability is solved according to the prior probability by utilizing independent and irrelevant conditions, and the probability that the real-time request becomes a sensitive request is used as a standard for measuring whether the network environment is healthy or not, so that the stable and normal operation of the system is maintained.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a schematic diagram of a module structure of an edge node content auditing and filtering system according to the present invention;

FIG. 2 is a schematic diagram illustrating the steps of a method for auditing and filtering the content of an edge node according to the present invention;

FIG. 3 is a flow chart of a system and method for edge node content audit filtering in accordance with the present invention;

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-3, the present invention provides the following technical solutions:

in this embodiment, an edge node content auditing and filtering system is characterized in that: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;

the scheduling sorting factor unit covers the upload amount, the download amount, the comment amount and the search amount; the upload amount is denoted a_i, the download amount b_i, the comment amount c_i and the search amount d_i;

The data acquisition module acquires 5 real-time requests, and the information is as follows:

Request 1: upload amount a₁＝100, download amount b₁＝20, comment amount c₁＝7, search amount d₁＝500;

Request 2: upload amount a₂＝60, download amount b₂＝10, comment amount c₂＝1, search amount d₂＝240;

Request 3: upload amount a₃＝70, download amount b₃＝20, comment amount c₃＝10, search amount d₃＝160;

Request 4: upload amount a₄＝140, download amount b₄＝120, comment amount c₄＝600, search amount d₄＝6500;

Request 5: upload amount a₅＝50, download amount b₅＝6, comment amount c₅＝7, search amount d₅＝80;

The hotspot event index threshold M is set to 300;

according to the formula:

K_i＝a_i+b_i+c_i+d_i-M

wherein K_i is the hotspot event index value; the K_i values obtained for the requests are respectively: K₁＝327; K₂＝11; K₃＝-40; K₄＝7060; K₅＝-157;

Judging that the request 1, the request 2 and the request 4 are hot events;

all K_i are sorted from large to small to form the audit sequence K₄, K₁, K₂, K₃, K₅, and the corresponding requests are transmitted to the data auditing module in that order;
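The worked example above can be reproduced in a few lines. This is a minimal sketch using the five requests and the threshold M = 300 from this embodiment; the names `hotspot_index`, `REQUESTS`, `indices`, `hot` and `audit_order` are illustrative and do not come from the source.

```python
M = 300  # hotspot event index threshold from the embodiment

# request id -> (upload, download, comment, search) amounts
REQUESTS = {
    1: (100, 20, 7, 500),
    2: (60, 10, 1, 240),
    3: (70, 20, 10, 160),
    4: (140, 120, 600, 6500),
    5: (50, 6, 7, 80),
}


def hotspot_index(upload: int, download: int, comment: int, search: int,
                  threshold: int = M) -> int:
    """K_i = a_i + b_i + c_i + d_i - M."""
    return upload + download + comment + search - threshold


indices = {rid: hotspot_index(*amounts) for rid, amounts in REQUESTS.items()}

# A request is a hot event when its index is strictly positive.
hot = [rid for rid, k in indices.items() if k > 0]

# Audit order: largest index first.
audit_order = sorted(indices, key=indices.get, reverse=True)

print(indices)
print(hot)
print(audit_order)
```

Note that non-hot requests (here requests 3 and 5) still enter the audit queue; the index only determines their position in it.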

the sensitive text information auditing unit comprises a sensitive vocabulary library and a text detection unit; the sensitive image information auditing unit comprises a sensitive image library and an image detection unit;

the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;

the data acquisition module acquires that 4 of 5 real-time requests are text information and 1 is image information, and the information is as follows:

Request 1: "search: is the user of the refrigerator held on his own?";

Request 2: "ask questions: how long can bananas generally be preserved?";

Request 3: "upload: TV drama "West Yong Ji".";

Request 4: "upload: you are a white eat.";

The sensitive vocabulary detection is compared according to a sensitive vocabulary library, and no sensitive vocabulary exists in 4 requests;

the homophone detection reads the 4 requests aloud in sequence, records the sound waveforms, and compares them with the waveforms of the sensitive words in the sensitive vocabulary library; the phrase "white eat" in request 4 is found to sound the same as the sensitive word "dementia", so the request is labeled with the sensitive word; the intelligent input method then finds that in the current context the phrase is meant to express "dementia", so it is judged to be a homophone sensitive word and request 4 is intercepted;

the similar-character detection stroke-codes the 4 request texts, denoting horizontal strokes A, vertical strokes B, left-falling strokes C, dot strokes D and turning strokes E, and encodes each character according to the order and number of its strokes; the word "user" in request 1 is coded "DEAC" and the sensitive word "dead" is coded "EAC"; the coding similarity threshold N is set to 1, and because the similarity value of the two codes is not less than 1, request 1 is labeled with the sensitive word; the intelligent input method then finds that the sentence reads naturally and contains no similar-shaped sensitive word, so request 1 is output normally.
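The stroke-coding comparison can be sketched as follows. The description does not define how the similarity value is computed, so this example makes a loudly labeled assumption: similarity is the length of the longest run of strokes shared by both codes, divided by the length of the sensitive word's code. With that assumption, N = 1 means the sensitive code appears in full inside the candidate code. All function names are hypothetical, and the follow-up context check by the intelligent input method is not modeled.

```python
# Stroke marks from the embodiment.
STROKE_CODES = {
    "horizontal": "A",
    "vertical": "B",
    "left_falling": "C",
    "dot": "D",
    "turning": "E",
}


def encode(strokes: list[str]) -> str:
    """Encode a character from its strokes, given in writing order."""
    return "".join(STROKE_CODES[s] for s in strokes)


def longest_common_run(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def similarity(code: str, sensitive_code: str) -> float:
    """ASSUMED metric: shared run length relative to the sensitive code."""
    return longest_common_run(code, sensitive_code) / len(sensitive_code)


def is_candidate(code: str, sensitive_code: str, n: float = 1.0) -> bool:
    """Label the word when the similarity value is not less than N."""
    return similarity(code, sensitive_code) >= n
```

Under this assumed metric the codes "DEAC" and "EAC" from the embodiment reach a similarity of 1, so the word is labeled and passed on to the context check, which is where request 1 is cleared.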

the remaining 1 of the 5 real-time requests acquired by the data acquisition module is image information; face recognition detection and private body part exposure detection find no sensitive feature in it, so request 5 is output normally.

According to the technical scheme, the updating module comprises an updating unit and a replacing unit, and when the updating unit receives a signal for adjusting the system, a new sensitive vocabulary is called from the database for updating; and the replacing unit replaces the new sensitive vocabulary and transmits the new sensitive vocabulary to the data analysis module.

The invention provides a method for auditing and filtering contents of edge nodes, which is characterized by comprising the following steps: the method comprises the following steps:

according to the formula:

P(h/L)＝P(h)*P(L/h)/P(L)

in the embodiment, 100 pieces of data are randomly extracted from the historical healthy network environment, of which 5 are sensitive data, so that P(h)＝5/100＝0.05;

10000 condition factors are extracted from the sensitive data in the healthy network environment, and the real-time input request L is decomposed into conditions to obtain Y＝{L₁, L₂, L₃, L₄, L₅}; condition factor L₁ occurs 9000 times in the sensitive data; condition factor L₂ occurs 8000 times; condition factor L₃ occurs 7000 times; condition factor L₄ occurs 8000 times; condition factor L₅ occurs 8000 times;

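according to the formula:

P(L/h)＝P(L₁/h)*P(L₂/h)*P(L₃/h)*P(L₄/h)*P(L₅/h)＝0.9*0.8*0.7*0.8*0.8＝0.32256;

according to the formula:

P(h/L)＝P(h)*P(L/h)＝0.05*0.32256＝0.016128;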

according to the historical healthy network environment, the network environment health threshold is calculated as X＝0.002; since 0.016128 is greater than 0.002, the network environment is unhealthy at this moment, and the detection module transmits a signal to the database to carry out system adjustment.
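The arithmetic of this embodiment can be checked with a short script. It is a sketch that uses only the figures given above (5 sensitive items out of 100, 10000 condition factors, the five occurrence counts, and X = 0.002) and takes P(L) = 1 as the description does; the function and variable names are illustrative.

```python
import math


def posterior(prior: float, likelihoods: list[float]) -> float:
    """P(h/L) = P(h) * P(L1/h) * ... * P(Ln/h), with P(L) taken as 1."""
    return prior * math.prod(likelihoods)


# Prior: 5 sensitive items among 100 sampled from the healthy history.
prior = 5 / 100

# Each condition factor's occurrence count among 10000 extracted factors.
total_factors = 10000
counts = [9000, 8000, 7000, 8000, 8000]
likelihoods = [count / total_factors for count in counts]

p_sensitive = posterior(prior, likelihoods)

X = 0.002  # health threshold from the embodiment
unhealthy = p_sensitive > X

print(round(p_sensitive, 6), unhealthy)
```

Because the product of many probabilities shrinks quickly, an implementation handling longer condition sets might sum logarithms instead; that is a common numerical precaution and not something the description specifies.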

The working principle of the invention is as follows: content auditing and filtering is performed at the edge nodes, which speeds up auditing, lowers hardware requirements, reduces bandwidth demand and serves users promptly. The data acquisition module acquires historical and real-time data, and the database stores the data and makes it available; the data analysis module sorts the data by priority; the data auditing module audits sensitive characteristics; the output module filters or intercepts user requests; the detection module verifies the health and stability of the system based on a naive Bayes model; and when the network environment is unhealthy, the updating module updates and replaces the data in time to keep the system running stably;

during auditing, homophones are audited by sound waveform comparison, which improves auditing accuracy, prevents malicious users from receiving or publishing sensitive information by substituting homophones for sensitive words, and makes the audit of sensitive information considerably more comprehensive; in addition, characters of similar shape are audited by stroke coding, which prevents malicious users from expressing sensitive information with look-alike Chinese characters and thereby harming the network environment.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An edge node content auditing and filtering system, characterized by: the content auditing and filtering system comprises a data acquisition module, a data analysis module, a database, a data auditing module, a detection module, an updating module and an output module, wherein the data acquisition module is used for receiving user requests and acquiring historical data, the data analysis module is used for sorting the priority of data content, the database is used for storing data, the data auditing module is used for auditing the sensitive characteristics of the data, the detection module is used for detecting the state of a network environment, the updating module is used for updating and replacing the data, and the output module is used for outputting a final result;

2. The edge node content auditing and filtering system of claim 1, characterized by: the data acquisition module comprises a historical data acquisition unit and a real-time data acquisition unit;

3. The edge node content auditing and filtering system of claim 1, characterized by: the data analysis module comprises a scheduling sorting factor unit and an access path recording unit;

according to the formula:

K_i＝a_i+b_i+c_i+d_i-M

when K_i > 0, the event is a hot event;

4. The edge node content auditing and filtering system of claim 1, characterized by: the data auditing module comprises a sensitive text information auditing unit and a sensitive image information auditing unit;

5. The edge node content auditing and filtering system of claim 4, characterized by: the sensitive vocabulary library is used for storing and updating sensitive vocabulary data; the text detection unit comprises sensitive vocabulary detection, homophone detection and similar character detection;

6. The edge node content auditing and filtering system of claim 4, characterized by: the sensitive image library is used for storing and updating sensitive image data; the image detection unit comprises face recognition detection and body privacy part exposure detection;

7. The edge node content auditing and filtering system of claim 1, characterized by: the updating module comprises an updating unit and a replacing unit;

8. The edge node content auditing and filtering system of claim 1, characterized by: the output module comprises a filtering unit and an intercepting unit;

9. A method for auditing and filtering contents of an edge node is characterized by comprising the following steps: the method comprises the following steps:

s1, collecting the real-time user input request and the network environment historical data at the edge node, and storing the collected data in a database;

10. The method for auditing and filtering contents of an edge node according to claim 9, characterized by: in steps S4-S5, the health degree of the network environment is determined by using a naive bayes model, so that the result is more accurate, specifically, the result is more accurate by using the prior probability to deduce the posterior probability according to the accumulation of the original data;

according to the formula:

P(h/L)＝P(h)*P(L/h)/P(L)

wherein h is a sensitive request, L is a real-time input request, P(h/L) is the probability that the real-time input request is a sensitive request, P(h) is the historical probability of a sensitive request under a healthy network environment, P(L/h) is the probability that the real-time input request appears among sensitive requests, and P(L) is the probability that the real-time request is input, which is taken as 1;

decomposing the real-time input request L to form the set of independent conditions Y＝{L₁, L₂, L₃, …, L_n};

According to the formula:

P(L/h)＝P(L₁/h)*P(L₂/h)*P(L₃/h)*...*P(L_n/h)；

according to the formula:

P(h/L)＝P(h)*P(L₁/h)*P(L₂/h)*P(L₃/h)*…*P(L_n/h)；