WO2020010568A1 - Big data artificial intelligence analysis device - Google Patents

Big data artificial intelligence analysis device

Info

Publication number
WO2020010568A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
review
sensitive
packet
formatted
Prior art date
Application number
PCT/CN2018/095415
Other languages
English (en)
French (fr)
Inventor
陈钦鹏
Original Assignee
深圳齐心集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳齐心集团股份有限公司 filed Critical 深圳齐心集团股份有限公司
Priority to PCT/CN2018/095415 priority Critical patent/WO2020010568A1/zh
Publication of WO2020010568A1 publication Critical patent/WO2020010568A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40Support for services or applications

Definitions

  • the invention relates to a big data artificial intelligence analysis device.
  • With the rapid development of science and technology, society, and the economy, the era of big data has arrived.
  • The sources of big data's value are increasingly comprehensive: data from different data dimensions, time dimensions, and spatial dimensions, as well as cross-domain data, all converge.
  • The value of big data lies in providing customers with more accurate, personalized services.
  • Big data also has great potential value in artificial intelligence areas such as language, vision, and prediction.
  • Artificial intelligence (AI) is a new technical science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems.
  • a big data artificial intelligence analysis device which includes:
  • a data acquisition unit for capturing a data packet of a node on the network online, and storing the data packet in a storage unit;
  • the data review unit is used to convert the format of the data packets captured by the data acquisition unit, to review the formatted data packets, and to save the formatted data packets and the review results to a storage unit;
  • the storage unit is configured to store the data packets captured by the data acquisition unit, the formatted data packets and review results generated by the data review unit, and the review set for review by the data review unit;
  • the cloud server is used to provide a review set used by the data review unit for review.
  • the review set includes a set of sensitive words used to review text and a set of sensitive images used to review images and videos;
  • the analysis device performs the following process:
  • Process 1: The cloud server updates the review set into the storage unit in advance;
  • Process 2: At a preset interval, the data acquisition unit captures online the set of data packets of a node on the network over a preset period and stores it in the storage unit;
  • Process 3: The data review unit reads the set of data packets awaiting review from the storage unit, extracts one or more data packets, converts their format, and then reviews the formatted data packets against the review set stored in the storage unit. If a sensitive word or sensitive image is found, the data packet is marked as failing the review. During this process the formatted data packets and the review results are saved to the storage unit, and the next data packet is extracted for review until the whole set has been reviewed;
  • Process 4: The sensitive words or sensitive images found in the data packets that failed the review are analyzed, and the sensitive words and sensitive images are combined to determine the danger level according to preset rules. If the danger level is high, the transmission of the data packet is interrupted, or the data transmission of the node may be interrupted.
  • The danger level is determined as follows. The weights of the sensitive words found are denoted {Ai}, where i is the number of sensitive words; for example, if 3 sensitive words are found, their weights are A1, A2, and A3. The weight of each sensitive word is known (stored in the database in advance; all weights are greater than 0 and less than 1). The weights of the sensitive images found are denoted {Bj}, where j is the number of sensitive images; for example, if 5 sensitive images are found, their weights are B1, B2, B3, B4, and B5.
  • the comprehensive weight S is calculated as follows:
  • the danger level is determined based on the value of the comprehensive weight S.
  • The data review unit converts the format of a data packet as follows: a 64-bit header is added to the front of the data packet. The first bit of the header is the identification bit: 0 means the review failed and 1 means the review passed.
  • The second bit is the classification bit that identifies the review result. When the review fails, the data packet is classified according to the category of the review set that was matched.
  • For example, if the review set contains information such as pro-Japanese, anti-China content, the mark is 1, indicating that improper speech was found; if the review set involves confidential information such as military intelligence, the mark is 2, indicating that confidential information was found; if the review set involves dangerous information such as explosions, terrorism, or suicide, the mark is 3, indicating that a threat to public order was found.
  • The third to eighth bits (2-7) are reserved; the ninth to sixty-fourth bits (8-63) are the packet sequence number.
  • The packet sequence number determines the position of the data packet within the whole data (used in transmission and decoding).
  • Bits 8-31 (24 bits in total) are the upper 24 bits of the packet sequence number, and bits 32-63 are the lower 32 bits; the format of the formatted data packet after format conversion is as follows:
  • the analysis device of the present invention does not disturb the normal transmission of data packets, only obtains the data packets from the nodes to be monitored and formats them, and the formatted data packets also carry the packet sequence number, which can effectively cooperate with the data review unit.
  • the system tracks and obtains the entire complete data information through the packet sequence number.
  • the data review unit reviews the formatted data packet after the format conversion according to the review set stored in the storage unit, and specifically includes:
  • the comparison process is as follows:
  • The content of the data packet comprises N characters, which are matched against the sensitive words in the sensitive word set.
  • The matching algorithm can be an existing, relatively mature algorithm such as the KMP algorithm or the LCA algorithm. If a sensitive word is matched, the formatted data packet is considered to have failed the review, and the first identification bit of the header is set to 0 (review failed); at the same time, the data packet is classified according to the preset category of the matched sensitive word, and the result is written into the second identification bit of the header;
  • For image data, the original image is obtained according to the image encoding and then compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review.
  • For video data, images are captured frame by frame and each is compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review.
  • the analysis device of the present invention has two aspects of optimization:
  • the present invention does not disturb the normal transmission of the data packet, and only obtains the data packet from the node to be monitored and formats it.
  • the formatted data packet also carries the packet sequence number, which can effectively cooperate with the data review unit to work.
  • the system tracks and obtains the complete data information through the sequence number of the packet. After decoding, it can be further audited to verify the text, pictures, and videos in real time posted on the network;
  • this application can implement a comprehensive review of text, images, and videos, and obtain the comprehensive weight calculation formulas for sensitive words and sensitive images through experimental deduction and actual testing.
  • When the comprehensive weight is greater than a preset value, the content is considered to have a high danger level and its transmission is terminated directly. This avoids the behavior of prior-art review algorithms, which mark content as abnormal and block its transmission as soon as any sensitive word or sensitive image is retrieved.
  • the invention provides a big data artificial intelligence analysis device, which includes: a plurality of data acquisition units, a plurality of data review units, a plurality of storage units, and a cloud server.
  • the data acquisition unit and the data review unit both establish a communication connection with the storage unit.
  • When data is captured, the network is divided into tens of thousands of nodes.
  • The nodes are first divided into regions according to their physical location. Each region has one data acquisition unit, one data review unit, and one storage unit. The data acquisition units, data review units, and storage units of all regions establish communication connections with the cloud server.
  • the data acquisition unit is configured to capture a data packet of a node on the network online and store the data packet in a storage unit;
  • the data review unit is used to convert the format of the data packets captured by the data acquisition unit, to review the formatted data packets, and to save the formatted data packets and the review results to a storage unit;
  • the storage unit is configured to store the data packets captured by the data acquisition unit, the formatted data packets and review results generated by the data review unit, and the review set for review by the data review unit;
  • the cloud server is used to provide a review set used by the data review unit for review.
  • the review set includes a set of sensitive words used to review text, and a set of sensitive images used to review images and videos.
  • the analysis device performs the following process:
  • Process 1: The cloud server updates the review set into the storage unit in advance;
  • Process 2: At a preset interval, the data acquisition unit captures online the set of data packets of a node on the network over a preset period and stores it in the storage unit; for example, once every 1 minute it captures the node's data packets continuously for 10 seconds to form the data packet set;
  • Process 3: The data review unit reads the set of data packets awaiting review from the storage unit, extracts one or more data packets, converts their format, and then reviews the formatted data packets against the review set stored in the storage unit. If a sensitive word or sensitive image is found, the data packet is marked as failing the review. During this process the formatted data packets and the review results are saved to the storage unit, and the next data packet is extracted for review until the whole set has been reviewed;
  • The data review unit converts the format of a data packet as follows: a 64-bit header is added to the front of the data packet. The first bit of the header is the identification bit: 0 means the review failed and 1 means the review passed.
  • The second bit is the classification bit that identifies the review result.
  • When the review fails, the data packet is classified according to the category of the review set that was matched. For example, if the review set contains information such as pro-Japanese, anti-China content, the mark is 1, indicating that improper speech was found; if the review set involves confidential information such as military intelligence, the mark is 2, indicating that confidential information was found; if the review set involves dangerous information such as explosions, terrorism, or suicide, the mark is 3, indicating that a threat to public order was found.
  • The third to eighth bits (2-7) are reserved;
  • the ninth to sixty-fourth bits (8-63) are the packet sequence number.
  • The packet sequence number determines the position of the data packet within the whole data (used in transmission and decoding).
  • Bits 8-31 (24 bits in total) are the upper 24 bits of the packet sequence number, and bits 32-63 are the lower 32 bits; the format of the formatted data packet after format conversion is as follows:
  • the data review unit reviews the formatted data packet after format conversion according to the review set stored in the storage unit, which specifically includes:
  • the comparison process is as follows:
  • The content of the data packet comprises N characters, which are matched against the sensitive words in the sensitive word set.
  • The matching algorithm can be an existing, relatively mature algorithm such as the KMP algorithm or the LCA algorithm. If a sensitive word is matched, the formatted data packet is considered to have failed the review, and the first identification bit of the header is set to 0 (review failed); at the same time, the data packet is classified according to the preset category of the matched sensitive word, and the result is written into the second identification bit of the header;
  • For image data, the original image is obtained according to the image encoding and then compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review.
  • For video data, images are captured frame by frame and each is compared with the comparison pictures in the sensitive image set, using the same preset similarity value (for example, a similarity greater than 80% is considered similar).
  • /dev/video0 is the video device, H263 is the video encoding, the output format is RTP, and 192.168.1.103:5060 is the defined IP address and port;
  • the SDP file corresponding to the code stream is redirected to /tmp/ffmpeg.sdp.
  • the next step is to perform one-to-one matching comparison of sensitive image sets for image1.jpg and image2.jpg respectively.
  • image comparison technologies including algorithms used in this application are all existing technologies, and will not be described here.
  • Process 4: The sensitive words or sensitive images found in the data packets that failed the review are analyzed, and the sensitive words and sensitive images are combined to determine the danger level according to preset rules. If the danger level is high, the transmission of the data packet is interrupted, or the data transmission of the node may be interrupted.
  • it can also output event statistics reports and situation analysis to analysts, which will not be expanded here.
  • The danger level is determined as follows. The weights of the sensitive words found are denoted {Ai}, where i is the number of sensitive words; for example, if 3 sensitive words are found, their weights are A1, A2, and A3. The weight of each sensitive word is known (stored in the database in advance; all weights are greater than 0 and less than 1). The weights of the sensitive images found are denoted {Bj}, where j is the number of sensitive images; for example, if 5 sensitive images are found, their weights are B1, B2, B3, B4, and B5. The comprehensive weight S of the sensitive words and sensitive images is then calculated.
  • the formula is as follows:
  • μ is a weight coefficient whose optimal value range, obtained through extensive experiments, is 0.8-1.2.
  • The danger level is determined from the value of the comprehensive weight S. For example, if S > 0.45 the danger level is considered high; if 0.2 ≤ S ≤ 0.45 the danger level is considered intermediate; and if S < 0.2 the danger level is considered low.
  • For example, with a mean sensitive word weight A = 0.5 and a mean sensitive image weight B = 0.7, any value of μ in the 0.8-1.2 interval gives S > 0.45.
  • the analysis device of the present invention is deployed in a network node in a bypass manner without changing the network topology, does not interfere with the normal business habits of users, and does not disturb the normal transmission of data packets. Therefore, it has the advantages of simple deployment and high stability.
  • The formatted data packet also carries the packet sequence number, which lets it work effectively with the data review unit. When a data packet fails the review, the system uses the packet sequence number to track down and obtain the complete data message and decodes it.
  • The present invention determines the comprehensive weight from the sensitive words and sensitive images, and determines the danger level according to the comprehensive weight.
  • The transmission of the node is cut off only when the danger level is high (the danger level can be set flexibly through the parameters), so the device adapts very well.
  • the present invention allows network censors to conveniently look up various types of information, so it can be widely used in the fields of public opinion, network monitoring, and big data analysis.
  • the method of the present invention is not limited to being performed in the chronological order described in the specification, but may also be performed in other chronological order, in parallel, or independently. Therefore, the execution order of the methods described in this specification does not limit the technical scope of the present invention.

Abstract

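A big data artificial intelligence analysis device comprising a data acquisition unit, a data review unit, a storage unit, and a cloud server. The data acquisition unit captures online the data packets of a node on the network and stores them in the storage unit. The data review unit converts the format of the data packets captured by the data acquisition unit, reviews the formatted data packets, and saves the formatted data packets and the review results to the storage unit. The storage unit stores the data packets captured by the data acquisition unit, the formatted data packets and review results produced by the data review unit, and the review set used by the data review unit. The cloud server provides the review set used by the data review unit; the review set includes a sensitive word set for reviewing text and a sensitive image set for reviewing images and video. The device is simple to deploy and highly stable.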

Description

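Big data artificial intelligence analysis device
Technical Field
The present invention relates to a big data artificial intelligence analysis device.
Background
With the rapid development of science and technology and of society and the economy, the era of big data has arrived. The sources of big data's value are increasingly comprehensive: data from different data dimensions, time dimensions, and spatial dimensions, as well as all kinds of cross-domain data, converge. The value of big data lies in providing customers with precise, personalized services. In addition, big data has great potential value in artificial intelligence areas such as language, vision, and prediction.
Artificial intelligence (AI) is a new technical science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems.
The development of big data and artificial intelligence has brought sweeping social change, and with it new problems and challenges. In today's era of data explosion, anyone can publish statements online in the form of text, pictures, audio, or video, and these can be published openly without any review of the data. As a result, the disturbing Elsagate children's videos circulated widely on the major domestic video sites for as long as two years (2016 to early 2018) without being detected, with a severely negative effect on children. Data review has therefore inevitably become a constraint on network development. Existing data review generally relies on manual checking. For example, whether a video that a user uploads to a forum is illegal is generally established in one of two ways: through reports from other users, or by back-end administrators watching each video after upload and before publication. The latter is clearly impractical, while the former lets large numbers of illegal videos spread widely before they are reported.
Therefore, how to add data review to the intelligent analysis of big data is one of the technical problems that current network development urgently needs to solve.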
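Summary of the Invention
A brief summary of embodiments of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.
According to one aspect of the present application, a big data artificial intelligence analysis device is provided, comprising:
a number of data acquisition units, a number of data review units, a plurality of storage units, and a cloud server, wherein the data acquisition units and the data review units each establish a communication connection with a storage unit, and the data acquisition units, data review units, and storage units each establish a communication connection with the cloud server; wherein
the data acquisition unit is used to capture online the data packets of a node on the network and store them in the storage unit;
the data review unit is used to convert the format of the data packets captured by the data acquisition unit, to review the formatted data packets, and to save the formatted data packets and the review results to the storage unit;
the storage unit is used to store the data packets captured by the data acquisition unit, the formatted data packets and review results produced by the data review unit, and the review set used by the data review unit;
the cloud server is used to provide the review set used by the data review unit; the review set includes a sensitive word set for reviewing text and a sensitive image set for reviewing images and video;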
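The analysis device performs the following processes:
Process 1: The cloud server updates the review set into the storage unit in advance;
Process 2: At a preset interval, the data acquisition unit captures online the set of data packets of a node on the network over a preset period and stores it in the storage unit;
Process 3: The data review unit reads the set of data packets awaiting review from the storage unit, extracts one or more data packets, converts their format, and then reviews the formatted data packets against the review set stored in the storage unit. If a sensitive word or sensitive image is found, the data packet is marked as failing the review. During this process the formatted data packets and the review results are saved to the storage unit, and the next data packet is extracted for review until the whole set has been reviewed;
Process 4: The sensitive words or sensitive images found in the data packets marked as failing the review are analyzed, and the sensitive words and sensitive images are combined to determine the danger level according to preset rules. If the danger level is high, the transmission of the data packet is interrupted, or the data transmission of the node may be interrupted.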
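Further, in Process 4, the danger level is determined as follows. The weights of the sensitive words found are denoted {Ai}, where i is the number of sensitive words; for example, if 3 sensitive words are found, their weights are A1, A2, and A3. The weight of each sensitive word is known (stored in the database in advance; all weights are greater than 0 and less than 1). The weights of the sensitive images found are denoted {Bj}, where j is the number of sensitive images; for example, if 5 sensitive images are found, their weights are B1, B2, B3, B4, and B5. The comprehensive weight S of the sensitive words and sensitive images is calculated with the following formula:
[Corrected under Rule 91 24.08.2018]
Figure WO-DOC-FIGURE-2
where A is the mean of the sensitive word weights, A = ΣAi/i; B is the mean of the sensitive image weights, B = ΣBj/j; and μ is a weight coefficient with a value range of 0.8-1.2. The danger level is determined from the value of the comprehensive weight S.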
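Further, in Process 3, the data review unit converts the format of a data packet as follows: a 64-bit header is added to the front of the data packet. The first bit of the header is the identification bit: 0 means the review failed and 1 means it passed. The second bit is the classification bit that identifies the review result. When the review fails, the data packet is classified according to the category of the review set that was matched. For example, if the review set contains information such as pro-Japanese, anti-China content, the mark is 1, indicating that improper speech was found; if the review set involves confidential information such as military intelligence, the mark is 2, indicating that confidential information was found; if the review set involves dangerous information such as explosions, terrorism, or suicide, the mark is 3, indicating that a threat to public order was found. Users can of course extend or modify this scheme for other data marks. The third to eighth bits (2-7) are reserved. The ninth to sixty-fourth bits (8-63) are the packet sequence number, which determines the position of the data packet within the whole data (used in transmission and decoding); bits 8-31 (24 bits in total) are the upper 24 bits of the packet sequence number and bits 32-63 are the lower 32 bits. The format of the formatted data packet after format conversion is as follows:
Figure PCTCN2018095415-appb-000001
The analysis device of the present invention does not disturb the normal transmission of data packets. It only obtains data packets from the node to be monitored and formats them. The formatted data packet also carries the packet sequence number, so it can work effectively with the data review unit: when a data packet fails the review, the system uses the packet sequence number to track down and obtain the complete data.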
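Further, in Process 3, the data review unit reviews the formatted data packets against the review set stored in the storage unit, specifically as follows:
The type of the formatted data packet is determined. If it is a text data packet, the packet content with the header removed is compared directly with the sensitive words. The comparison proceeds as follows: the packet content comprises N characters, which are matched against the sensitive words in the sensitive word set; the matching algorithm can be an existing, relatively mature algorithm such as the KMP algorithm or the LCA algorithm. If a sensitive word is matched, the formatted data packet is considered to have failed the review and the first identification bit of the header is set to 0 (review failed); at the same time, the data packet is classified according to the preset category of the matched sensitive word, and the result is written into the second identification bit of the header;
If it is image data, the original image is obtained according to the image encoding and then compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review;
If it is video data, the data frame structure is obtained from the captured data packet set, the data is split into frames according to that structure, and the valid frames are extracted. FFmpeg, DirectShow, or MCI is then called to capture images frame by frame, and each image is compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review.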
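The text-matching step described above can be sketched with a plain KMP search. This is only an illustration under stated assumptions: the function names, the dictionary that maps each sensitive word to a category code, and the returned `(pass_flag, category)` pair are hypothetical and do not come from the publication, which names KMP as one possible algorithm without giving an implementation.

```python
def build_failure_table(pattern: str) -> list[int]:
    """For each prefix of pattern, length of its longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def review_text(content: str, word_categories: dict[str, int]) -> tuple[int, int]:
    """Hypothetical review step: return (pass_flag, category).

    pass_flag follows the header convention in the text: 0 = failed, 1 = passed.
    word_categories (assumed structure) maps each sensitive word to its category code.
    """
    for word, category in word_categories.items():
        if kmp_find(content, word) != -1:
            return 0, category
    return 1, 0
```

With many sensitive words, a multi-pattern matcher would avoid rescanning the content once per word; the per-word loop here is kept only because it mirrors the description most directly.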
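Compared with the prior art, the analysis device of the present invention is improved in two respects:
1. The present invention does not disturb the normal transmission of data packets. It only obtains data packets from the node to be monitored and formats them. The formatted data packet also carries the packet sequence number, so it can work effectively with the data review unit: when a data packet fails the review, the system uses the packet sequence number to track down and obtain the complete data, which after decoding can be reviewed further, so that statements published on the network in real time as text, pictures, video, and other forms can be reviewed;
2. In addition, existing review targets either text only or images and video only, with no review that combines the two, which makes existing review overly strict so that most text, video, and images cannot be uploaded. Through the data review unit described above, the present application achieves a combined review of text, images, and video, and a formula for the comprehensive weight of sensitive words and sensitive images was obtained through experimental deduction and practical testing. When the comprehensive weight is greater than a preset value, the content is considered to have a high danger level and its transmission is terminated directly. This avoids the behavior of prior-art review algorithms, which mark content as abnormal and block its transmission as soon as any sensitive word or sensitive image is retrieved.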
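Detailed Description
Embodiments of the present invention are described below.
The present invention provides a big data artificial intelligence analysis device comprising a number of data acquisition units, a number of data review units, a plurality of storage units, and a cloud server, wherein the data acquisition units and the data review units each establish a communication connection with a storage unit. When data is captured, the network is divided into tens of thousands of nodes. The nodes are first divided into regions according to their physical location; each region has one data acquisition unit, one data review unit, and one storage unit, and the data acquisition units, data review units, and storage units of all regions establish communication connections with the cloud server.
The data acquisition unit is used to capture online the data packets of a node on the network and store them in the storage unit;
the data review unit is used to convert the format of the data packets captured by the data acquisition unit, to review the formatted data packets, and to save the formatted data packets and the review results to the storage unit;
the storage unit is used to store the data packets captured by the data acquisition unit, the formatted data packets and review results produced by the data review unit, and the review set used by the data review unit;
the cloud server is used to provide the review set used by the data review unit; the review set includes a sensitive word set for reviewing text and a sensitive image set for reviewing images and video.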
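Specifically, the analysis device performs the following processes:
Process 1: The cloud server updates the review set into the storage unit in advance;
Process 2: At a preset interval, the data acquisition unit captures online the set of data packets of a node on the network over a preset period and stores it in the storage unit; for example, once every 1 minute it captures the node's data packets continuously for 10 seconds to form the data packet set;
Process 3: The data review unit reads the set of data packets awaiting review from the storage unit, extracts one or more data packets, converts their format, and then reviews the formatted data packets against the review set stored in the storage unit. If a sensitive word or sensitive image is found, the data packet is marked as failing the review. During this process the formatted data packets and the review results are saved to the storage unit, and the next data packet is extracted for review until the whole set has been reviewed;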
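The data review unit converts the format of a data packet as follows: a 64-bit header is added to the front of the data packet. The first bit of the header is the identification bit: 0 means the review failed and 1 means it passed. The second bit is the classification bit that identifies the review result. When the review fails, the data packet is classified according to the category of the review set that was matched. For example, if the review set contains information such as pro-Japanese, anti-China content, the mark is 1, indicating that improper speech was found; if the review set involves confidential information such as military intelligence, the mark is 2, indicating that confidential information was found; if the review set involves dangerous information such as explosions, terrorism, or suicide, the mark is 3, indicating that a threat to public order was found. Users can of course extend or modify this scheme for other data marks. The third to eighth bits (2-7) are reserved. The ninth to sixty-fourth bits (8-63) are the packet sequence number, which determines the position of the data packet within the whole data (used in transmission and decoding); bits 8-31 (24 bits in total) are the upper 24 bits of the packet sequence number and bits 32-63 are the lower 32 bits. The format of the formatted data packet after format conversion is as follows:
Figure PCTCN2018095415-appb-000002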
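The 64-bit header layout described above can be sketched as follows. Two points are assumptions of this sketch rather than statements of the publication: bit 0 is treated as the most significant bit (consistent with bits 8-31 holding the upper part of the sequence number), and, because a single classification bit cannot hold the category codes 1 to 3 given in the text, the category code is stored across bits 1-7 (the classification bit together with the reserved bits). All function names are hypothetical.

```python
import struct

SEQ_BITS = 56                    # bits 8-63 carry the packet sequence number
SEQ_MASK = (1 << SEQ_BITS) - 1
CATEGORY_MASK = 0x7F             # assumption: category code occupies bits 1-7


def pack_header(passed: bool, category: int, seq: int) -> bytes:
    """Build the 8-byte header: pass flag (bit 0), category (bits 1-7), sequence (bits 8-63)."""
    if not 0 <= category <= CATEGORY_MASK:
        raise ValueError("category must fit in 7 bits")
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("sequence number must fit in 56 bits")
    word = (int(passed) << 63) | (category << SEQ_BITS) | seq
    return struct.pack(">Q", word)  # big-endian, so bit 0 is the first bit on the wire


def unpack_header(header: bytes) -> tuple[bool, int, int]:
    """Return (passed, category, sequence number) from the first 8 bytes."""
    (word,) = struct.unpack(">Q", header[:8])
    return bool(word >> 63), (word >> SEQ_BITS) & CATEGORY_MASK, word & SEQ_MASK


def format_packet(payload: bytes, passed: bool, category: int, seq: int) -> bytes:
    """Prepend the header to the captured payload, leaving the payload unchanged."""
    return pack_header(passed, category, seq) + payload
```

For example, `format_packet(b"abc", True, 0, 5)` yields the 8 header bytes followed by `b"abc"`, and `unpack_header` recovers the three fields from the front of that result.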
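The data review unit reviews the formatted data packets against the review set stored in the storage unit, specifically as follows:
The type of the formatted data packet is determined. If it is a text data packet, the packet content with the header removed is compared directly with the sensitive words. The comparison proceeds as follows: the packet content comprises N characters, which are matched against the sensitive words in the sensitive word set; the matching algorithm can be an existing, relatively mature algorithm such as the KMP algorithm or the LCA algorithm. If a sensitive word is matched, the formatted data packet is considered to have failed the review and the first identification bit of the header is set to 0 (review failed); at the same time, the data packet is classified according to the preset category of the matched sensitive word, and the result is written into the second identification bit of the header;
If it is image data, the original image is obtained according to the image encoding and then compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review;
If it is video data, the data frame structure is obtained from the captured data packet set, the data is split into frames according to that structure, and the valid frames are extracted. FFmpeg, DirectShow, or MCI is then called to capture images frame by frame, and each image is compared with the comparison pictures in the sensitive image set. If the similarity is higher than a preset value (for example, a similarity greater than 80% is considered similar), a comparison picture is considered matched and the formatted data packet fails the review.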
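Taking frame capture with FFmpeg under Linux as an example:
First, enter the following command to capture 10 seconds of video from the video4linux2 video device, at QCIF resolution (176×144) and 8 frames per second:
./ffmpeg -t 10 -f video4linux2 -s 176x144 -r 8 -i /dev/video0 -vcodec h263 -f rtp rtp://192.168.1.103:5060 > /tmp/ffmpeg.sdp
/dev/video0 is the video device, H263 is the video encoding, the output format is RTP, and 192.168.1.103:5060 is the defined IP address and port; finally, the SDP file corresponding to the stream is redirected to /tmp/ffmpeg.sdp;
Then enter the following command to split the video into a sequence of pictures:
ffmpeg -i /tmp/ffmpeg.sdp image%d.jpg
This finally yields image1.jpg, image2.jpg, and so on.
The next step is to compare image1.jpg, image2.jpg, and so on one by one against the sensitive image set. The image comparison techniques (including algorithms) used in this application are all existing technology and are not described further here.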
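The publication leaves the image comparison to existing techniques and does not name an algorithm. As one possible illustration only, the sketch below uses an average hash (a technique chosen here, not taken from the source) over images that are assumed to be already decoded and reduced to equally sized grayscale grids, then applies the 80% similarity threshold mentioned in the text. All function names are hypothetical.

```python
def average_hash(pixels: list[list[int]]) -> list[int]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def similarity(a: list[list[int]], b: list[list[int]]) -> float:
    """Fraction of hash bits on which two equally sized grayscale images agree."""
    ha, hb = average_hash(a), average_hash(b)
    if len(ha) != len(hb):
        raise ValueError("images must have the same dimensions")
    same = sum(1 for x, y in zip(ha, hb) if x == y)
    return same / len(ha)


def frame_fails_review(frame: list[list[int]],
                       sensitive_images: list[list[list[int]]],
                       threshold: float = 0.8) -> bool:
    """True if the frame is more than `threshold` similar to any comparison picture."""
    return any(similarity(frame, ref) > threshold for ref in sensitive_images)
```

In a pipeline like the one described, each extracted frame (image1.jpg, image2.jpg, and so on) would be decoded and downscaled before being passed to `frame_fails_review`; that decoding step is outside this sketch.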
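Process 4: The sensitive words or sensitive images found in the data packets marked as failing the review are analyzed, and the sensitive words and sensitive images are combined to determine the danger level according to preset rules. If the danger level is high, the transmission of the data packet is interrupted, or the data transmission of the node may be interrupted. An event statistics report and a situation analysis can also be output for analysts; this is not elaborated here.
The danger level is determined as follows. The weights of the sensitive words found are denoted {Ai}, where i is the number of sensitive words; for example, if 3 sensitive words are found, their weights are A1, A2, and A3. The weight of each sensitive word is known (stored in the database in advance; all weights are greater than 0 and less than 1). The weights of the sensitive images found are denoted {Bj}, where j is the number of sensitive images; for example, if 5 sensitive images are found, their weights are B1, B2, B3, B4, and B5. The comprehensive weight S of the sensitive words and sensitive images is calculated with the following formula:
[Corrected under Rule 91 24.08.2018]
Figure WO-DOC-FIGURE-3
where A is the mean of the sensitive word weights, A = ΣAi/i; B is the mean of the sensitive image weights, B = ΣBj/j; and μ is a weight coefficient whose optimal value range, obtained through extensive experiments, is 0.8-1.2.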
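The danger level is determined from the value of the comprehensive weight S. For example, if S > 0.45 the danger level is considered high; if 0.2 ≤ S ≤ 0.45 the danger level is considered intermediate; and if S < 0.2 the danger level is considered low.
For example, when the mean sensitive word weight is A = 0.5 and the mean sensitive image weight is B = 0.7, the comprehensive weight is S = 0.168 + μ·0.464; any value of μ in the 0.8-1.2 interval then gives S > 0.45, so the danger level is considered high. When A = 0.5 and B = 0.2, S = 0.431 + μ·0.028; any value of μ in the 0.8-1.2 interval gives S > 0.45, so the danger level is considered high. When A = 0.4 and B = 0.2, S = 0.32 + μ·0.004; any value of μ in the 0.8-1.2 interval gives 0.2 ≤ S ≤ 0.45, so the danger level is considered intermediate. Reviewers can decide on further action based on the comprehensive weight and the danger level.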
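The thresholds and worked examples above can be checked with a short sketch. The formula for S is given only as a figure in the publication, so this sketch does not reconstruct it: it takes the constant term and the μ coefficient printed in each example as inputs. The function and parameter names are hypothetical.

```python
def mean_weight(weights: list[float]) -> float:
    """Mean of the found weights, i.e. A = ΣAi/i or B = ΣBj/j in the text."""
    return sum(weights) / len(weights)


def comprehensive_weight(constant: float, mu_coefficient: float, mu: float) -> float:
    """S in the linear-in-μ form used by the worked examples: S = constant + μ·coefficient.

    The constant and coefficient are taken from the examples as printed;
    how they follow from A and B is defined by the figure formula, not here.
    """
    if not 0.8 <= mu <= 1.2:
        raise ValueError("mu is expected to lie in the 0.8-1.2 range")
    return constant + mu * mu_coefficient


def danger_level(s: float, high: float = 0.45, low: float = 0.2) -> str:
    """Map S to a level using the example thresholds from the text."""
    if s > high:
        return "high"
    if s >= low:
        return "intermediate"
    return "low"


# The three worked examples, evaluated at both ends of the μ range.
examples = [(0.168, 0.464), (0.431, 0.028), (0.32, 0.004)]
for constant, coefficient in examples:
    levels = {danger_level(comprehensive_weight(constant, coefficient, mu))
              for mu in (0.8, 1.2)}
    print(constant, coefficient, levels)
```

Because S is linear in μ within each example, checking the two endpoints of the 0.8-1.2 range is enough to confirm that the stated level holds over the whole range.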
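The analysis device of the present invention is deployed at a network node in bypass mode. It requires no change to the network topology, does not interfere with users' normal business habits, and does not disturb the normal transmission of data packets, so it has the advantages of simple deployment and high stability. The invention only obtains data packets from the node to be monitored and formats them. The formatted data packet also carries the packet sequence number, so it can work effectively with the data review unit: when a data packet fails the review, the system uses the packet sequence number to track down and obtain the complete data and decodes it. In addition, the invention determines the comprehensive weight from the sensitive words and sensitive images and determines the danger level from the comprehensive weight; the transmission of the node is cut off only when the danger level is high (the danger level can be set flexibly through the parameters), so the device adapts very well. Furthermore, because the invention groups the various sensitive words into categories, network reviewers can conveniently look up each kind of information, so the invention can be widely applied in fields such as public opinion monitoring, network supervision, and big data analysis.
It should be emphasized that the term "comprise/include" as used herein refers to the presence of features, elements, steps, or components, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the method of the present invention is not limited to being performed in the chronological order described in the specification; it may also be performed in another chronological order, in parallel, or independently. The order of execution of the methods described in this specification therefore does not limit the technical scope of the invention.
Although the invention has been disclosed above through the description of specific embodiments, it should be understood that all of the embodiments and examples above are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements, or equivalents of the invention within the spirit and scope of the appended claims. Such modifications, improvements, or equivalents should also be considered to fall within the scope of protection of the invention.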

Claims (8)

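  1. A big data artificial intelligence analysis device, characterized in that it comprises:
    a number of data acquisition units, a number of data review units, a plurality of storage units, and a cloud server, wherein the data acquisition units and the data review units each establish a communication connection with a storage unit, and the data acquisition units, data review units, and storage units each establish a communication connection with the cloud server; wherein
    the data acquisition unit is used to capture online the data packets of a node on the network and store them in the storage unit;
    the data review unit is used to convert the format of the data packets captured by the data acquisition unit, to review the formatted data packets, and to save the formatted data packets and the review results to the storage unit;
    the storage unit is used to store the data packets captured by the data acquisition unit, the formatted data packets and review results produced by the data review unit, and the review set used by the data review unit;
    the cloud server is used to provide the review set used by the data review unit, the review set including a sensitive word set for reviewing text and a sensitive image set for reviewing images and video;
    the analysis device performs the following processes:
    Process 1: the cloud server updates the review set into the storage unit in advance;
    Process 2: at a preset interval, the data acquisition unit captures online the set of data packets of a node on the network over a preset period and stores it in the storage unit;
    Process 3: the data review unit reads the set of data packets awaiting review from the storage unit, extracts one or more data packets, converts their format, and then reviews the formatted data packets against the review set stored in the storage unit; if a sensitive word or sensitive image is found, the data packet is marked as failing the review; during this process the formatted data packets and the review results are saved to the storage unit, and the next data packet is extracted for review until the whole set has been reviewed;
    Process 4: the sensitive words or sensitive images found in the data packets marked as failing the review are analyzed, and the sensitive words and sensitive images are combined to determine the danger level according to preset rules; if the danger level is high, the transmission of the data packet is interrupted.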
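  2. The big data artificial intelligence analysis device according to claim 1, characterized in that in Process 3 the data review unit converts the format of a data packet as follows:
    a 64-bit header is added to the front of the data packet, and the formatted data packet after conversion is as follows:
    Figure PCTCN2018095415-appb-100001
    the first bit of the header is the identification bit, indicating whether the review failed or passed; the second bit is the classification bit that identifies the review result; the third to eighth bits are reserved; the ninth to sixty-fourth bits are the packet sequence number.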
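  3. The big data artificial intelligence analysis device according to claim 1, characterized in that:
    in Process 3, the data review unit reviews the formatted data packets against the review set stored in the storage unit, specifically as follows:
    the type of the formatted data packet is determined; if it is a text data packet, the packet content with the header removed is compared directly with the sensitive words, and if a sensitive word is matched, the formatted data packet is considered to have failed the review;
    if it is image data, the original image is obtained from the packet content according to the image encoding and then compared with the comparison pictures in the sensitive image set; if the similarity is higher than a preset value, a comparison picture is considered matched and the formatted data packet is considered to have failed the review;
    if it is video data, the data frame structure is obtained from the captured data packet set, the data is split into frames according to that structure, the valid frames are extracted, images are then captured frame by frame, and each is compared with the comparison pictures in the sensitive image set; if the similarity is higher than a preset value, a comparison picture is considered matched and the formatted data packet is considered to have failed the review.
  4. The big data artificial intelligence analysis device according to claim 3, characterized in that the packet content with the header removed is compared directly with the sensitive words as follows: the packet content comprises N characters, which are matched against the sensitive words in the sensitive word set; if a sensitive word is matched, the formatted data packet is considered to have failed the review and the first identification bit of the header is set to 0 (review failed); at the same time, the data packet is classified according to the preset category of the matched sensitive word, and the result is written into the second identification bit of the header.
  5. The big data artificial intelligence analysis device according to claim 3, characterized in that capturing images frame by frame specifically means calling FFmpeg, DirectShow, or MCI to capture images frame by frame.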
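  6. [Corrected under Rule 91 24.08.2018]
    The big data artificial intelligence analysis device according to any one of claims 1-5, characterized in that in Process 4 the danger level is determined as follows: the weights of the sensitive words found are denoted {Ai}, where i is the number of sensitive words and 0 < Ai < 1; the weights of the sensitive images found are denoted {Bj}, where j is the number of sensitive images and 0 < Bj < 1; the comprehensive weight S of the sensitive words and sensitive images is calculated with the following formula:
    Figure WO-DOC-FIGURE-1
    where A is the mean of the sensitive word weights, A = ΣAi/i; B is the mean of the sensitive image weights, B = ΣBj/j; and μ is a weight coefficient; the danger level is determined from the interval in which the comprehensive weight S falls.
  7. The big data artificial intelligence analysis device according to claim 6, characterized in that the value range of μ is 0.8-1.2.
  8. The big data artificial intelligence analysis device according to claim 6, characterized in that when the value of S is greater than 0.45, the danger level is considered high.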
PCT/CN2018/095415 2018-07-12 2018-07-12 Big data artificial intelligence analysis device WO2020010568A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/095415 WO2020010568A1 (zh) 2018-07-12 2018-07-12 Big data artificial intelligence analysis device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/095415 WO2020010568A1 (zh) 2018-07-12 2018-07-12 Big data artificial intelligence analysis device

Publications (1)

Publication Number Publication Date
WO2020010568A1 true WO2020010568A1 (zh) 2020-01-16

Family

ID=69142906

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095415 WO2020010568A1 (zh) 2018-07-12 2018-07-12 Big data artificial intelligence analysis device

Country Status (1)

Country Link
WO (1) WO2020010568A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
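CN113435842A (zh) * 2021-06-28 2021-09-24 京东科技控股股份有限公司 Business process processing method and computer device
CN114339292A (zh) * 2021-12-31 2022-04-12 安徽听见科技有限公司 Method, apparatus, storage medium, and device for review and intervention of live streams
CN114936374A (zh) * 2022-05-20 2022-08-23 合肥亚慕信息科技有限公司 Data security protection method based on artificial intelligence algorithms
CN116861198A (zh) * 2023-09-01 2023-10-10 北京瑞莱智慧科技有限公司 Data processing method, apparatus, and storage medium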

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
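CN1761203A (zh) * 2005-11-03 2006-04-19 上海交通大学 Comprehensive analysis and monitoring system for online information security
CN101035281A (zh) * 2007-04-19 2007-09-12 鲍东山 Graded content review system
CN102547794A (zh) * 2012-01-12 2012-07-04 郑州金惠计算机系统工程有限公司 Identification and supervision platform for pornographic images, videos, and harmful content in WAP mobile media
CN103312770A (zh) * 2013-04-19 2013-09-18 无锡成电科大科技发展有限公司 Method for reviewing cloud platform resources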

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926364

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18926364

Country of ref document: EP

Kind code of ref document: A1