CN114896522A

CN114896522A - Multi-platform information epidemic risk assessment method and device

Info

Publication number: CN114896522A
Application number: CN202210382759.5A
Authority: CN
Inventors: 吴俊杰; 殷博文; 杜文宇; 何熙; 杨智尧
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2022-04-14
Filing date: 2022-04-14
Publication date: 2022-08-12
Anticipated expiration: 2042-04-14
Also published as: CN114896522B

Abstract

The invention discloses a multi-platform information epidemic risk assessment method comprising the following steps. Step 1: collect stream data from each platform. Step 2: for each stream data item, extract the list of domain names, obtain the domain name redirection history, and match it against a domain reliability corpus to obtain a reliability label for the item. Step 3: for each item, parse the user-defined location to obtain geographic information. Step 4: group the stream data along two dimensions, geographic information and time. Step 5: for each group, quantify a static information epidemic risk index value based on the posters' follower counts and the reliability labels. Step 6: for each group, quantify a dynamic information epidemic risk index value based on the numbers of likes, reposts and comments and the reliability labels. The invention also provides an assessment device. By constructing the static and dynamic information epidemic risk indices, the invention reflects both the upper limit of information epidemic risk and the current degree of the information epidemic.

Description

Multi-platform information epidemic risk assessment method and device

Technical Field

The present invention relates to the technical field of data mining. More specifically, it relates to a multi-platform information epidemic risk assessment method and device.

Background

The sudden outbreak of COVID-19 has had a huge impact on the world. The impact is not confined to physical space; it has spread to cyberspace, giving rise to an "information epidemic" characterized by the proliferation of fake news. The information epidemic not only seriously interferes with epidemic prevention and control but also threatens regional security and stability. Assessing the information epidemic risk of each region in real time would greatly help epidemic prevention and control both online and offline. There is therefore an urgent need for an effective information epidemic risk assessment method and device.

Summary of the Invention

One object of the present invention is to provide a multi-platform information epidemic risk assessment method and device that, by constructing a static information epidemic risk index and a dynamic information epidemic risk index, reflect the upper limit of information epidemic risk and the degree of the information epidemic.

To achieve these objects and other advantages, according to one aspect of the present invention, a multi-platform information epidemic risk assessment method is provided, comprising: Step 1: collecting stream data from each platform; Step 2: for each stream data item, extracting the list of domain names, obtaining the domain name redirection history, and matching it against a domain reliability corpus to obtain a reliability label for the item; Step 3: for each item, parsing the user-defined location to obtain geographic information; Step 4: grouping the stream data along two dimensions, geographic information and time; Step 5: for each group, quantifying, based on the posters' follower counts and the reliability labels, a static information epidemic risk index value that reflects the static information epidemic risk of a region within a given time period; Step 6: for each group, quantifying, based on the numbers of likes, reposts and comments and the reliability labels, a dynamic information epidemic risk index value that reflects the dynamic information epidemic risk of a region within a given time period.

Further, the method also comprises: collecting domain reliability annotation data from multiple sources; constructing multiple classes of reliability labels, assigning each an unreliability score, and mapping the annotation data from each source onto the reliability labels; and merging the annotation data from all sources according to the reliability labels to form the domain reliability corpus, wherein for a domain annotated with multiple reliability labels, the label with the lowest unreliability score is used as its reliability label.

Further, the static information epidemic risk index value staticIRI_{c,d} is calculated using Formula 1, in which c denotes a region, d denotes a time period, T_{c,d} denotes all stream data items with reliability labels in region c during period d, fans_i denotes the number of followers of the poster of item i, and r_i denotes the unreliability score of item i.

Further, the dynamic information epidemic risk index value dynamicIRI_{c,d} is calculated using Formula 2, in which c denotes a region, d denotes a time period, T_{c,d} denotes all stream data items with reliability labels in region c during period d, like_i denotes the number of likes of item i, and r_i denotes the unreliability score of item i.

Further, domain names are extracted with regular expressions, and the domain name redirection history is obtained with an asynchronous crawler.

Further, the user-defined location is parsed with a geocoding service offered by a geographic service provider to obtain the geographic information.

Further, the granularity of the time grouping is day, week or month, and the granularity of the geographic grouping is city, province or country.

Further, the method also comprises: obtaining the static and dynamic information epidemic risk index values for each time period in a historical time span, and training a neural network prediction model with the static index values as input and the dynamic index values as output; inputting the static index value of the current time period into the neural network prediction model to output a predicted dynamic index value; and comparing the predicted value with the dynamic index value of the current time period, wherein if the error exceeds a first predetermined range, the dynamic index value of the current time period is marked, and if the error exceeds a second predetermined range, the stream data of the current time period is collected again.

According to another aspect of the present invention, a multi-platform information epidemic risk assessment device is also provided, comprising: a collection module for collecting stream data from each platform; a label assignment module for extracting, for each stream data item, the list of domain names, obtaining the domain name redirection history, matching it against the domain reliability corpus and assigning a reliability label; a geographic information parsing module for parsing, for each item, the user-defined location to obtain geographic information; a grouping module for grouping the stream data along two dimensions, geographic information and time; a static information epidemic risk index value calculation module for quantifying, for each group and based on the posters' follower counts and the reliability labels, a static index value that reflects the static information epidemic risk of a region within a given time period; and a dynamic information epidemic risk index value calculation module for quantifying, for each group and based on the numbers of likes, reposts and comments and the reliability labels, a dynamic index value that reflects the dynamic information epidemic risk of a region within a given time period.

Further, the static information epidemic risk index value calculation module calculates the static index value staticIRI_{c,d} using Formula 1, and the dynamic information epidemic risk index value calculation module calculates the dynamic index value dynamicIRI_{c,d} using Formula 2. In these formulas, c denotes a region, d denotes a time period, T_{c,d} denotes all stream data items with reliability labels in region c during period d, like_i denotes the number of likes of item i, fans_i denotes the number of followers of the poster of item i, and r_i denotes the unreliability score of item i.

The present invention has at least the following beneficial effects:

The invention takes the commonalities of multiple platforms into account and constructs a static information epidemic risk index and a dynamic information epidemic risk index to assess the information epidemic risk of each platform comprehensively. The static index reflects the upper limit of information epidemic risk, and the dynamic index assesses the current degree of the information epidemic on each platform. Compared with traditional calculation methods that rely on simple matching, the redirection-based information epidemic risk calculation method of the invention yields more accurate and more comprehensive results.

Other advantages, objects and features of the present invention will in part be apparent from the description that follows, and in part be understood by those skilled in the art through study and practice of the invention.

Description of the Drawings

FIG. 1 is a framework diagram of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawing, so that those skilled in the art can implement it by following the description.

It should be understood that terms such as "having", "containing" and "including" as used herein do not exclude the presence or addition of one or more other elements or combinations thereof.

As shown in FIG. 1, an embodiment of the present application provides a multi-platform information epidemic risk assessment method, comprising: Step 1: collecting stream data from each platform, where the platforms include Zhihu, Weibo, Twitter, Facebook, Telegram and the like, and each stream data item includes at least its text content, publication time, user-defined location, the poster's follower count and the number of likes;

Optionally, stream data is collected daily from social platforms such as Weibo, Facebook, Twitter and Telegram using keywords such as "COVID-19" and "Virus". The server uses the pull-push scheme of the ZMQ transport layer to distribute each day's stream data evenly across the server's ports. After receiving the stream data, the server stores the data of each channel in its corresponding table, that is, each channel has its own stream data storage table. The database may be Elasticsearch, a distributed storage database, which enables fast processing and storage of large-scale data. The data from every platform includes at least the fields mid, text, time, user_location, user_fans and like_num: mid is the id of the stream data item, text is its text content, time is its publication time, user_location is the poster's user-defined location, user_fans is the poster's follower count, and like_num is the number of likes the item has received;

Step 2: for each stream data item, extracting the list of domain names, obtaining the domain name redirection history, and matching it against the domain reliability corpus to obtain the reliability label of the item. Specifically, the list of domain names contained in the text content is extracted; the redirection history of those domain names is then obtained; finally, the redirection history of the item is matched against the domain reliability corpus to obtain the item's reliability label;

Step 3: for each stream data item, parsing the user-defined location to obtain geographic information;

Step 4: grouping the stream data along two dimensions, geographic information and time; in the database, all data is grouped by time and location so that the time and location fields are identical within each group;

Step 5: for each group of stream data, quantifying, based on one or more of the posters' follower counts and the reliability labels, a static information epidemic risk index value (risk degree) that reflects the static information epidemic risk of a region within a given time period;

Step 6: for each group of stream data, quantifying, based on one or more of the numbers of likes, reposts and comments and the reliability labels, a dynamic information epidemic risk index value (risk degree) that reflects the dynamic information epidemic risk of a region within a given time period;
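The common fields named under Step 1 (mid, text, time, user_location, user_fans, like_num) suggest a normalization stage ahead of Steps 2 to 6. The following is a minimal sketch of such a stage, not the patent's implementation: the platform keys and the raw field names in `FIELD_MAPS` are hypothetical, and only the six target field names come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamItem:
    mid: str            # id of the stream data item
    text: str           # text content
    time: str           # publication time, kept as the platform's string
    user_location: str  # poster's user-defined location
    user_fans: int      # poster's follower count
    like_num: int       # number of likes


# Hypothetical raw field names for two platforms; real API payloads differ.
FIELD_MAPS = {
    "platform_a": {"mid": "id", "text": "content", "time": "created_at",
                   "user_location": "location", "user_fans": "followers",
                   "like_num": "likes"},
    "platform_b": {"mid": "post_id", "text": "body", "time": "posted",
                   "user_location": "profile_place", "user_fans": "fan_count",
                   "like_num": "like_count"},
}


def normalize(platform: str, raw: dict) -> StreamItem:
    """Map one raw record onto the common schema; missing counts become 0."""
    names = FIELD_MAPS[platform]
    return StreamItem(
        mid=str(raw[names["mid"]]),
        text=str(raw.get(names["text"], "")),
        time=str(raw.get(names["time"], "")),
        user_location=str(raw.get(names["user_location"], "")),
        user_fans=int(raw.get(names["user_fans"]) or 0),
        like_num=int(raw.get(names["like_num"]) or 0),
    )
```

Keeping the per-platform differences in one mapping table means the later steps only ever see the shared schema.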

As can be seen, because data formats and content differ greatly across social platforms, this embodiment takes the commonalities of multiple platforms into account and constructs a static information epidemic risk index and a dynamic information epidemic risk index to assess the information epidemic risk of each platform comprehensively. The static index reflects the upper limit of information epidemic risk, and the dynamic index assesses the current degree of the information epidemic on each platform. Compared with traditional calculation methods that rely on simple matching, the redirection-based information epidemic risk calculation method yields more accurate and more comprehensive results.

To address problems in false information detection such as dependence on a domain reliability corpus and false domains hiding behind alias ("sock puppet") domains, a method for integrating domain annotation data from different sources is established and redirection is introduced into domain matching, making false information detection more rigorous and accurate. The index system can not only assess the degree of information epidemic risk but also help explain the mechanisms behind the evolution of the information epidemic.

In other embodiments, the method further comprises: collecting domain reliability annotation data from multiple sources; constructing multiple classes of reliability labels, assigning each an unreliability score, and mapping the annotation data from each source onto the reliability labels; and merging the annotation data from all sources according to the reliability labels to form the domain reliability corpus, wherein for a domain annotated with multiple reliability labels, the label with the lowest unreliability score is used as its reliability label;

Optionally, authoritative domain reliability annotation data widely used in academia is collected first. Such data includes mediabiasfactcheck, decodex and the like, and is mainly formatted as domain-to-reliability-label key-value pairs, such as {"79Days.News": conspiracy theory} and {"lung.ory": science}. These are only examples; the data is not limited to the two sources above, and any domain reliability annotation data of this kind falls within the scope of protection;

A unified set of reliability labels is constructed, comprising eight classes: "science", "mainstream media", "satire", "clickbait", "other", "politics", "false information" and "pseudoscience". Their unreliability scores run from 1 to 8, ordered from least to most unreliable. Because the labels used by different sources of domain reliability annotation data are not necessarily the same, they must be unified before merging. The labels of each source are mapped onto the unified reliability labels; for example, the "conspiracy theory" label in {"79Days.News": conspiracy theory} is mapped onto the "pseudoscience" label;

The domain reliability data from all sources is then merged. For a domain with multiple labels, the label with the lowest reliability is used as its final label; for example, 79Days.News is labeled "false information" in mediabiasfactcheck and "pseudoscience" in decodex, so after merging the domain is labeled "pseudoscience";

Because reliability assessment depends heavily on the domain reliability corpus, this embodiment takes into account the differing characteristics of mainstream websites in each region and, building on current academic research on assessing the reliability of stream data, establishes a method for merging false-domain annotation data from different sources. By merging such data from all sources, the method can capture false information more comprehensively and thus explain the mechanisms behind the evolution of the information epidemic in greater depth.
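The label unification and merge described above can be sketched as follows. This is an illustration under stated assumptions: the entries in `LABEL_MAP`, the fallback to "other" for unknown labels, and the lower-casing of domain keys are hypothetical choices, and the merge rule follows the worked 79Days.News example, in which the least reliable label wins.

```python
# Unified labels in the order given in the text; score 1 is the most
# reliable class and score 8 the least reliable.
UNIFIED_LABELS = [
    "science", "mainstream media", "satire", "clickbait",
    "other", "politics", "false information", "pseudoscience",
]
UNRELIABILITY = {label: score for score, label in enumerate(UNIFIED_LABELS, start=1)}

# Hypothetical mapping from source-specific labels to unified labels.
LABEL_MAP = {"conspiracy theory": "pseudoscience"}


def to_unified(label: str) -> str:
    """Map a source label onto a unified label (assumed fallback: 'other')."""
    label = label.strip().lower()
    label = LABEL_MAP.get(label, label)
    return label if label in UNRELIABILITY else "other"


def merge_sources(*sources: dict) -> dict:
    """Merge domain -> label dicts; on conflict keep the least reliable label."""
    corpus = {}
    for source in sources:
        for domain, label in source.items():
            key = domain.strip().lower()
            unified = to_unified(label)
            current = corpus.get(key)
            if current is None or UNRELIABILITY[unified] > UNRELIABILITY[current]:
                corpus[key] = unified
    return corpus
```

Because the merge only compares scores, the order in which sources are passed does not affect the result.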

In other embodiments, the static information epidemic risk index value staticIRI_{c,d} is calculated using Formula 1, in which c denotes a region, d denotes a time period, T_{c,d} denotes all stream data items with reliability labels in region c during period d, fans_i denotes the number of followers of the poster of item i, r_i denotes the unreliability score of item i, and the remaining term of the formula denotes the number of posts.
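Formula 1 itself is not reproduced in this text. Purely to illustrate how the listed quantities could combine, the sketch below assumes a follower-weighted sum of unreliability scores divided by the number of posts in the group. This form is an assumption made for the example, not the patented formula.

```python
def static_iri(group):
    """Assumed form: sum(fans_i * r_i) / number of posts in T_{c,d}.

    `group` is an iterable of (fans_i, r_i) pairs for one region and
    one time period. An empty group yields 0.0 by convention here.
    """
    items = list(group)
    if not items:
        return 0.0
    return sum(fans * r for fans, r in items) / len(items)
```

Under this assumed form, a group dominated by high-follower accounts posting low-reliability links scores high even if nobody has engaged with the posts yet, which matches the description of the static index as an upper limit on risk.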

In other embodiments, the dynamic information epidemic risk index value dynamicIRI_{c,d} is calculated using Formula 2, in which c denotes a region, d denotes a time period, T_{c,d} denotes all stream data items with reliability labels in region c during period d, like_i denotes the number of likes of item i, r_i denotes the unreliability score of item i, and the remaining term of the formula denotes the number of posts.
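Formula 2 is likewise not reproduced in this text. The sketch below assumes a like-weighted sum of unreliability scores divided by the number of posts; the form and the record keys `like_num` and `r` are assumptions made for the example, not the patented formula.

```python
def dynamic_iri(group):
    """Assumed form: sum(like_i * r_i) / number of posts in T_{c,d}.

    `group` is an iterable of dicts with keys 'like_num' (likes of the
    item) and 'r' (its unreliability score). Empty groups yield 0.0.
    """
    items = list(group)
    if not items:
        return 0.0
    return sum(item["like_num"] * item["r"] for item in items) / len(items)
```

Unlike follower counts, likes are realized engagement, so under this assumed form the value tracks how far unreliable content has actually spread within the period.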

In other embodiments, domain names are extracted with regular expressions, and the domain name redirection history is obtained with an asynchronous crawler. Optionally, for each stream data item, a regular expression matches all domain names in the item's text field. These domain names are then followed with asynchttp to obtain their redirection history. Regular expressions are used again to match the redirection history against the domain reliability corpus, yielding a reliability annotation for each entry in the history. The lowest reliability among these is taken as the reliability annotation of the stream data item and stored in the database as the reliability field. This embodiment goes beyond traditional methods, which assess the reliability of stream data by simple matching alone: it accounts for the possibility that false information hides behind alias domains and uses redirection to recover the real domain, so that information epidemic risk is assessed more comprehensively and accurately. Compared with traditional calculation methods that rely on simple matching, the redirection-based method therefore yields more accurate and more comprehensive results.
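The extraction and matching logic above can be sketched without any network access by passing the redirection history in as data. In this sketch the regular expression is a deliberate simplification, and `redirect_history`, `corpus` and `unreliability` are hypothetical inputs; fetching the history (done with asynchttp in the text) is outside its scope.

```python
import re

# Simplified pattern: optional scheme, then dot-separated labels ending
# in a TLD of two or more letters. It will also match file-like tokens.
DOMAIN_RE = re.compile(r"(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})", re.IGNORECASE)


def extract_domains(text):
    """Return all domain-like strings found in the text, lower-cased."""
    return [match.lower() for match in DOMAIN_RE.findall(text)]


def label_item(text, redirect_history, corpus, unreliability):
    """Return the least reliable label reachable from the item's domains.

    redirect_history: dict mapping a domain to the list of domains seen
    while following its redirects (precomputed elsewhere).
    corpus: dict mapping a domain to its unified reliability label.
    unreliability: dict mapping a label to its unreliability score.
    Returns None when no domain in the chain appears in the corpus.
    """
    worst = None
    for domain in extract_domains(text):
        for hop in [domain, *redirect_history.get(domain, [])]:
            label = corpus.get(hop.lower())
            if label is None:
                continue
            if worst is None or unreliability[label] > unreliability[worst]:
                worst = label
    return worst
```

A short link that resolves to a listed domain is therefore labeled by its destination, which is the case plain string matching on the post text would miss.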

In other embodiments, the user-defined location is parsed with a geocoding service offered by a geographic service provider to obtain the geographic information. Optionally, for each stream data item, the user_location field is extracted, a request is built with requests and sent to an ArcGIS server to obtain structured geographic information for user_location, and the structured geographic information is stored in the database as the location field. For example, if user_location is "Beihang University", ArcGIS returns the geographic information "China, Beijing".
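Many posters share the same free-text location, so one practical refinement is to cache geocoding results. The sketch below is an illustrative addition, not part of the described method: `geocode` is a hypothetical callable standing in for the request to the geocoding service, and the whitespace and case normalization of the cache key is an assumption.

```python
from typing import Callable, Dict, Optional


def make_cached_geocoder(geocode: Callable[[str], Optional[str]]):
    """Wrap a geocoding callable so each distinct location is resolved once."""
    cache: Dict[str, Optional[str]] = {}

    def resolve(user_location: str) -> Optional[str]:
        cleaned = " ".join(user_location.split())  # collapse whitespace
        if not cleaned:
            return None  # empty profile location: nothing to geocode
        key = cleaned.casefold()
        if key not in cache:
            cache[key] = geocode(cleaned)
        return cache[key]

    return resolve
```

Failed lookups (a `None` result) are cached as well, so unresolvable strings such as jokes or emoji in the location field are not retried on every post.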

In other embodiments, the granularity of the time grouping is day, week or month, and the granularity of the geographic grouping is city, province or country. Optionally, in the database, all data is grouped by time and location so that the time and location fields are identical within each group. The granularity of time can be set to day, week or month, and the granularity of location can be set to province or country; the specific granularity can be adjusted to suit the analysis.
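The two-dimensional grouping can be sketched in memory as follows. This is a minimal illustration, assuming each item is a dict whose `time` is a `datetime` and whose `location` has already been reduced to the chosen geographic granularity; using ISO week numbers for the weekly key is also an assumption.

```python
from collections import defaultdict
from datetime import datetime


def period_key(ts: datetime, granularity: str) -> str:
    """Reduce a timestamp to a day, ISO-week or month key."""
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "week":
        year, week, _ = ts.isocalendar()
        return f"{year}-W{week:02d}"
    if granularity == "month":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unsupported granularity: {granularity}")


def group_items(items, granularity="day"):
    """Group items by (location, period); every group shares both values."""
    groups = defaultdict(list)
    for item in items:
        key = (item["location"], period_key(item["time"], granularity))
        groups[key].append(item)
    return dict(groups)
```

Each resulting group corresponds to one (c, d) pair, that is, one set T_{c,d} over which an index value is computed.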

In other embodiments, the method further comprises: obtaining the static and dynamic information epidemic risk index values for each time period in a historical time span, and training a neural network prediction model with the static index values as input and the dynamic index values as output; inputting the static index value of the current time period into the neural network prediction model to output a predicted dynamic index value; and comparing the predicted value with the dynamic index value of the current time period, wherein if the error exceeds a first predetermined range, the dynamic index value of the current time period is marked, and if the error exceeds a second predetermined range, the stream data of the current time period is collected again;

In the above embodiment, over a given period the static index value is strongly correlated with the dynamic index value; however, the static value is little affected by human intervention, whereas the dynamic value is strongly affected and easily manipulated. This embodiment therefore performs an internal consistency check between the two indices. First, an LSTM model is trained on the two indices over historical time periods to build the neural network prediction model. Once the static value for the current period has been calculated, it is fed into the model, which outputs a predicted dynamic value; the prediction is compared with the actual dynamic value calculated by Formula 2, and the error is computed. The error is then compared with the first and second predetermined ranges. The first predetermined range may be 20% to 30%: when the error exceeds it, the actual dynamic value is somewhat too high and is marked so that users can distinguish it. The second predetermined range may be 50% to 60%: when the error exceeds it, the actual dynamic value is far too high and is considered inaccurate, so the data must be collected again and the value recalculated.
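The two-threshold check can be sketched as a small decision function. The text does not define how the error is computed, so the relative error used here is an assumption, as are the default thresholds of 25% and 55%, which are simply picked from inside the stated ranges. The LSTM that produces `predicted` is out of scope.

```python
def check_dynamic_value(predicted, actual, first=0.25, second=0.55):
    """Classify the actual dynamic index value against the model's prediction.

    Returns 'ok', 'flag' (mark the value for users) or 'recollect'
    (discard the value and collect the period's stream data again).
    The error is the assumed relative error |actual - predicted| / |predicted|.
    """
    if predicted == 0:
        raise ValueError("relative error is undefined for a zero prediction")
    error = abs(actual - predicted) / abs(predicted)
    if error > second:
        return "recollect"
    if error > first:
        return "flag"
    return "ok"
```

The larger threshold is tested first so that a value exceeding both ranges is sent for re-collection rather than merely flagged.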

An embodiment of the present application also provides a multi-platform information epidemic risk assessment device, comprising: a collection module for collecting stream data from each platform; a label assignment module for extracting, for each stream data item, the list of domain names, obtaining the domain name redirection history, matching it against the domain reliability corpus and assigning a reliability label; a geographic information parsing module for parsing, for each item, the user-defined location to obtain geographic information; a grouping module for grouping the stream data along two dimensions, geographic information and time; a static information epidemic risk index value calculation module for quantifying, for each group and based on the posters' follower counts and the reliability labels, a static index value that reflects the static information epidemic risk of a region within a given time period; and a dynamic information epidemic risk index value calculation module for quantifying, for each group and based on the numbers of likes, reposts and comments and the reliability labels, a dynamic index value that reflects the dynamic information epidemic risk of a region within a given time period;

This embodiment uses a processor and a memory to implement the collection module, label assignment module, geographic information parsing module, grouping module, static information epidemic risk index value calculation module and dynamic information epidemic risk index value calculation module, thereby carrying out the multi-platform information epidemic risk assessment method of the above embodiments; see the description above for details. The degree of information epidemic risk can be calculated every day, yielding both the dynamic and the static information epidemic risk indices. All of these indices are stored in the database by time and region, so querying the database gives the evolution trend of the information epidemic in each country and province from the start of monitoring to the current date, supports analysis of the mechanisms behind that trend, and allows the future evolution of the information epidemic to be predicted. This helps to issue early warnings about the information epidemic, to grasp its development quickly, to better understand and guide the mindset of internet users, and to bring out the positive role of online public opinion, which in turn helps with the real epidemic offline.

In other embodiments, the static information epidemic risk index value calculation module uses Formula 1 to calculate the static information epidemic risk index value staticIRI_{c,d}:

[Formula 1: image not reproduced]

where c denotes a region, d denotes a time period, T_{c,d} denotes all stream data with reliability labels in region c during time period d, like_i denotes the number of likes of stream data i, fans_i denotes the number of followers of the poster of stream data i, r_i denotes the unreliability score of stream data i, and a further symbol, rendered as an inline image in the original, denotes the number of posts.
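Formula 1 appears only as an image in the original and is not reproduced in this text, so the exact expression cannot be restated here. The following Python sketch shows one assumed form that is consistent with the listed symbols: the follower-weighted unreliability fans_i · r_i summed over T_{c,d} and divided by the number of posts. The function names, the dictionary keys `fans` and `r`, and the value returned for an empty group are hypothetical.

```python
def risk_index(posts, weight_key):
    """Mean of weight_i * r_i over the labelled posts of one group.

    ASSUMED form: sum(weight_i * r_i) / |T_{c,d}|. The formula in the
    application is not available in this text and may differ.
    """
    if not posts:
        return 0.0  # assumption: an empty group carries no measured risk
    total = sum(post[weight_key] * post["r"] for post in posts)
    return total / len(posts)


def static_iri(posts):
    """Assumed static index: follower counts weight the unreliability scores."""
    return risk_index(posts, "fans")
```

Under the same assumption, a dynamic counterpart could pass an engagement count such as the number of likes as `weight_key` in place of the follower count.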

The number of devices and the processing scale described here are intended to simplify the description of the present invention. Applications, modifications, and variations of the multi-platform information epidemic risk assessment method of the present invention will be apparent to those skilled in the art.

Although the embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the description and the embodiments. The invention can be applied to any field for which it is suitable, and those skilled in the art can readily make further modifications. Therefore, without departing from the general concept defined by the claims and their equivalents, the invention is not limited to the specific details or to the illustrations shown and described herein.

Claims

1. A multi-platform information epidemic risk assessment method, characterized in that it includes:

Step 1: Collect the stream data of each platform;

Step 2: For each piece of stream data, extract the domain name list, obtain the domain name redirection history information, and match it against the domain name reliability corpus to obtain the reliability label of the stream data;

Step 3: For each piece of stream data, parse the user-defined location to obtain geographic information;

Step 4: Group the stream data along the two dimensions of geographic information and time;

Step 5: For each group of stream data, based on the users' follower counts and the reliability labels, quantify the static information epidemic risk index value that reflects the static information epidemic risk level of a region within a certain time period;

Step 6: For each group of stream data, based on the numbers of likes, retweets, and comments and the reliability labels, quantify the dynamic information epidemic risk index value that reflects the dynamic information epidemic risk level of a region within a certain time period.

2. The multi-platform information epidemic risk assessment method as claimed in claim 1, further comprising:

Collect domain name reliability annotation data from multiple sources;

Build multi-class reliability labels, assign unreliability scores, and map domain name reliability labeling data from various sources to reliability labels;

According to the reliability labels, the domain name reliability labeling data from various sources is combined to form a domain name reliability corpus. For the domain name reliability labeling data with multiple reliability labels, the reliability label with the lowest unreliability score is used.

3. The multi-platform information epidemic risk assessment method as claimed in claim 1, characterized in that Formula 1 is used to calculate the static information epidemic risk index value staticIRI_{c,d}:

[Formula 1: image not reproduced]

where c denotes a region, d denotes a time period, T_{c,d} denotes all stream data with reliability labels in region c during time period d, fans_i denotes the number of followers of the poster of stream data i, and r_i denotes the unreliability score of stream data i.

4. The multi-platform information epidemic risk assessment method as claimed in claim 1, characterized in that Formula 2 is used to calculate the dynamic information epidemic risk index value dynamicIRI_{c,d}:

[Formula 2: image not reproduced]

where c denotes a region, d denotes a time period, T_{c,d} denotes all stream data with reliability labels in region c during time period d, like_i denotes the number of likes of stream data i, and r_i denotes the unreliability score of stream data i.

5. The multi-platform information epidemic risk assessment method according to claim 1, characterized in that, a regular expression is used to extract the domain name, and an asynchronous crawler is used to obtain the domain name redirection history information.

6. The multi-platform information epidemic risk assessment method according to claim 1, wherein the geographic information is obtained by parsing the user-defined location with a geocoding service provided by a geographic service provider.

7. The multi-platform information epidemic risk assessment method according to claim 1, wherein the granularity of time grouping is day, week or month, and the granularity of geographic information grouping is city, province or country.

8. The multi-platform information epidemic risk assessment method of claim 1, further comprising:

Obtain the static information epidemic risk index value and the dynamic information epidemic risk index value corresponding to each time period in the historical time, take the static information epidemic risk index value as the input and the dynamic information epidemic risk index value as the output, and train to obtain a neural network prediction model;

Input the static information epidemic risk index value of the current time period into the neural network prediction model, output the predicted value of the dynamic information epidemic risk index value, and compare the predicted value with the dynamic information epidemic risk index value of the current time period. If the error exceeds a first predetermined range, the dynamic information epidemic risk index value of the current time period is flagged; if the error exceeds a second predetermined range, the stream data of the current time period is collected again.

9. The multi-platform information epidemic risk assessment device according to claim 1, characterized by comprising:

a collection module, configured to collect the stream data of each platform;

a label assigning module, configured to, for each piece of stream data, extract the domain name list, obtain the domain name redirection history information, match it against the domain name reliability corpus, and assign a reliability label;

a geographic information parsing module, configured to parse the user-defined location of each piece of stream data to obtain geographic information;

a grouping module, configured to group the stream data along the two dimensions of geographic information and time;

a static information epidemic risk index value calculation module, configured to quantify, for each group of stream data and based on the users' follower counts and the reliability labels, the static information epidemic risk index value that reflects the static information epidemic risk level of a region within a certain time period;

a dynamic information epidemic risk index value calculation module, configured to quantify, for each group of stream data and based on the numbers of likes, retweets, and comments and the reliability labels, the dynamic information epidemic risk index value that reflects the dynamic information epidemic risk level of a region within a certain time period.

10. The multi-platform information epidemic risk assessment device according to claim 1, wherein the static information epidemic risk index value calculation module uses Formula 1 to calculate the static information epidemic risk index value staticIRI_{c,d}:

[Formula 1: image not reproduced]

and the dynamic information epidemic risk index value calculation module uses Formula 2 to calculate the dynamic information epidemic risk index value dynamicIRI_{c,d}:

[Formula 2: image not reproduced]

where c denotes a region, d denotes a time period, T_{c,d} denotes all stream data with reliability labels in region c during time period d, like_i denotes the number of likes of stream data i, fans_i denotes the number of followers of the poster of stream data i, and r_i denotes the unreliability score of stream data i.