CN117370932A - Traffic information processing and sensing method based on multi-mode data fusion sensing

Info

Publication number: CN117370932A
Application number: CN202311376178.1A
Authority: CN
Inventors: 周紫君; 张扬
Original assignee: China Academy of Transportation Sciences
Current assignee: China Academy of Transportation Sciences
Priority date: 2023-10-23
Filing date: 2023-10-23
Publication date: 2024-01-09

Abstract

The invention discloses a transportation intelligence processing and perception method based on multi-modal data fusion perception, comprising: the user inputs the field for which intelligence is to be obtained, which determines the scope of intelligence acquisition; a general-purpose large language model is called to give the hypernym set Set_up, the hyponym set Set_down, and the synonym set Set_syn of that field, from which an initial A_0 generalized search keyword library is generated; a crawler tool crawls data on the network according to the A_i-level generalized search keyword library to build an A_i-level intelligence database; the crawled data is imported into the multi-modal data fusion perception system to generate an A_i-level generalized search keyword library, and crawling is repeated until i is greater than the number of iterations set by the user; the content of the final A_i-level intelligence knowledge base is organized into a classified intelligence knowledge base by institutional source, and the development trend of the user's field of interest is obtained from the time scores and word-frequency statistics. The invention provides an effective screening mechanism for the obtained intelligence and improves the efficiency of related work.

Description

Transportation intelligence processing and perception method based on multi-modal data fusion perception

Technical field

The invention belongs to the field of data processing technology and specifically relates to a transportation intelligence processing and perception method based on multi-modal data fusion perception.

Background

For the transportation industry, research covers not only technical matters such as materials, structures, performance, patents, and algorithms, but also top-level design intelligence such as national strategies, regulations, industry standards, and the organization of highway-related departments in various countries, as well as the piloting and deployment of new high technologies. To a certain extent, this information reflects both a country's level of development in transportation and the worldwide development trend of the industry, and it is key intelligence in the field of transportation engineering.

However, this information exists on the Internet in many forms, including text, pictures, publications, radio recordings, and video, and intelligence gathering with traditional crawlers ignores every form other than text. In addition, the volume of information is huge, so crawling content with traditional crawler technology and then screening it manually item by item is like looking for a needle in a haystack. With the growth of self-media, many self-media creators in the transportation field now publish videos or PDF publications that track a particular segment of the transportation industry and offer original insights. This information is in effect pre-screened, high-quality intelligence, yet traditional intelligence acquisition methods have difficulty obtaining it. Furthermore, because the transportation engineering industry is inseparable from the development of national infrastructure construction, its research directions are not limited to technology but also include policy, regulation, and other directions. The value of the same piece of intelligence differs significantly across research directions, so an effective screening mechanism is needed to screen the obtained intelligence and improve the efficiency of related work.

At present there is no system for intelligence collection, screening, and trend perception aimed at the transportation industry. Conventional intelligence collection and trend perception rely on large amounts of manual searching, or on crawling information and then screening it manually, which is time-consuming and labor-intensive, with low retrieval efficiency and a great deal of duplicate content.

(1) In terms of search method, existing crawlers can only search conventional search engines or user-specified domains using the keywords entered by the user. The intelligence obtained this way depends heavily on how familiar the user is with the field being studied, because everything the crawler retrieves is matched against the regular expression the user enters. In the transportation industry the same entity can be expressed in different ways, which significantly affects retrieval results. The results of conventional crawlers therefore have low fault tolerance and high repetition, and perform poorly in practice.

(2) In terms of the data types crawled, a crawler can save text, picture, publication, and audio and video data, but in the transportation field the data types are rich and each is large in volume. Manually picking out the useful parts of these results is like looking for a needle in a haystack. What is missing is an effective means of cleaning, classifying, and refining the data, and of fusing data from multiple modalities during the search so that the results in turn improve search efficiency and precision.

(3) The intelligence results obtained by conventional crawlers are often just a list of web page titles, data, and original links. The content is cluttered and does not directly show the development trend of the industry, so a large amount of manual screening and verification is needed, which makes it hard to keep up with developments in transportation abroad in a timely way.

(4) Because the transportation engineering industry is inseparable from the development of national infrastructure construction, its research directions are not limited to technology but also include policy, regulation, and other directions. The value of the same piece of intelligence differs significantly across research directions, so an effective screening mechanism is needed to automatically score and screen the value of the obtained intelligence and improve the efficiency of related work. Crawler technology does not provide this.

Therefore, in terms of both data acquisition and data screening, current technology cannot meet the need for effective intelligence acquisition in the transportation field. A new intelligence collection and trend perception method for the transportation field, based on multi-modal big data fusion perception, is urgently needed to meet the industry's current needs for intelligence collection, screening, and trend perception.

Summary of the invention

To solve the problems in the prior art, the present invention provides a transportation intelligence processing and perception method based on multi-modal data fusion perception, which implements an effective screening mechanism for the obtained intelligence and improves the efficiency of related work.

An embodiment of the present invention provides a transportation intelligence processing and perception method based on multi-modal data fusion perception, comprising the following steps:

Step S1: the user inputs the field for which intelligence is to be obtained, which determines the scope of intelligence acquisition;

Step S2: a general-purpose large language model is called to give the hypernym set Set_up, the hyponym set Set_down, and the synonym set Set_syn of that field, and an initial A_0 generalized search keyword library is generated on this basis;

Step S3: a crawler tool crawls data on the network according to the A_i-level generalized search keyword library to build an A_i-level intelligence database;

Step S4: the crawled data is imported into the multi-modal data fusion perception system to generate an A_i-level generalized search keyword library, and step S3 is repeated until i is greater than the number of iterations set by the user, at which point the loop exits; the multi-modal data fusion perception system comprises a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem, wherein

the multi-modal data semanticization subsystem performs recognition and conversion on video data, picture data, PDF data, and audio data respectively, and generates the processed database A_{i-1};

the multi-modal semantic data fusion subsystem constructs a scoring system according to the characteristics of the transportation industry, evaluates the validity of the content A_{i-1} obtained from the multi-modal data semanticization subsystem, and generates a validity score for each piece of intelligence data; the intelligence is sorted and processed according to the scores, the top-ranked words are shown to the user for screening, and the user's screened results form a new generalized search keyword library A_{i+1}; the process then jumps to step S2 with A_{i+1} replacing A_0, so that the keywords refined from the first search results through data fusion and score-based selection are used to expand the search database;

wherein the validity evaluation performed by the multi-modal semantic data fusion subsystem on the retrieved content comprises:

(1) keyword recurrence score p_k: evaluates the number of times the keywords appear in the text data;

(2) timeliness score p_t: evaluates the value of the retrieved intelligence in terms of timeliness;

(3) institutional authority score p_a: judges, from the authority of the text source, the value of the e-th text under the user's needs;

the final validity score of the e-th piece of intelligence in A_{i-1} is

$$\mathrm{score} = p_k + p_t + p_a;$$

Step S5: the content of the final A_i-level intelligence knowledge base is organized into a classified intelligence knowledge base by institutional source, and the development trend of the user's field of interest is obtained from the time scores and word-frequency statistics.

Preferably, in step S2, the initial A_0 generalized search keyword library comprises:

$$\{A_0\} = \mathrm{Set}_{up} \cup \mathrm{Set}_{down} \cup \mathrm{Set}_{syn},$$

where the hypernym set is $\mathrm{Set}_{up} = \{u_1, u_2, u_3, \ldots, u_i\}$;

the hyponym set is $\mathrm{Set}_{down} = \{d_1, d_2, d_3, \ldots, d_j\}$;

the synonym set is $\mathrm{Set}_{syn} = \{s_1, s_2, s_3, \ldots, s_k\}$;

The cross-expansion set is then expressed as:

$$\mathrm{Set}_{expand} = \{u_1 d_1, u_1 d_2, u_1 d_3, \ldots, u_2 d_1, u_2 d_2, \ldots, u_3 d_1, \ldots, u_i d_j, \ldots, u_i s_k\}.$$
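The union and the cross-expansion set can be illustrated with a short Python sketch. It assumes that each cross term is a hypernym followed by a space and one hyponym or synonym (the text only lists the pairs, not how they are joined); the function name `build_keyword_library` and the sample terms are hypothetical.

```python
from itertools import product


def build_keyword_library(set_up, set_down, set_syn):
    """Return (A0, Set_expand) for the three term sets."""
    # A0 is the plain union of the three sets, as in the formula above.
    a0 = set(set_up) | set(set_down) | set(set_syn)
    # Assumption: a cross term joins one hypernym with one hyponym or
    # synonym using a space, e.g. "u1 d1" or "u1 s1".
    others = list(set_down) + list(set_syn)
    expand = [f"{u} {t}" for u, t in product(set_up, others)]
    return a0, expand


# Hypothetical sample terms, as a language model might return them.
a0, expand = build_keyword_library(
    ["transportation"], ["highway", "bridge"], ["transport"]
)
```

With these inputs `a0` holds the four distinct terms and `expand` holds one phrase per hypernym pairing.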

In any of the above solutions, preferably, in step S3 the A_i-level intelligence database contains multiple pieces of intelligence data, each comprising: the keyword used in the search, the original link of the web page, the main domain name, the time the web page first went online, and the intelligence data in the web page.

In any of the above solutions, preferably: (1) for video data, the Python cv2 (OpenCV) library is used to capture the video frame by frame automatically, and the captured frames are added to the picture data in the corresponding column; the Python moviepy library is used to separate out the audio, which is saved to the audio data in the corresponding column;

(2) for picture data and PDF data, OCR is performed with the Python Tesseract library, and the recognized text is added to the text data in the same column;

(3) for audio data, a speech recognition API is called to convert the audio to text, which is added to the text data in the same column.
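The three conversion rules above can be sketched as a small routing function. This is a minimal sketch, not the patent's implementation: the record layout (one list per modality) and the callables `split_video`, `ocr`, and `speech_to_text` are hypothetical stand-ins for wrappers around cv2 and moviepy, Tesseract, and a speech recognition API. They are stubbed here so the example runs without those libraries.

```python
def semanticize(record, split_video, ocr, speech_to_text):
    """Convert every non-text modality of one record into text."""
    images = list(record.get("image", []))
    audio = list(record.get("audio", []))
    # (1) Video: frames join the picture column, the sound track joins
    # the audio column.
    for clip in record.get("video", []):
        frames, track = split_video(clip)
        images.extend(frames)
        audio.append(track)
    text = list(record.get("text", []))
    # (2) Pictures (including video frames) and PDFs go through OCR.
    for blob in images + list(record.get("pdf", [])):
        text.append(ocr(blob))
    # (3) Audio (including separated sound tracks) goes through speech
    # recognition.
    for blob in audio:
        text.append(speech_to_text(blob))
    return {**record, "image": images, "audio": audio, "text": text}


# Stub converters that only label their input, for illustration.
record = {"text": ["t0"], "video": ["v1"], "image": ["i1"], "pdf": ["p1"]}
result = semanticize(
    record,
    split_video=lambda clip: ([clip + "-frame0"], clip + "-audio"),
    ocr=lambda blob: "ocr(" + blob + ")",
    speech_to_text=lambda blob: "stt(" + blob + ")",
)
```

Passing the converters in as arguments keeps the routing logic independent of whichever OCR or speech service is actually used.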

In any of the above solutions, preferably, the multi-modal semantic data fusion subsystem performs keyword recurrence scoring as follows:

Let the number of occurrences of the j-th keyword of A_i in the e-th piece of text data be a_l, and let the word count of that piece of text data be AWN; the keyword recurrence score p_k of the e-th piece of text data is then as follows:
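The expression for p_k is not reproduced in this text. Purely as an illustration of the two quantities just defined, the sketch below assumes that the score is the total number of keyword occurrences divided by the word count AWN. That normalization is an assumption, not the patent's formula, and the function name and tokenization are hypothetical.

```python
import re


def keyword_recurrence_score(text, keywords):
    """Assumed form: sum of keyword occurrence counts divided by AWN."""
    words = re.findall(r"\w+", text.lower())
    awn = len(words)  # AWN: number of words in this piece of text
    if awn == 0:
        return 0.0
    total = 0
    for keyword in keywords:
        parts = keyword.lower().split()
        if not parts:
            continue
        n = len(parts)
        # Count positions where the (possibly multi-word) keyword occurs.
        total += sum(
            1 for i in range(awn - n + 1) if words[i:i + n] == parts
        )
    return total / awn


sample = "Smart highway pilots expand as highway agencies adopt sensors"
p_k = keyword_recurrence_score(sample, ["highway", "sensors"])
```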

In any of the above solutions, preferably, the multi-modal semantic data fusion subsystem performs timeliness scoring as follows:

First, the e pieces of intelligence are checked for exact equality; identical items are deleted from A_{i-1} as duplicates, and the duplicates are extracted into a subset set. Next, generalized deduplication is applied to the deduplicated A_{i-1}. Finally, the intelligence in the deduplicated A_{i-1} is scored for timeliness.

In any of the above solutions, preferably, the multi-modal semantic data fusion subsystem performs generalized deduplication on the deduplicated A_{i-1} as follows:

First, the window length and overlap length are set according to the size of the computer's storage space and the volume of retrieved text data, and the e-th piece of text intelligence is split into a set Set_te of i_e segment elements;

Then each segment element in Set_te is split into words and stemmed, word frequencies are computed and hash-encoded, and the compressed code of the segment element is obtained;

Finally, the Hamming distances between the compressed codes of the segment elements in the set are computed, similarity is calculated, and duplicates are extracted into a subset set.

In any of the above solutions, preferably, the multi-modal semantic data fusion subsystem scores the timeliness of the intelligence in the deduplicated A_{i-1} as follows:

the time span coefficient T_range of each piece of intelligence in A_{i-1} is calculated;

based on the time each web page first went online, the earliest online time of each piece of intelligence is extracted, a vector is constructed and normalized, and the timeliness coefficient T_early is obtained;

from the time span coefficient T_range and the timeliness coefficient T_early, the timeliness score p_t is calculated as:

$$p_t = k_{range} T_{range} + k_{early} T_{early};$$

where k_range is the weight of the time span coefficient and k_early is the weight of the timeliness coefficient.

In any of the above solutions, preferably, the data source categories used for the institutional authority score p_a, determined from the main domain name, include: news, forum, government, think tank, and technology.

The transportation intelligence processing and perception method based on multi-modal data fusion perception according to the embodiment of the present invention has the following beneficial effects:

(1) A mature large language model is used to associate the hypernyms, hyponyms, and synonyms of the keywords entered by the user, and an algorithm builds the keyword search library on this basis. Expanding the search scope around the user's keywords reduces, to a certain extent, the problem of low-quality intelligence in the first search caused by the user's insufficient familiarity with the field.

(2) Intelligence in the transportation field exists in many forms, whereas traditional crawlers can only download data and cannot parse it. To address this, the invention builds a multi-modal data semanticization system based on speech recognition and OCR technology, which converts multi-modal data into text data and makes full use of all existing types of intelligence.

(3) To address the differences among research questions in the transportation field, a multi-modal semantic data fusion perception system is designed. For the two main directions of policy research and technical research in transportation, an intelligence value evaluation algorithm is designed from three perspectives: topic relevance, intelligence timeliness (time span and time priority), and the value of the intelligence source. The semanticized intelligence data is screened in depth according to the conventional understanding of the transportation industry, and trend perception in the two basic directions is achieved through value ranking, intelligence timeliness scores, and word-frequency statistics.

(4) Based on the industry trends analyzed by the multi-modal semantic data fusion perception system, natural language processing is further used to offer the user recommended keyword categories after the initial search. The keyword library for the next search is then constructed from the user's selection in a semi-supervised manner, which achieves immediate, targeted keyword expansion grounded in knowledge of the transportation industry, greatly widens the search scope, reduces the inclusion of invalid intelligence, and improves search efficiency.

Brief description of the drawings

Figure 1 is a flow chart of the transportation intelligence processing and perception method based on multi-modal data fusion perception according to an embodiment of the present invention.

Detailed description

For a better understanding of the invention, it is described in detail below with reference to specific embodiments.

As shown in Figure 1, the transportation intelligence processing and perception method based on multi-modal data fusion perception according to the embodiment of the present invention comprises the following steps:

Step S1: the user inputs the field for which intelligence is to be obtained, which determines the scope of intelligence acquisition.

Step S2: a general-purpose large language model is called to give the hypernym set Set_up, the hyponym set Set_down, and the synonym set Set_syn of that field, and an initial A_0 generalized search keyword library is generated on this basis.

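The general-purpose large language model may be any open-source or commercial large language model. The keywords in the initial generalized search keyword library comprise

$$\{A_0\} = \mathrm{Set}_{up} \cup \mathrm{Set}_{down} \cup \mathrm{Set}_{syn},$$

where the hypernym set is $\mathrm{Set}_{up} = \{u_1, u_2, u_3, \ldots, u_i\}$, the hyponym set is $\mathrm{Set}_{down} = \{d_1, d_2, d_3, \ldots, d_j\}$, and the synonym set is $\mathrm{Set}_{syn} = \{s_1, s_2, s_3, \ldots, s_k\}$.

The cross-expansion set is then expressed as:

$$\mathrm{Set}_{expand} = \{u_1 d_1, u_1 d_2, u_1 d_3, \ldots, u_2 d_1, u_2 d_2, \ldots, u_3 d_1, \ldots, u_i d_j, \ldots, u_i s_k\}.$$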

Step S3: a crawler tool crawls data on the network according to the A_i-level generalized search keyword library to build an A_i-level intelligence database.

In this step, the A_i-level intelligence database contains multiple pieces of intelligence data, each comprising: the keyword used in the search, the original link of the web page, the main domain name, the time the web page first went online, and the intelligence data in the web page.

Specifically, the crawler tool can obtain the text in a web page, download pictures, PDFs, audio, and video, and read the time the page went online; the captured data forms the A_i-level intelligence database. The intelligence data in a web page takes the form of text data, picture data, audio and video data, PDF publications, and similar data.

Step S4: the crawled data is imported into the multi-modal data fusion perception system to generate an A_i-level generalized search keyword library, and step S3 is repeated until i is greater than the number of iterations set by the user, at which point the loop exits.
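The alternation between steps S3 and S4 can be sketched as a loop. This is a minimal sketch under stated assumptions: `crawl` and `refine` are hypothetical callables standing in for the crawler tool and the multi-modal data fusion perception system, and the counter is assumed to start at 1 so that exactly `max_iterations` rounds run before i exceeds the limit.

```python
def run_search_loop(a0, crawl, refine, max_iterations):
    """Alternate crawling (S3) and keyword refinement (S4)."""
    keywords = list(a0)
    levels = []  # levels[n] holds the intelligence database of round n + 1
    i = 1  # assumption: counting starts at 1
    while i <= max_iterations:  # exit once i exceeds the user's limit
        database = crawl(keywords)    # step S3: build the intelligence database
        keywords = refine(database)   # step S4: derive the next keyword library
        levels.append(database)
        i += 1
    return keywords, levels


# Stub crawler and refiner that only transform strings, for illustration.
final_keywords, levels = run_search_loop(
    ["bridge"],
    crawl=lambda kws: [kw + " report" for kw in kws],
    refine=lambda db: [doc.upper() for doc in db],
    max_iterations=2,
)
```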

In the embodiment of the present invention, the multi-modal data fusion perception system comprises a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem.

The multi-modal data semanticization subsystem performs recognition and conversion on video data, picture data, PDF data, and audio data respectively, and generates the processed database A_{i-1}.

(1) For video data, the Python cv2 (OpenCV) library is first used to capture the video frame by frame automatically, and the captured frames are added to the picture data in the corresponding column; the Python moviepy library is used to separate out the audio track, which is saved to the audio data in the corresponding column.

(2) For picture data and PDF data, OCR is performed with the Python Tesseract library, and the recognized text is added to the text data in the same column.

(3) For audio data, a speech recognition API is called to convert the audio to text, which is added to the text data in the same column.

(4) The processed database is named A_{i-1}.

The multi-modal semantic data fusion subsystem constructs a scoring system according to the characteristics of the transportation industry, evaluates the validity of the retrieved content, and generates a validity score for each piece of intelligence data. The intelligence is sorted and processed according to the scores, the top-ranked words are shown to the user for screening, and the user's screened results form a new generalized search keyword library A_{i+1}; the process then jumps to step S3.

The validity evaluation performed by the multi-modal semantic data fusion subsystem on the retrieved content comprises:

(1) Keyword recurrence score p_k: evaluates the number of times the keywords appear in the text data.

(2) Timeliness score p_t: evaluates the value of the retrieved intelligence in terms of timeliness.

The multi-modal semantic data fusion subsystem performs generalized deduplication on the deduplicated A_{i-1}, comprising the following steps:

First, chunk segmentation is performed. The sliding window length a and the overlap length b are determined according to the size of the computer's storage space and the volume of retrieved text data, where a is much larger than b. Splitting the e-th piece of text intelligence, of length d, in this way yields a set Set_te of i_e segment elements.
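Chunk segmentation with a window of length a and an overlap of length b can be sketched as follows. The sketch assumes that lengths are measured in words (the text does not state the unit) and that a final, shorter segment is kept; the function name is hypothetical.

```python
def split_into_segments(words, window, overlap):
    """Split a word list into windows of `window` words that share
    `overlap` words with their predecessor."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    segments = []
    start = 0
    while start < len(words):
        segments.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return segments


# Ten placeholder tokens, window a = 4, overlap b = 1.
segments = split_into_segments(list(range(10)), window=4, overlap=1)
```

Each segment starts a - b positions after the previous one, so a text of d words yields roughly d / (a - b) segments.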

Then, compression encoding is performed. Each segment element in Set_te is split into words and stemmed, word frequencies are computed and hash-encoded, and the compressed code of the segment element is obtained.

Specifically, based on an open-source corpus, the Python NLTK library is used to split each segment element in Set_te into words and to stem them, which removes the influence of inflection and function words. Word frequencies are computed, and an ordinary hash transformation is applied to the top a*20% of words by frequency, giving the hash code of each word. The hash is a binary code in which each 0 is mapped to -1 and each 1 to 1, so that 00101 becomes [-1, -1, 1, -1, 1]. The transformed results are stored as row vectors and summed column by column to give the compressed code of the segment element.
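The encoding just described resembles SimHash, and a minimal sketch follows. Several details are assumptions rather than the patent's choices: a 64-bit code taken from an MD5 digest, lowercasing in place of NLTK tokenization and stemming (so the example has no external dependency), and a final step that turns each column sum into one bit by its sign so that codes can later be compared with XOR.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed code length; the text does not fix one


def word_bits(word):
    """Hash a word and map each bit to +1 (bit 1) or -1 (bit 0)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[:BITS // 8], "big")
    return [1 if (value >> k) & 1 else -1 for k in range(BITS)]


def segment_code(words, top_fraction=0.2):
    """Compressed code of one segment element."""
    counts = Counter(w.lower() for w in words)
    # Keep the most frequent words: 20% of the segment length, at least one.
    keep = max(1, int(len(words) * top_fraction))
    top_words = [w for w, _ in counts.most_common(keep)]
    # Each word contributes one row of +1/-1 values; sum column by column.
    sums = [0] * BITS
    for w in top_words:
        for k, b in enumerate(word_bits(w)):
            sums[k] += b
    # Assumption: a positive column sum becomes bit 1, anything else bit 0.
    code = 0
    for k, s in enumerate(sums):
        if s > 0:
            code |= 1 << k
    return code


code = segment_code("the bridge deck and the bridge pier".split())
```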

Finally, the Hamming distances between the compressed codes of the segment elements in the set are computed, similarity is calculated, and duplicates are extracted into a subset set.

Specifically, the Hamming distance between the compressed codes of the segment elements in the set is computed (an XOR is taken between two codes and the number of 1s in the result is counted). The results are sorted in reverse order and the top c% of pairs are regarded as similar (c is set by the user). Duplicates are then deleted from A_{i-1} according to this similarity and extracted into a subset set.
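The pairwise comparison can be sketched as follows. The sketch assumes that the codes are integers and that the "top c%" means the c% of pairs with the smallest Hamming distance, since a smaller distance indicates more similar segments; the function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of 1 bits in the XOR of two integer codes."""
    return bin(a ^ b).count("1")


def similar_pairs(codes, c_percent):
    """Return index pairs whose codes are among the closest c% of all pairs."""
    pairs = [
        (hamming(codes[i], codes[j]), i, j)
        for i, j in combinations(range(len(codes)), 2)
    ]
    pairs.sort()  # assumption: smallest distance first
    keep = int(len(pairs) * c_percent / 100)
    return [(i, j) for _, i, j in pairs[:keep]]


# Four 4-bit placeholder codes; 34% of the six pairs keeps two pairs.
pairs = similar_pairs([0b1111, 0b1110, 0b0000, 0b0001], c_percent=34)
```

Comparing every pair costs quadratic time in the number of segments, which is one reason to keep the window length a large relative to the overlap b.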

The multi-modal semantic data fusion subsystem scores the timeliness of the intelligence in the deduplicated A_{i-1}. For the e-th piece of intelligence in the deduplicated A_{i-1}, timeliness is evaluated mainly from how long the intelligence persists over time and when it first appeared. For policy research, the longer the period over which a policy document appears in the intelligence, the greater its supporting role for the industry and the higher its timeliness score. Technical research is the opposite: the larger the time span, the more mature the research content and the less innovative it is. Evaluation must therefore be carried out separately for different user needs.

Specifically, timeliness scoring comprises the following steps:

(1) Calculate the time span coefficient T_range of each piece of intelligence in A_{i-1}.

Based on the time each web page first went online, the time span of each piece of intelligence in A_{i-1} is calculated. For intelligence with no duplicates, the time span is 0. For intelligence with a similar subset, the time span is the difference, in days, between the latest and the earliest item in that subset. The time spans of all intelligence entries form a new vector, which is normalized; the resulting values are the time span coefficients.

(2) Based on the time each web page first went online, the earliest online time of each piece of intelligence is extracted, a vector is constructed and normalized, and the timeliness coefficient T_early is obtained.

The weights k_range and k_early are set by task: for high-tech intelligence tasks, k_range = 0.2 and k_early = 0.8; for mature-technology intelligence tasks, k_range = 0.6 and k_early = 0.4; for policy-support intelligence, k_range = 0.7 and k_early = 0.3; for general policy tasks, k_range = 0.6 and k_early = 0.4.
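The two coefficients and the weighted sum p_t = k_range T_range + k_early T_early can be illustrated with a short sketch. Two points are assumptions: the normalization is taken to be min-max scaling, and a later first-online date is taken to give a larger T_early (the text only says the vectors are normalized). The task labels and function names are hypothetical; the weights are those listed above.

```python
from datetime import date

# (k_range, k_early) per task type, as listed in the text.
WEIGHTS = {
    "high_tech": (0.2, 0.8),
    "mature_tech": (0.6, 0.4),
    "policy_support": (0.7, 0.3),
    "general_policy": (0.6, 0.4),
}


def min_max(values):
    """Assumed normalization: scale values linearly into [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def timeliness_scores(groups, task):
    """groups: for each item, the online dates of the item and its
    near-duplicates (a single date if it has none)."""
    k_range, k_early = WEIGHTS[task]
    spans = [(max(g) - min(g)).days for g in groups]  # 0 without duplicates
    earliest = [min(g).toordinal() for g in groups]
    t_range = min_max(spans)
    t_early = min_max(earliest)
    return [k_range * r + k_early * e for r, e in zip(t_range, t_early)]


groups = [
    [date(2023, 1, 1), date(2023, 1, 11)],  # item with one near-duplicate
    [date(2023, 1, 6)],
    [date(2023, 1, 11)],
]
scores = timeliness_scores(groups, "high_tech")
```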

(3) Institutional authority score p_a: judge the value of the e-th text under the user's needs from the perspective of the authority of the text source.

In the embodiment of the present invention, the data sources distinguished by main domain name for the institutional authority score p_a include: news, forum, government, think tank, technology, and so on. The data sources for p_a are not limited to these examples; other categories may be included and the settings may be adjusted by the user as needed, which is not described further here.

Specifically, according to the main domain name, data sources fall mainly into five categories: news, forum, government, think tank, and technology. The intelligence value of a given source differs markedly between research directions. Research in the transportation industry mainly comprises technical research and policy research, and the scores of the five sources under each research direction are given in Table 1. In practice, the institutional source of a data item is determined by matching its main domain name entry against a preset domain-to-institution mapping library. In Table 1, the five sources are denoted 1: news p_news, 2: forum p_blog, 3: government p_government, 4: think tank p_thinktank, 5: technology p_tech.

Table 1. Scores of the five sources

| Research direction | p_news | p_blog | p_government | p_thinktank | p_tech |
| Technical          | 0.2    | 0.2    | 0.4          | 0.6         | 0.8    |
| Policy             | 0.6    | 0.6    | 0.8          | 0.6         | 0.2    |
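The authority lookup can be sketched as two dictionary lookups: main domain name to source category, then category to the Table 1 score for the chosen research direction. The domain entries below are hypothetical placeholders for the preset domain-to-institution mapping library, and returning 0.0 for an unmapped domain is an assumption; the text does not say how unknown sources are handled.

```python
# Scores from Table 1, keyed by research direction and source category.
AUTHORITY_TABLE = {
    "technical": {"news": 0.2, "blog": 0.2, "government": 0.4,
                  "thinktank": 0.6, "tech": 0.8},
    "policy": {"news": 0.6, "blog": 0.6, "government": 0.8,
               "thinktank": 0.6, "tech": 0.2},
}

# Hypothetical domain-to-category mapping; real entries would come from
# the preset mapping library mentioned in the text.
DOMAIN_CATEGORY = {
    "news.example.com": "news",
    "forum.example.org": "blog",
    "transport.example.gov": "government",
    "institute.example.org": "thinktank",
    "journal.example.net": "tech",
}

def authority_score(main_domain, direction, default=0.0):
    """Return p_a for a main domain under 'technical' or 'policy' research."""
    category = DOMAIN_CATEGORY.get(main_domain)
    if category is None:
        return default  # assumption: unmapped sources get the default score
    return AUTHORITY_TABLE[direction][category]
```

For example, `authority_score("transport.example.gov", "policy")` looks up the government column of the policy row.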

(4) The final validity score of the e-th piece of intelligence in A_i-1 is

score = p_k + p_t + p_a.

The intelligence is sorted by score and the top e*b items are kept (b is a user-defined proportion). Phrases are extracted with the Python nltk library, word frequencies are computed and sorted, and the top c% of phrases are kept (c is a user-defined proportion). The phrases are then classified using the Hamming distances obtained for the phrase sources during similarity calculation, and the ten most frequent words in each class are shown to the user as representatives of that subclass, so that the user can screen keywords quickly. The screening result generates a new generalized search keyword library A_i+1, and the method jumps to step 2. A_i+1 replaces A₀, so that keywords distilled from the first retrieval through data fusion and score-based selection are used to enlarge the retrieval database.
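The ranking and frequency steps can be sketched with the standard library alone. This is an illustration under stated assumptions: the text extracts phrases with nltk, whereas this sketch swaps in a plain regular-expression word split so that it runs without extra downloads; b is taken as a fraction and c as a percentage; rounding up when cutting the lists is also an assumption. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def total_score(p_k, p_t, p_a):
    """Validity score of one item: score = p_k + p_t + p_a."""
    return p_k + p_t + p_a

def keep_top(items, scores, b):
    """Sort items by score (descending) and keep the top fraction b."""
    order = sorted(range(len(items)), key=lambda i: (-scores[i], i))
    n_keep = math.ceil(len(items) * b)
    return [items[i] for i in order[:n_keep]]

def top_terms(texts, c):
    """Count word frequencies and keep the top c percent of distinct words.

    Words are split with a regex here; the text uses nltk phrase extraction.
    """
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z]+", text.lower())
    )
    n_keep = math.ceil(len(counts) * c / 100)
    return counts.most_common(n_keep)

kept = keep_top(["a", "b", "c", "d"], [0.5, 1.2, 0.9, 0.1], b=0.5)
terms = top_terms(["road freight road", "freight road port"], c=34)
```

The grouping of phrases by Hamming distance and the top-ten display per class are left out; they would operate on the output of `top_terms`.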

The transportation intelligence processing and sensing method based on multi-modal data fusion sensing according to the embodiments of the present invention addresses the problem that conventional intelligence collection systems gather transportation intelligence inefficiently, yield little useful information, and struggle to give researchers and policy makers in related industries useful information on industry development trends. The method implements a screening mechanism that automatically scores and filters the acquired intelligence, improving the efficiency of related work. It cleans, classifies and refines the data, and it fuses data of multiple modalities during the search, which in turn improves search efficiency and precision and makes it possible to keep up with developments in transportation abroad in a timely manner.

Special note: the technical solution of the present invention involves many parameters, and the synergy among them must be considered as a whole in order to obtain the beneficial effects and significant progress of the present invention. The value range of each parameter in the technical solution was obtained through extensive testing. For each parameter and for the combinations of parameters, the inventors recorded a large amount of test data; for reasons of space, the specific test data are not disclosed here.

Those skilled in the art will readily understand that the transportation intelligence processing and sensing method based on multi-modal data fusion sensing of the present invention includes any combination of the parts set out in the summary and the detailed description of this specification and shown in the drawings. For reasons of space and to keep the specification concise, the solutions formed by these combinations are not described one by one. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A transportation intelligence processing and sensing method based on multi-modal data fusion sensing, characterized by comprising the following steps:

step S1, a user inputs the field in which intelligence is to be acquired, so as to determine the scope of intelligence acquisition;

step S2, calling a general-purpose large language model to give the hypernym set Set_up, the hyponym set Set_down and the synonym set Set_syn of the field, and generating an initial A₀ generalized search keyword library on that basis;

step S3, crawling data on the network with a crawler tool according to the A_i-level generalized search keyword library, to construct an A_i-level intelligence library;

step S4, importing the crawled data into a multi-modal data fusion sensing system to generate an A_i-level generalized search keyword library, and repeating step S3 until i is greater than the number of iterations defined by the user, at which point the loop exits; the multi-modal data fusion sensing system includes a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem, wherein

the multi-modal data semanticization subsystem is used to rapidly perform recognition and conversion on video data, picture data, PDF data and audio data respectively, and to generate a processed database A_i-1;

the multi-modal semantic data fusion subsystem is used to construct a scoring system according to the characteristics of the transportation industry, evaluate the validity of the retrieved content A_i-1 obtained by the multi-modal data semanticization subsystem, generate a validity score for each piece of intelligence data, sort and process the intelligence according to the scores, display the top-ranked words to the user for screening, generate a new generalized search keyword library A_i+1 from the user's screening result, jump to step S2, and substitute A_i+1 for A₀, so that the keywords extracted from the first retrieval through data fusion and score-based selection are used to enlarge the retrieval database;

the validity evaluation performed by the multi-modal semantic data fusion subsystem on the retrieved content comprises:

(1) keyword recurrence score p_k: evaluated from the number of times the keywords appear in the text data;

(2) timeliness score p_t: evaluating the value of the retrieved intelligence from the perspective of timeliness;

(3) authority score p_a: judging the value of the e-th text under the user's needs from the perspective of the authority of the text source;

the final validity score of the e-th piece of intelligence in A_i-1 is:

score = p_k + p_t + p_a;

step S5, for the finally obtained A_i-level intelligence knowledge base content, generating a classified intelligence knowledge base according to institutional source, and obtaining the development trend of the user's field of interest from the time scores and the word-frequency statistics.

2. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S2, said initial A₀ generalized search keyword library includes:

{A₀} = Set_up ∪ Set_down ∪ Set_syn,

wherein the hypernym set is: Set_up = {u₁, u₂, u₃, ..., u_i};

the hyponym set is: Set_down = {d₁, d₂, d₃, ..., d_j};

the synonym set is: Set_syn = {s₁, s₂, s₃, ..., s_k};

and the cross-expansion set is expressed as:

Set_expand = {u₁d₁, u₁d₂, u₁d₃, ..., u₂d₁, u₂d₂, ..., u₃d₁, ..., u_i d_j, ..., u_i s_k}.
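The set construction in claim 2 can be illustrated with a short Python sketch. It reads the cross-expansion set as every hypernym paired with every hyponym and every synonym, which is one interpretation of the listed terms u_i d_j and u_i s_k; joining the two words with a space to form a search phrase is also an assumption. The example words are hypothetical and stand in for the output of the language model.

```python
from itertools import product

def build_keyword_library(set_up, set_down, set_syn):
    """Return (A0, Set_expand) for the three word sets.

    A0 is the plain union of the three sets; Set_expand pairs each
    hypernym with each hyponym and each synonym (assumed reading).
    """
    a0 = set(set_up) | set(set_down) | set(set_syn)
    partners = list(set_down) + list(set_syn)
    set_expand = {f"{u} {x}" for u, x in product(set_up, partners)}
    return a0, set_expand

# Hypothetical example words.
a0, expanded = build_keyword_library(
    set_up=["transport"],
    set_down=["rail", "road"],
    set_syn=["transit"],
)
```

With one hypernym, two hyponyms and one synonym, the expansion yields three combined phrases in addition to the four single keywords.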

3. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S3, said A_i-level intelligence library comprises a plurality of pieces of intelligence data, and each piece of intelligence data comprises: the keyword used during retrieval, the original link of the web page, the main domain name, the initial online time of the web page, and the intelligence data in the web page.

4. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as claimed in claim 1, wherein:

(1) for video data, images are captured automatically frame by frame using the Python cv2 library, and the resulting frames are added to the picture data of the corresponding column; the audio is separated from the video using the python movie library, and the audio data is stored in the corresponding column;

(2) for picture data and PDF data, the content is recognized by OCR using the Python Tesseract library, and the generated text is added to the text data in the same column;

(3) for audio data, a speech recognition API is called to convert the audio into text, which is added to the text data in the same column.

5. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the keyword recurrence scoring performed by the multi-modal semantic data fusion subsystem comprises the following steps:

let the number of occurrences of the l-th keyword of A_i in the e-th text data be a_l, and let the word count of that text data be AWN; the keyword recurrence score p_k of the e-th text data is as follows:

6. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the timeliness scoring performed by the multi-modal semantic data fusion subsystem comprises the following steps:

first, judging whether any of the e pieces of intelligence are exactly equal and, if so, deleting the repeated items from A_i-1 and extracting them as a subset; then performing generalized deduplication on the deduplicated A_i-1; finally, scoring the timeliness of the intelligence in the deduplicated A_i-1.

7. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the generalized deduplication performed by the multi-modal semantic data fusion subsystem on the deduplicated A_i-1 comprises the following steps:

first, setting a window length and an overlap length according to the size of the computer storage space and the volume of the retrieved text data, and splitting the e-th text into a set Set_te of i_e sentence-segment elements;

then, for each sentence-segment element in Set_te, segmenting it into words, extracting word stems, calculating word frequencies, and applying hash coding to obtain the compressed code of the sentence-segment element;

finally, calculating the Hamming distance between the compressed codes of the sentence-segment elements in the set, performing the similarity calculation, and extracting the repeated items as a subset.
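The three steps of claim 7 resemble a SimHash-style fingerprint comparison, and the following Python sketch illustrates them under that reading. It is not the patented implementation: the claim does not name the hash scheme, so the frequency-weighted 64-bit fingerprint built from MD5 is an assumption; stemming is omitted (tokens are only lowercased by the caller); and the window length, overlap and distance threshold are hypothetical values.

```python
import hashlib
from collections import Counter

def windows(words, length, overlap):
    """Split a word list into overlapping windows (sentence-segment elements)."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window length")
    stop = max(len(words) - overlap, 1)
    return [words[i:i + length] for i in range(0, stop, step)]

def simhash(tokens, bits=64):
    """Frequency-weighted fingerprint: each token votes on every bit."""
    votes = [0] * bits
    for token, freq in Counter(tokens).items():
        digest = hashlib.md5(str(token).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            votes[b] += freq if (h >> b) & 1 else -freq
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicates(segments, threshold=3):
    """Return index pairs whose fingerprints differ by at most threshold bits."""
    codes = [simhash(seg) for seg in segments]
    return [
        (i, j)
        for i in range(len(codes))
        for j in range(i + 1, len(codes))
        if hamming(codes[i], codes[j]) <= threshold
    ]

text = "the port authority opened a new freight rail link this week".split()
segments = windows(text, length=6, overlap=2)
```

Comparing every pair is quadratic in the number of segments; for a large crawl the fingerprints would normally be bucketed first, which this sketch does not attempt.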

8. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the timeliness scoring performed by the multi-modal semantic data fusion subsystem on the intelligence in the deduplicated A_i-1 comprises the following steps:

calculating the time span coefficient T_range of each piece of intelligence in A_i-1;

extracting the earliest online time of each piece of intelligence from the initial online time of the web page, constructing a vector, normalizing it, and obtaining the timeliness coefficient T_early;

calculating the timeliness score p_t from the time span coefficient T_range and the timeliness coefficient T_early:

p_t = k_range T_range + k_early T_early;

wherein k_range is the weight of the time span coefficient and k_early is the weight of the timeliness coefficient.

9. The transportation intelligence processing and sensing method based on multi-modal data fusion sensing as claimed in, wherein the data sources distinguished by main domain name for the authority score p_a comprise: news, forum, government, think tank, and technology.