CN117370932A - Traffic information processing and sensing method based on multi-mode data fusion sensing - Google Patents


Info

Publication number
CN117370932A
Authority
CN
China
Prior art keywords
data
information
data fusion
sensing
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311376178.1A
Other languages
Chinese (zh)
Inventor
周紫君
张扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Transportation Sciences
Original Assignee
China Academy of Transportation Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Transportation Sciences
Priority to CN202311376178.1A
Publication of CN117370932A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a traffic information processing and sensing method based on multi-modal data fusion sensing, comprising the following steps: the user inputs the domain for which intelligence is to be acquired, so as to determine the scope of intelligence acquisition; a general-purpose large language model is called to give the hypernym set $Set_{up}$, hyponym set $Set_{down}$, and synonym set $Set_{syn}$ of that domain, and the initial generalized search keyword library $A_0$ is generated on that basis; a crawler tool crawls data from the network according to the level-$A_i$ generalized search keyword library to construct the level-$A_i$ intelligence library; the crawled data are imported into a multi-modal data fusion sensing system to generate the level-$A_i$ generalized search keyword library, and the iteration count is increased until i is greater than the user-defined number of iterations; for the finally obtained level-$A_i$ intelligence knowledge base, a classified intelligence knowledge base is generated according to the source organizations, and the development trend of the field of interest to the user is obtained from the timeliness scores and word frequency statistics. The invention screens the acquired information through an effective screening mechanism and improves the efficiency of related work.

Description

Traffic information processing and sensing method based on multi-mode data fusion sensing
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a traffic information processing and sensing method based on multi-mode data fusion sensing.
Background
For the transportation industry, research content includes, but is not limited to, materials, structures, performance, patents, and algorithms from a technical point of view, as well as top-level design information such as national strategies, regulations, industry specifications, and the organizational structure of highway-related departments in each country, together with the pilot programs and deployments of new and advanced technologies. This information reflects a country's level of development in the transportation field and the industry's development trends worldwide, and is therefore key intelligence for transportation engineering.
However, this information exists on the Internet in many forms, such as text, pictures, publications, audio recordings, and video, and traditional crawlers ignore forms other than text. Moreover, the volume of information is huge: acquiring content with traditional crawler technology and then screening it manually item by item is like looking for a needle in a haystack. With the growth of self-media, many self-media creators in the transportation field now publish videos or PDF publications that track a particular subdivision of the transportation industry and offer original insight. Such content is effectively pre-screened, high-quality intelligence, yet it is difficult to obtain with traditional acquisition methods. In addition, because the traffic engineering industry is inseparable from national infrastructure construction, research is not limited to technology but also covers policies, regulations, and similar directions. The value of the same piece of information differs markedly between research directions, so an effective screening mechanism is needed to screen the acquired information and improve the efficiency of related work.
At present there is no system that collects, screens, and senses trends in transportation-industry intelligence. Conventional intelligence collection and trend sensing rely on extensive manual searching, or on manual screening after a crawler has captured the information. This is time-consuming and labor-intensive, retrieval efficiency is low, and the results contain much duplicate content.
(1) In terms of search method, existing crawlers can only search a conventional search engine, or domain names specified by the user, using keywords entered by the user. The information obtained therefore depends heavily on the user's familiarity with the field, and everything the crawler collects is matched against regular expressions supplied by the user. In the transportation industry the same entity is often expressed in different ways, which noticeably affects search results. Conventional crawlers thus have low fault tolerance and return highly repetitive results, so their practical effectiveness is poor.
(2) In terms of data types, crawlers can store text, picture, publication, and audio/video data, but in the transportation field the data types are rich and each is large in volume. Manually picking the useful parts out of the results is like looking for a needle in a haystack. There is no effective means to clean, classify, and refine the data, or to fuse the multi-modal data during retrieval so that it in turn improves search efficiency and accuracy.
(3) The result of a conventional crawler is usually just a list of web page titles, data, and original links. The content is cluttered and does not directly show industry trends; extensive manual screening and verification are needed, and developments in the transportation field abroad are hard to track in a timely manner.
(4) Because the traffic engineering industry is inseparable from national infrastructure construction, research is not limited to technology but also covers policies, regulations, and similar directions. The value of the same piece of information differs markedly between research directions, so an effective screening mechanism is needed to automatically score and screen the acquired information by value and thereby improve the efficiency of related work. Crawler technology does not provide this.
Therefore, in terms of both data acquisition and data screening, current technology cannot meet the need for effective intelligence acquisition in the transportation field. A new intelligence collection and trend sensing method based on multi-modal big data fusion sensing is urgently needed for the transportation field, to meet the industry's current needs for intelligence collection, screening, and trend sensing.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a traffic information processing and sensing method based on multi-modal data fusion sensing, which screens the acquired information through an effective screening mechanism and improves the efficiency of related work.
The embodiment of the invention provides a traffic information processing and sensing method based on multi-mode data fusion sensing, which comprises the following steps:
Step S1: the user inputs the domain for which intelligence is to be acquired, so as to determine the scope of intelligence acquisition;
Step S2: a general-purpose large language model is called to give the hypernym set $Set_{up}$, hyponym set $Set_{down}$, and synonym set $Set_{syn}$ of that domain, and the initial generalized search keyword library $A_0$ is generated on that basis;
Step S3: a crawler tool crawls data from the network according to the level-$A_i$ generalized search keyword library to construct the level-$A_i$ intelligence library;
Step S4: the crawled data are imported into a multi-modal data fusion sensing system to generate the level-$A_i$ generalized search keyword library, and step S3 is repeated until i is greater than the user-defined number of iterations, at which point the loop exits. The multi-modal data fusion sensing system comprises a multi-modal data semantization subsystem and a multi-modal semantic data fusion subsystem, wherein
the multi-modal data semantization subsystem performs recognition and conversion on video data, picture data, PDF data, and audio data respectively, and generates the processed database $A_{i-1}$;
the multi-modal semantic data fusion subsystem constructs a scoring system according to the characteristics of the transportation industry; evaluates the validity of the content $A_{i-1}$ obtained by the multi-modal data semantization subsystem; generates a validity score for each piece of intelligence data; sorts and processes the intelligence according to the scores; displays the top-ranked words to the user for screening; generates a new generalized search keyword library $A_{i+1}$ from the user's screening result; and jumps to step S2 with $A_{i+1}$ replacing $A_0$, so that the results of the first retrieval are used, through data fusion and score-optimized keyword extraction, to enlarge the retrieval database;
the validity evaluation performed on the retrieved content by the multi-modal semantic data fusion subsystem comprises:
(1) Keyword recurrence score $p_k$: evaluated from the number of times the keywords appear in the text data;
(2) Timeliness score $p_t$: evaluates the value of the retrieved intelligence from the point of view of timeliness;
(3) Authority score $p_a$: judges the value of the e-th text for the user's needs from the point of view of the authority of the text source;
finally, the validity score of the e-th piece of intelligence in $A_{i-1}$ is
$$\mathrm{score} = p_k + p_t + p_a$$
Step S5: for the finally obtained level-$A_i$ intelligence knowledge base, a classified intelligence knowledge base is generated according to the source organizations, and the development trend of the field of interest to the user is obtained from the timeliness scores and word frequency statistics.
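The validity score above is a plain sum of the three component scores, so ranking reduces to a sort. The following is a minimal sketch only; the dictionary keys `p_k`, `p_t`, `p_a`, and `id`, and the sample values, are illustrative assumptions rather than part of the disclosed method.

```python
def rank_by_validity(items):
    """Sort intelligence items by score = p_k + p_t + p_a, highest first."""
    def score(item):
        return item["p_k"] + item["p_t"] + item["p_a"]
    return sorted(items, key=score, reverse=True)


# Hypothetical component scores for three pieces of intelligence.
items = [
    {"id": "a", "p_k": 0.2, "p_t": 0.5, "p_a": 1.0},
    {"id": "b", "p_k": 0.6, "p_t": 0.9, "p_a": 2.0},
    {"id": "c", "p_k": 0.1, "p_t": 0.1, "p_a": 0.0},
]
ranked_ids = [item["id"] for item in rank_by_validity(items)]
```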
Preferably, in step S2, the initial generalized search keyword library $A_0$ comprises:
$$\{A_0\} = Set_{up} \cup Set_{down} \cup Set_{syn}$$
where the hypernym set is $Set_{up} = \{u_1, u_2, u_3, \ldots, u_i\}$;
the hyponym set is $Set_{down} = \{d_1, d_2, d_3, \ldots, d_j\}$;
the synonym set is $Set_{syn} = \{s_1, s_2, s_3, \ldots, s_k\}$.
The cross-augmentation set is expressed as:
$$Set_{expand} = \{u_1 d_1, u_1 d_2, u_1 d_3, \ldots, u_2 d_1, u_2 d_2, \ldots, u_3 d_1, \ldots, u_i d_j, \ldots, u_i s_k\}$$
In any of the above schemes, preferably, in step S3, the level-$A_i$ intelligence library comprises multiple pieces of intelligence data, each comprising: the keyword used in the retrieval, the original link of the web page, the main domain name, the time the web page first went online, and the intelligence data in the web page.
In any of the above schemes, preferably, (1) for video data, frames are captured automatically from the video frame by frame using the Python cv2 library and added to the picture data of the corresponding column; the audio track is separated using the Python moviepy library and stored in the audio data of the corresponding column;
(2) for picture data and PDF data, the content is recognized by OCR using the Python Tesseract library, and the generated text is added to the text data of the same column;
(3) for audio data, a speech recognition API is called to convert the audio into text, which is added to the text data of the same column.
In any of the above schemes, preferably, the keyword recurrence scoring performed by the multi-modal semantic data fusion subsystem comprises the following steps:
let $a_l$ be the number of occurrences of the $l$-th keyword of $A_i$ in the $e$-th text data, and let AWN be the number of words in that text data; the keyword recurrence score $p_k$ of the $e$-th text data is then as follows: [equation not reproduced in the source text]
In any of the above schemes, preferably, the timeliness scoring performed by the multi-modal semantic data fusion subsystem comprises the following steps:
first, check whether any pieces of intelligence are exactly identical; if so, delete the duplicates from $A_{i-1}$ and extract them as a subset; then perform generalized deduplication on the deduplicated $A_{i-1}$; finally, score the timeliness of the intelligence in the deduplicated $A_{i-1}$.
In any of the above schemes, preferably, the generalized deduplication performed by the multi-modal semantic data fusion subsystem on the deduplicated $A_{i-1}$ comprises the following steps:
first, set the window length and overlap length according to the computer's storage capacity and the volume of the retrieved text data, and split the e-th text into a set $Set_{te}$ containing $i_e$ sentence-segment elements;
then, split each sentence-segment element of $Set_{te}$ into words, extract word stems, calculate word frequencies, and apply hash coding to obtain the compression code of the sentence-segment element;
finally, calculate the Hamming distance between the compression codes of the sentence-segment elements in the set, perform the similarity calculation, and extract the duplicate items as a subset.
In any of the above schemes, preferably, the timeliness scoring performed by the multi-modal semantic data fusion subsystem on the intelligence in the deduplicated $A_{i-1}$ comprises the following steps:
calculate the time span coefficient $T_{range}$ of each piece of intelligence in $A_{i-1}$;
from the time each web page first went online, extract the earliest online time of each piece of intelligence, assemble these into a vector, normalize it, and obtain the timeliness coefficient $T_{early}$;
calculate the timeliness score $p_t$ from the time span coefficient $T_{range}$ and the timeliness coefficient $T_{early}$ as:
$$p_t = k_{range} T_{range} + k_{early} T_{early}$$
where $k_{range}$ is the weight of the time span coefficient and $k_{early}$ is the weight of the timeliness coefficient.
In any of the above schemes, preferably, the data sources for the authority score $p_a$, classified according to the main domain name, comprise: news, forum, government, think tank, and technical.
The traffic information processing and sensing method based on multi-mode data fusion sensing provided by the embodiment of the invention has the following beneficial effects:
(1) By drawing on a mature large language model to associate hypernyms, hyponyms, and synonyms with the keyword entered by the user, an algorithm builds the keyword search library from these terms and widens the search scope around the user's keyword. This mitigates, to some extent, the low quality of collected intelligence caused by the user's limited familiarity with the field at the first search.
(2) Intelligence in the transportation field exists in many forms, and traditional crawlers can only download data without analyzing it. To address this, the invention builds a multi-modal data semantization system based on speech recognition and OCR, which converts multi-modal data into text data and makes full use of the various types of intelligence available.
(3) To address the differences between research questions in the transportation field, a multi-modal semantic data fusion sensing system is designed. For the two main research directions in the field, policy research and technical research, an algorithm for evaluating the value of transportation intelligence is designed from three angles: topic relevance, timeliness (time span and time priority), and the value of the information source. The semanticized intelligence data are deeply screened according to established knowledge of the transportation industry, and trend sensing in the two basic fields is achieved through value ranking, timeliness scoring, and word frequency statistics.
(4) Based on the industry trends analyzed by the multi-modal semantic data fusion sensing system, natural language processing is used to recommend keyword classes to the user after the initial search. The keyword library for the next search is then built in a semi-supervised manner according to the user's selection. This achieves timely, directed keyword expansion grounded in knowledge of the transportation industry, substantially widens the search scope, reduces the inclusion of invalid information, and improves search efficiency.
Drawings
Fig. 1 is a flowchart of a traffic information processing and sensing method based on multi-modal data fusion sensing according to an embodiment of the present invention.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the following examples.
As shown in fig. 1, the traffic information processing and sensing method based on multi-mode data fusion sensing in the embodiment of the invention comprises the following steps:
step S1, the user inputs the domain to which the information needs to be acquired, so as to determine the range of information acquisition.
Step S2: a general-purpose large language model is called to give the hypernym set $Set_{up}$, hyponym set $Set_{down}$, and synonym set $Set_{syn}$ of that domain, and the initial generalized search keyword library $A_0$ is generated on that basis.
The general-purpose large language model can be any open-source or commercial large language model; the keywords in the initial generalized search keyword library comprise
$$\{A_0\} = Set_{up} \cup Set_{down} \cup Set_{syn}$$
where the hypernym set is $Set_{up} = \{u_1, u_2, u_3, \ldots, u_i\}$;
the hyponym set is $Set_{down} = \{d_1, d_2, d_3, \ldots, d_j\}$;
the synonym set is $Set_{syn} = \{s_1, s_2, s_3, \ldots, s_k\}$.
The cross-augmentation set is expressed as:
$$Set_{expand} = \{u_1 d_1, u_1 d_2, u_1 d_3, \ldots, u_2 d_1, u_2 d_2, \ldots, u_3 d_1, \ldots, u_i d_j, \ldots, u_i s_k\}$$
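The set construction above can be sketched in a few lines. This is an illustration under assumptions: the three word lists are taken as already returned by the language model, the function name `build_initial_library` is hypothetical, and cross-augmented phrases are formed by joining a hypernym with a hyponym or synonym using a space.

```python
from itertools import product


def build_initial_library(set_up, set_down, set_syn, cross=False):
    """Return A_0 as the union of the three sets, optionally with Set_expand added."""
    library = set(set_up) | set(set_down) | set(set_syn)
    if cross:
        # Cross-augmentation: pair each hypernym u with every hyponym d and synonym s.
        for u, other in product(set_up, list(set_down) + list(set_syn)):
            library.add(f"{u} {other}")
    return library


a0 = build_initial_library(["transport"], ["highway"], ["traffic"])
a0_expanded = build_initial_library(["transport"], ["highway"], ["traffic"], cross=True)
```

Keeping the cross-augmentation optional reflects that the text defines $A_0$ as the plain union and introduces $Set_{expand}$ separately.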
Step S3: a crawler tool crawls data from the network according to the level-$A_i$ generalized search keyword library to construct the level-$A_i$ intelligence library.
In this step, the level-$A_i$ intelligence library comprises multiple pieces of intelligence data, each comprising: the keyword used in the retrieval, the original link of the web page, the main domain name, the time the web page first went online, and the intelligence data in the web page.
Specifically, the crawler tool acquires the text in the web page, downloads pictures, PDFs, audio, video, and similar data, and also reads the time the web page went online. The acquired data form the level-$A_i$ intelligence library. The intelligence data in the web page comprise text data, picture data, audio/video data, PDF publications, and the like.
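One way to picture a single record of the intelligence library is as a small data class. The sketch below is an assumption about layout, not the patented storage format: the class and field names are hypothetical, and the main domain name is approximated by the host part of the original link.

```python
from dataclasses import dataclass, field
from datetime import date
from urllib.parse import urlparse


@dataclass
class IntelligenceItem:
    keyword: str          # keyword used in the retrieval
    url: str              # original link of the web page
    first_online: date    # time the web page first went online
    text: str = ""        # text data found in the page
    pictures: list = field(default_factory=list)
    audio_video: list = field(default_factory=list)
    pdfs: list = field(default_factory=list)

    @property
    def main_domain(self) -> str:
        # Host part of the link; a real system might also strip subdomains.
        return urlparse(self.url).netloc


item = IntelligenceItem("highway", "https://example.org/reports/2023", date(2023, 1, 1))
```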
Step S4: the crawled data are imported into the multi-modal data fusion sensing system to generate the level-$A_i$ generalized search keyword library, and step S3 is repeated until i is greater than the user-defined number of iterations, at which point the loop exits.
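The crawl-and-fuse loop of steps S3 and S4 can be outlined as follows. This is a sketch under stated assumptions: `crawl` and `fuse` are hypothetical callables standing in for the crawler tool and the fusion sensing system, and the counter is assumed to start at 0 and stop once it exceeds the user-defined limit.

```python
def run_iterations(initial_library, crawl, fuse, max_iterations):
    """Alternate crawling (S3) and keyword-library regeneration (S4)."""
    library = set(initial_library)
    repositories = []
    i = 0
    while i <= max_iterations:      # exit once i is greater than the limit
        repo = crawl(library)       # build the intelligence library for this level
        repositories.append(repo)
        library = fuse(repo)        # derive the next keyword library
        i += 1
    return repositories, library


# Stub callables used only to exercise the control flow.
stub_crawl = lambda lib: sorted(lib)
stub_fuse = lambda repo: set(repo) | {f"kw{len(repo)}"}
repos, final_library = run_iterations({"a"}, stub_crawl, stub_fuse, max_iterations=2)
```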
In an embodiment of the present invention, the multi-modal data fusion sensing system comprises a multi-modal data semantization subsystem and a multi-modal semantic data fusion subsystem.
The multi-modal data semantization subsystem performs recognition and conversion on video data, picture data, PDF data, and audio data respectively, and generates the processed database $A_{i-1}$.
(1) For video data, frames are first captured automatically from the video frame by frame using the Python cv2 library and added to the picture data of the corresponding column; the audio track is separated using the Python moviepy library and saved to the audio data of the corresponding column.
(2) For picture data and PDF data, the content is recognized by OCR using the Python Tesseract library, and the generated text is added to the text data of the same column.
(3) For audio data, a speech recognition API is called to convert the audio into text, which is added to the text data of the same column.
(4) The processed database is named $A_{i-1}$.
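The routing of each modality into text, as listed above, might be organized like this. The sketch deliberately injects the converters as parameters: `ocr`, `speech_to_text`, and `split_video` are hypothetical callables that a real system could back with OCR, a speech recognition API, and video frame and audio extraction; the dictionary layout of `item` is likewise an assumption.

```python
def semanticize(item, ocr, speech_to_text, split_video):
    """Convert all modalities of one intelligence item into a single text string."""
    pictures = list(item.get("pictures", []))
    audio = list(item.get("audio", []))
    for video in item.get("video", []):
        frames, track = split_video(video)   # video -> picture frames + audio track
        pictures.extend(frames)
        audio.append(track)
    parts = [item.get("text", "")]
    parts += [ocr(p) for p in pictures + list(item.get("pdfs", []))]  # OCR pictures and PDFs
    parts += [speech_to_text(a) for a in audio]                       # transcribe audio
    return "\n".join(p for p in parts if p)


# Stub converters that only label their input, to show the routing.
merged = semanticize(
    {"text": "t", "pictures": ["p"], "pdfs": ["d"], "audio": ["a"], "video": ["v"]},
    ocr=lambda x: f"ocr({x})",
    speech_to_text=lambda x: f"stt({x})",
    split_video=lambda v: ([f"{v}#0"], f"{v}.wav"),
)
```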
The multi-modal semantic data fusion subsystem constructs a scoring system according to the characteristics of the transportation industry, evaluates the validity of the retrieved content, generates a validity score for each piece of intelligence data, sorts and processes the intelligence according to the scores, displays the top-ranked words to the user for screening, generates a new generalized search keyword library $A_{i+1}$ from the user's screening result, and jumps to step S3.
The validity evaluation performed on the retrieved content by the multi-modal semantic data fusion subsystem comprises the following:
(1) Keyword recurrence score $p_k$: evaluated from the number of times the keywords appear in the text data.
Let $a_l$ be the number of occurrences of the $l$-th keyword of $A_i$ in the $e$-th text data, and let AWN be the number of words in that text data; the keyword recurrence score $p_k$ of the $e$-th text data is then as follows: [equation not reproduced in the source text]
(2) Timeliness score $p_t$: evaluates the value of the retrieved intelligence from the point of view of timeliness.
First, check whether any pieces of intelligence are exactly identical; if so, delete the duplicates from $A_{i-1}$ and extract them as a subset. Then perform generalized deduplication on the deduplicated $A_{i-1}$. Finally, score the timeliness of the intelligence in the deduplicated $A_{i-1}$.
The generalized deduplication performed by the multi-modal semantic data fusion subsystem on the deduplicated $A_{i-1}$ comprises the following steps:
First, chunk segmentation is performed. The sliding-window length a and overlap length b are determined according to the computer's storage capacity and the volume of the retrieved text data, where a is much greater than b. The e-th text, of length d, is thereby split into a set $Set_{te}$ containing $i_e$ sentence-segment elements.
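A sliding window with length a and overlap b advances by a - b each step. The following sketch assumes the window is measured in characters, which the text does not specify; the function name is hypothetical.

```python
def sliding_windows(text, a, b):
    """Split text into segments of length a that overlap by b characters."""
    if not 0 <= b < a:
        raise ValueError("require 0 <= b < a")
    step = a - b
    segments = []
    start = 0
    while True:
        segments.append(text[start:start + a])
        if start + a >= len(text):   # the window has reached the end of the text
            break
        start += step
    return segments


segments = sliding_windows("abcdefghij", a=4, b=1)
```

With a = 4 and b = 1 the ten-character sample yields three segments, each sharing one character with its neighbor.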
Then, compression coding is performed. Each sentence-segment element of $Set_{te}$ is split into words, word stems are extracted, word frequencies are calculated, and hash coding is applied to obtain the compression code of the sentence-segment element.
Specifically, based on an open-source corpus, the Python NLTK library is used to split each sentence-segment element of $Set_{te}$ into words and extract word stems, which removes the influence of inflection and stop words. Word frequencies are calculated, and an ordinary hash transformation is applied to the words in the top 20% by frequency to obtain a hash code for each word. The hash is a binary code; each 0 bit is mapped to -1 and each 1 bit is kept as 1, for example 00101 -> [-1, -1, 1, -1, 1]. The transformed results are stored as row vectors, and summing these row vectors column by column gives the compression code of the sentence-segment element.
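The compression coding step resembles a SimHash-style fingerprint. The sketch below makes several assumptions that the text leaves open: whitespace tokenization replaces NLTK tokenization and stemming, MD5 stands in for the "ordinary hash transformation", codes are 16 bits long, and at least one word is always retained.

```python
import hashlib
from collections import Counter


def word_bits(word, n_bits=16):
    """Deterministic binary hash code of a word (MD5 is an assumed choice, n_bits <= 32)."""
    value = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")
    return [(value >> k) & 1 for k in range(n_bits)]


def to_signed(bits):
    """Map each 0 bit to -1 and keep each 1 bit as 1."""
    return [1 if bit else -1 for bit in bits]


def compress(segment, top_fraction=0.2, n_bits=16):
    """Column-wise sum of the signed hash codes of the most frequent words."""
    freq = Counter(segment.lower().split())
    keep = max(1, round(len(freq) * top_fraction))   # top 20% of distinct words by default
    rows = [to_signed(word_bits(w, n_bits)) for w, _ in freq.most_common(keep)]
    return [sum(column) for column in zip(*rows)]


code = compress("road road bridge tunnel road bridge")
full_code = compress("road road bridge tunnel road bridge", top_fraction=1.0)
```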
Finally, the Hamming distance between the compression codes of the sentence-segment elements in the set is calculated, the similarity calculation is performed, and the duplicate items are extracted as a subset.
Specifically, the Hamming distance between the compression codes of the sentence-segment elements in the set is calculated (the two codes are XORed and the number of 1s in the result is counted). The results are sorted in reverse order, and the first c% of the pairs are regarded as similar (c is set by the user). On the basis of this similarity, duplicate items are deleted from $A_{i-1}$ and extracted as a subset.
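The XOR-and-count definition of Hamming distance translates directly to integers. In this sketch the codes are assumed to be binary fingerprints held as Python integers, and the pairs with the smallest distances are assumed to be the ones treated as similar; the sort direction is an interpretation, since the text only says the results are sorted in reverse order.

```python
from itertools import combinations


def hamming(code_a, code_b):
    """XOR the two codes and count the 1 bits in the result."""
    return bin(code_a ^ code_b).count("1")


def similar_pairs(codes, c_percent):
    """Return the first c% of index pairs, ordered by increasing Hamming distance."""
    pairs = sorted(
        combinations(range(len(codes)), 2),
        key=lambda p: hamming(codes[p[0]], codes[p[1]]),
    )
    keep = int(len(pairs) * c_percent / 100)
    return pairs[:keep]


distance = hamming(0b00101, 0b00111)
closest = similar_pairs([0b0000, 0b0001, 0b1111], c_percent=50)
```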
The multi-modal semantic data fusion subsystem then scores the timeliness of the intelligence in the deduplicated $A_{i-1}$. The timeliness of the e-th piece of intelligence in the deduplicated $A_{i-1}$ is evaluated mainly from how long the intelligence persists over time and when it first appeared. For policy research, the wider the time range covered by a policy document in the intelligence, the greater its supporting effect on the industry and the higher its timeliness score. For technical research the opposite holds: the larger the time span, the more mature the research content and the less innovative it is. The evaluation must therefore be tailored to the user's needs.
Specifically, the timeliness score comprises the following steps:
(1) Calculate the time span coefficient $T_{range}$ of each piece of information in $A_{i-1}$.
The time span of each piece of information in $A_{i-1}$ is calculated from the initial online time of the webpage. For information without repeated items, the time span is 0. For information with a subset of similar items, the time span is the difference (in days) between the latest and the earliest information in that subset. The time spans of all information items are assembled into a new vector and normalized, and the resulting values are the time span coefficients.
(2) Extract the earliest online time of each piece of information from the initial online time of the webpage, assemble these times into a vector, and normalize it to obtain the aging coefficient $T_{early}$.
The timeliness score $p_t$ is calculated from the time span coefficient $T_{range}$ and the aging coefficient $T_{early}$ as:
$$p_t = k_{range} T_{range} + k_{early} T_{early}$$
where $k_{range}$ is the weight of the time span coefficient and $k_{early}$ is the weight of the aging coefficient.
For high and new technology information tasks, $k_{range}=0.2$ and $k_{early}=0.8$; for mature technology information tasks, $k_{range}=0.6$ and $k_{early}=0.4$; for policy support information tasks, $k_{range}=0.7$ and $k_{early}=0.3$; for general policy tasks, $k_{range}=0.6$ and $k_{early}=0.4$.
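The timeliness computation can be sketched as follows. The weights are the ones quoted in the text; everything else is an assumption made for illustration: min-max normalization (the text only says "normalization"), the convention that a later first appearance gives a larger $T_{early}$, and the task names used as dictionary keys.

```python
from datetime import date

# Weights (k_range, k_early) quoted in the text; the key names are illustrative.
WEIGHTS = {
    "high_tech": (0.2, 0.8),
    "mature_tech": (0.6, 0.4),
    "policy_support": (0.7, 0.3),
    "general_policy": (0.6, 0.4),
}


def min_max(values: list[float]) -> list[float]:
    """Min-max normalization (assumed; the text does not name the method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def timeliness_scores(groups: list[list[date]], task: str) -> list[float]:
    """groups[e] holds the online dates of item e and of its near-duplicates."""
    k_range, k_early = WEIGHTS[task]
    # Time span in days; an item with no duplicates has a span of 0.
    spans = [float((max(g) - min(g)).days) for g in groups]
    # Earliest online date per item, as an ordinal day number.
    earliest = [float(min(g).toordinal()) for g in groups]
    t_range = min_max(spans)
    t_early = min_max(earliest)
    return [k_range * r + k_early * e for r, e in zip(t_range, t_early)]
```

With the `high_tech` weights, an item that first appeared recently and has no long-running duplicates scores higher than an older, widely repeated one, which matches the reasoning given for technical research.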
(3) Authority score $p_a$: the value of the e-th text under the user's requirements is judged from the perspective of the authority of the text source.
In an embodiment of the invention, the authority score $p_a$ is assigned according to the data source, identified by the main domain name; the sources include news, forum, government, think tank, and technical sources. It should be noted that the data sources for the authority score $p_a$ are not limited to these examples; other types may be included and are adjusted and set by the user as needed, which is not described further here.
Specifically, the data sources can be classified by main domain name into five main types: news, forum, government, think tank, and technical. The value of information from different sources differs markedly between research directions. The transportation industry mainly involves technical research and policy research, and the scores of the five source types in these research directions are shown in Table 1 below. In practical application, the institutional source of a data item is determined from its main domain name entry and a preset domain name-to-institution mapping library. In Table 1, the five source types are: 1, news class $p_{news}$; 2, forum class $p_{blog}$; 3, government class $p_{government}$; 4, think tank class $p_{thinktank}$; 5, technical class $p_{tech}$.
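Table 1. Scoring table for the five source types

| Research direction | $p_{news}$ | $p_{blog}$ | $p_{government}$ | $p_{thinktank}$ | $p_{tech}$ |
|---|---|---|---|---|---|
| Technical class | 0.2 | 0.2 | 0.4 | 0.6 | 0.8 |
| Policy class | 0.6 | 0.6 | 0.8 | 0.6 | 0.2 |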
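The lookup implied by Table 1 can be sketched as below. The score values are taken from the table, while the domain-to-source mapping, the example hostnames, and the default score for unknown domains are hypothetical stand-ins for the preset domain name-to-institution mapping library mentioned in the text.

```python
from urllib.parse import urlparse

# Scores from Table 1, keyed by research direction and source type.
AUTHORITY = {
    "technical": {"news": 0.2, "blog": 0.2, "government": 0.4,
                  "thinktank": 0.6, "tech": 0.8},
    "policy": {"news": 0.6, "blog": 0.6, "government": 0.8,
               "thinktank": 0.6, "tech": 0.2},
}

# Hypothetical mapping; a real mapping library would be curated in advance.
DOMAIN_TO_SOURCE = {
    "news.example.com": "news",
    "forum.example.org": "blog",
    "transport.example.gov": "government",
}


def authority_score(url: str, direction: str, default: float = 0.0) -> float:
    """Return p_a for a URL; unknown domains get an assumed default score."""
    host = urlparse(url).hostname or ""
    source = DOMAIN_TO_SOURCE.get(host)
    if source is None:
        return default
    return AUTHORITY[direction][source]
```

For example, a government page scores 0.8 when the user is doing policy research but only 0.4 for technical research, reflecting the two rows of the table.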
(4) Finally, the validity score of the e-th piece of information in $A_{i-1}$ is:
$$score = p_k + p_t + p_a$$
The pieces of information are sorted by score, and the first e×b of them are retained (b is a user-defined proportion). Phrases are then extracted using the Python NLTK library, their word frequencies are calculated and sorted, and the top c phrases are retained (c is a user-defined proportion). The phrases are classified by the Hamming distance of the segments they come from, as obtained during the similarity calculation. The ten words with the highest word frequency in each class are shown to the user as representatives of that subclass, so that the user can quickly screen the keywords. A new generalized search keyword library $A_{i+1}$ is generated from the screened result, and the process jumps to step S2. $A_{i+1}$ replaces $A_0$, so that the results of the first retrieval are used, through data fusion and scoring optimization, to extract keywords and enlarge the retrieval database.
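The scoring and retention step can be sketched as follows. The sum of the three partial scores follows the formula in the text; the dictionary layout of each item and the use of `math.floor` to turn e×b into an integer count are assumptions for illustration.

```python
import math


def validity_score(p_k: float, p_t: float, p_a: float) -> float:
    """score = p_k + p_t + p_a."""
    return p_k + p_t + p_a


def keep_top(items: list[dict], b: float) -> list[dict]:
    """Sort by score in descending order and keep the first e*b items."""
    ranked = sorted(items, key=lambda it: it["score"], reverse=True)
    keep = math.floor(len(ranked) * b)  # rounding rule is an assumption
    return ranked[:keep]
```

The retained items are the input to the phrase extraction and word frequency statistics that produce the candidate keywords shown to the user.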
Step S5: for the finally obtained $A_i$-level information knowledge base, a classified information knowledge base is generated according to institutional source, and the development trend of the field the user is interested in is obtained from the timeliness scoring and word frequency statistics.
The traffic information processing and sensing method based on multi-modal data fusion sensing addresses the problems that existing information collection systems collect information in the transportation field inefficiently, yield little effective information, and struggle to provide researchers and policy makers in related industries with useful information on industry development trends. It provides an effective screening mechanism that automatically scores and screens the acquired information, improving the efficiency of related work. The data are cleaned, classified, and refined by effective means, and data of multiple modalities are fused and used during retrieval, which in turn improves retrieval efficiency and accuracy and makes it possible to follow developments in the foreign transportation field in a timely manner.
The traffic information processing and sensing method based on multi-mode data fusion sensing provided by the embodiment of the invention has the following beneficial effects:
(1) By drawing on a mature large language model, the hypernyms and hyponyms of the keywords input by the user are associated, and a keyword search library is built algorithmically on that basis. The search range is thus enlarged around the user's keywords, which can to some extent reduce the problem of low-quality collected information caused by the user's insufficient familiarity with the field during the first search.
(2) Information in the transportation field exists in many forms, and a traditional crawler can only download data without analyzing it. To address this, the invention constructs a multi-modal data semanticization system based on speech recognition and OCR technology that converts multi-modal data into text data, making full use of existing information of various types.
(3) To account for the differences between research problems in the transportation field, a multi-modal semantic data fusion perception system is designed around the two main directions of policy research and technical research. An algorithm for evaluating the value of information in the transportation field is designed from three angles: topic relevance, information timeliness (time span and time priority), and the value of the information source. The semanticized information data are deeply screened according to established knowledge of the transportation industry, and trend perception in the two basic fields is realized through value ranking, timeliness scoring, and word frequency statistics.
(4) Based on the industry trends analyzed by the multi-modal semantic data fusion perception system, recommended keyword classes are offered to the user after the initial search by means of natural language processing, so that the keyword library for the next search is constructed in a semi-supervised manner according to the user's selection. This realizes timely, targeted keyword expansion grounded in knowledge of the transportation industry, substantially widens the search range, reduces the recording of invalid information, and improves search efficiency.
It should be noted that the technical scheme of the invention involves multiple parameters, and its beneficial effects and notable progress are obtained by considering the synergy among these parameters as a whole. In addition, the value ranges of all parameters in the technical scheme were obtained through a large number of tests. The inventors recorded a large amount of test data for each parameter and for combinations of parameters, but owing to space limitations the specific test data are not disclosed here.
It will be appreciated by those skilled in the art that the traffic information processing and sensing method based on multi-modal data fusion sensing of the present invention includes any combination of the parts described in the summary and detailed description above and shown in the drawings. Owing to space limitations and for brevity of the description, these combinations are not described one by one. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (9)

1. A traffic information processing and sensing method based on multi-mode data fusion sensing is characterized by comprising the following steps:
step S1, inputting the field to which the information needs to be acquired by a user so as to determine the range of acquiring the information;
step S2, calling a general-purpose large language model to give the hypernym set $Set_{up}$, the hyponym set $Set_{down}$, and the synonym set $Set_{syn}$ of the field, and generating the initial generalized search keyword library $A_0$ on that basis;
step S3, using a crawler tool to crawl data on the network according to the $A_i$-level generalized search keyword library, so as to construct the $A_i$-level information library;
step S4, importing the crawled data into a multi-modal data fusion sensing system to generate the $A_i$-level generalized search keyword library, and repeating step S3 until i is larger than the number of iterations defined by the user, at which point the loop is exited; the multi-modal data fusion sensing system includes a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem, wherein,
the multi-modal data semanticization subsystem is used for rapidly performing recognition and conversion processing on video data, picture data, pdf data, and audio data respectively, and generating a processed database $A_{i-1}$;
the multi-modal semantic data fusion subsystem is used for constructing a scoring system according to the characteristics of the transportation industry, performing validity evaluation on the retrieved content $A_{i-1}$ obtained by the multi-modal data semanticization subsystem, generating a validity score for each piece of information data, sorting and processing the information according to the obtained scores, displaying the top-ranked words to the user for screening, generating a new generalized search keyword library $A_{i+1}$ from the user's screened result, jumping to step S2, and replacing $A_0$ with $A_{i+1}$, so that the results of the first retrieval are used, through data fusion and scoring optimization, to extract keywords and enlarge the retrieval database;
the validity evaluation performed by the multi-modal semantic data fusion subsystem on the retrieved content comprises:
(1) keyword recurrence score $p_k$: evaluating from the number of times the keywords appear in the text data;
(2) timeliness score $p_t$: evaluating the value of the retrieved information from the perspective of timeliness;
(3) authority score $p_a$: judging the value of the e-th text under the user's requirements from the perspective of the authority of the text source;
finally, the validity score of the e-th piece of information in $A_{i-1}$ is:
$$score = p_k + p_t + p_a$$
step S5, for the finally obtained $A_i$-level information knowledge base, generating a classified information knowledge base according to institutional source, and obtaining the development trend of the field the user is interested in from the timeliness scoring and word frequency statistics.
2. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S2, said initial generalized search keyword library $A_0$ includes:
$$\{A_0\} = Set_{up} \cup Set_{down} \cup Set_{syn}$$
wherein the hypernym set is $Set_{up} = \{u_1, u_2, u_3, \ldots, u_i\}$;
the hyponym set is $Set_{down} = \{d_1, d_2, d_3, \ldots, d_j\}$;
the synonym set is $Set_{syn} = \{s_1, s_2, s_3, \ldots, s_k\}$;
the cross-augmentation set is expressed as:
$$Set_{expand} = \{u_1 d_1, u_1 d_2, u_1 d_3, \ldots, u_2 d_1, u_2 d_2, \ldots, u_3 d_1, \ldots, u_i d_j, \ldots, u_i s_k\}$$
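As an illustration of the set construction in this claim, the following sketch builds the union $A_0$ and a cross-augmentation list. Joining the two words of each pair with a space, and pairing every hypernym with every hyponym and every synonym, are assumptions about how the products $u_i d_j$ and $u_i s_k$ are meant to be formed; the function names are illustrative.

```python
from itertools import product


def build_keyword_library(up, down, syn) -> set[str]:
    """A_0 = Set_up ∪ Set_down ∪ Set_syn."""
    return set(up) | set(down) | set(syn)


def cross_expand(up, down, syn) -> list[str]:
    """Pair each hypernym with every hyponym and every synonym."""
    return [f"{u} {t}" for u, t in product(up, list(down) + list(syn))]
```

For a field with one hypernym, two hyponyms, and one synonym, `cross_expand` returns three combined search phrases in addition to the four single keywords in the union.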
3. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S3, said $A_i$-level information library comprises a plurality of pieces of information data, and each piece of information data comprises: the keyword used during retrieval, the original link of the webpage, the main domain name, the initial online time of the webpage, and the information data in the webpage.
4. The traffic information processing and sensing method based on multi-modal data fusion sensing as claimed in claim 1, wherein,
(1) For video data, images are automatically captured frame by frame based on the Python cv2 library, and the captured frames are added to the picture data of the corresponding column; the audio is separated from the video based on the Python moviepy library, and the audio data are stored in the corresponding column;
(2) For picture data and pdf data, recognition is performed by an OCR method based on the Python Tesseract library, and the generated text content is added to the text data in the same column;
(3) For audio data, a speech recognition API is called to convert the audio into text, which is added to the text data in the same column.
5. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the multi-modal semantic data fusion subsystem performs keyword recurrence scoring, comprising the steps of:
let the number of occurrences of the l-th keyword of $A_i$ in the e-th text data be $a_l$, and let the number of words of the text data be AWN; the keyword recurrence score $p_k$ of the e-th text data is as follows:
6. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the multi-modal semantic data fusion subsystem performs timeliness scoring, comprising the steps of:
first, judging whether any of the e pieces of information are exactly equal and, if so, deleting the repeated items from $A_{i-1}$ and extracting them as a subset; then, performing generalized deduplication on the deduplicated $A_{i-1}$; finally, performing timeliness scoring on the information in the deduplicated $A_{i-1}$.
7. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the generalized deduplication performed by said multi-modal semantic data fusion subsystem on the deduplicated $A_{i-1}$ comprises the following steps:
first, setting a window length and an overlap length according to the size of the computer storage space and the volume of the retrieved text data, and splitting the e-th text into a set $Set_{te}$ of $i_e$ sentence-segment elements;
then, splitting each sentence-segment element of $Set_{te}$ into words, performing stemming, calculating word frequencies, and performing hash coding to obtain the compressed code of each sentence-segment element;
finally, calculating the Hamming distance between the compressed codes of the sentence-segment elements in the set as a similarity measure, and extracting the repeated items as a subset.
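The windowed splitting in the first step of this claim can be sketched as below. Measuring the window length and overlap length in words, and letting the last segment be shorter than the window, are assumptions; the claim only says that the two lengths are chosen from the storage space and text volume.

```python
def split_into_segments(words: list[str], window: int, overlap: int) -> list[list[str]]:
    """Split a word list into overlapping segments of at most `window` words."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(words[start:start + window])
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return segments
```

Each returned segment plays the role of one sentence-segment element, so the number of segments corresponds to $i_e$ for that text.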
8. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the timeliness scoring performed by said multi-modal semantic data fusion subsystem on the information in the deduplicated $A_{i-1}$ comprises the following steps:
calculating the time span coefficient $T_{range}$ of each piece of information in $A_{i-1}$;
extracting the earliest online time of each piece of information from the initial online time of the webpage, assembling these times into a vector, and normalizing it to obtain the aging coefficient $T_{early}$;
calculating the timeliness score $p_t$ from the time span coefficient $T_{range}$ and the aging coefficient $T_{early}$ as:
$$p_t = k_{range} T_{range} + k_{early} T_{early}$$
wherein $k_{range}$ is the weight of the time span coefficient and $k_{early}$ is the weight of the aging coefficient.
9. The traffic information processing and sensing method based on multi-modal data fusion sensing as claimed in, wherein the data sources for the authority score $p_a$, selected according to the main domain name, comprise: news, forum, government, think tank, and technical sources.
CN202311376178.1A 2023-10-23 2023-10-23 Traffic information processing and sensing method based on multi-mode data fusion sensing Pending CN117370932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311376178.1A CN117370932A (en) 2023-10-23 2023-10-23 Traffic information processing and sensing method based on multi-mode data fusion sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311376178.1A CN117370932A (en) 2023-10-23 2023-10-23 Traffic information processing and sensing method based on multi-mode data fusion sensing

Publications (1)

Publication Number Publication Date
CN117370932A true CN117370932A (en) 2024-01-09

Family

ID=89407318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311376178.1A Pending CN117370932A (en) 2023-10-23 2023-10-23 Traffic information processing and sensing method based on multi-mode data fusion sensing

Country Status (1)

Country Link
CN (1) CN117370932A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118210383A (en) * 2024-05-21 2024-06-18 南通亚森信息科技有限公司 Information input system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination