CN117370932A - Traffic information processing and sensing method based on multi-mode data fusion sensing - Google Patents
- Publication number
- CN117370932A (CN202311376178.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a traffic information processing and sensing method based on multi-mode data fusion sensing, which comprises the following steps: the user inputs the field in which information needs to be acquired, so as to determine the scope of information acquisition; a general large language model is called to give the hypernym set Set_{up}, hyponym set Set_{down}, and synonym set Set_{syn} of that field, and an initial A_0 generalized search keyword library is generated on this basis; a crawler tool crawls data on the network according to the A_i-level generalized search keyword library to construct an A_i-level intelligence library; the crawled data is imported into a multi-modal data fusion sensing system to generate an A_i-level generalized search keyword library, and the iteration count is incremented until i is greater than the number of iterations defined by the user; for the finally obtained A_i-level intelligence knowledge base, a classified intelligence knowledge base is generated according to the source organizations, and the development trend of the field the user is concerned with is obtained from the timeliness scores and word-frequency statistics. The invention screens the obtained information through an effective screening mechanism and improves the efficiency of related work.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a traffic information processing and sensing method based on multi-mode data fusion sensing.
Background
For the transportation industry, research content includes not only materials, structures, performance, patents, and algorithms from a technical point of view, but also top-level design information such as national strategies, regulations, industry specifications, and the organizational structure of highway-related departments in each country, as well as the pilot programs and deployments of various new and advanced technologies. Such information reflects both a country's level of development in the transportation field and the worldwide development trend of the industry, and is therefore key intelligence in the transportation engineering field.
However, this information exists on the internet in many forms, such as text, pictures, publications, audio recordings, and video, and a traditional crawler ignores everything other than text. Moreover, the volume of information is huge: when content is collected with traditional crawler technology, screening it manually item by item is like looking for a needle in a haystack. With the current growth of self-media, a large number of self-media creators in the traffic field have begun to publish videos or PDF publications that track a particular subdivision of the transportation industry and offer distinctive insight. This is in effect high-quality information that has already been screened, but it is difficult to obtain with traditional information acquisition methods. In addition, because the development of the traffic engineering industry is inseparable from national infrastructure construction, research is not limited to technology but also covers policies, regulations, and similar directions. The value of the same information differs markedly between research directions, so an effective screening mechanism is needed to screen the obtained information and improve the efficiency of related work.
At present there is no system that can collect, screen, and sense trends in information for the transportation industry. Conventional information collection and trend sensing rely either on extensive manual searching or on manual screening after a crawler has captured the information, which is time-consuming and labor-intensive, with low retrieval efficiency and much duplicated content.
(1) In terms of search method, existing crawlers can only search a conventional search engine or user-specified domain names using the keywords entered by the user. The information obtained this way depends heavily on how familiar the user is with the field being studied, and everything the crawler collects is matched against the regular expressions entered by the user. In the transportation industry the same entity can be expressed in different ways, which noticeably affects search results. The results of a conventional crawler therefore have low fault tolerance and high repetition, and the practical effect is unsatisfactory.
(2) In terms of data types, a crawler can store text, picture, publication, and audio/video data, but in the transportation field the data types are rich and each is large in volume. Manually picking out the useful parts is like looking for a needle in a haystack. There is no effective means of cleaning, classifying, and refining the data, or of fusing the data of multiple modalities and feeding it back into the search process to improve search efficiency and accuracy.
(3) The result of a conventional crawler is often just a list of web page titles, data, and original links. The content is cluttered and does not directly show the development trend of the industry; a great deal of manual screening and verification is needed, and it is hard to keep up with developments in the transportation field abroad in a timely way.
(4) Because the development of the traffic engineering industry is inseparable from national infrastructure construction, research is not limited to technology but also covers policies, regulations, and similar directions. The value of the same information differs markedly between research directions, so an effective mechanism is needed to automatically score and screen the obtained information by value and improve the efficiency of related work. Existing crawler technology has no such capability.
Therefore, in terms of both data acquisition and data screening, current technology cannot meet the need for effective information acquisition in the transportation field. A new information collection and trend sensing method based on multi-modal big data fusion sensing is urgently needed for the transportation field, to meet the industry's current needs for information collection, screening, and trend sensing.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a traffic information processing and sensing method based on multi-mode data fusion sensing, which screens the obtained information through an effective screening mechanism and improves the efficiency of related work.
The embodiment of the invention provides a traffic information processing and sensing method based on multi-mode data fusion sensing, which comprises the following steps:
step S1, the user inputs the field in which information needs to be acquired, so as to determine the scope of information acquisition;
step S2, calling a general large language model to give the hypernym set Set_{up}, hyponym set Set_{down}, and synonym set Set_{syn} of the field, and generating an initial A_0 generalized search keyword library on this basis;
step S3, crawling data on the network with a crawler tool according to the A_i-level generalized search keyword library to construct an A_i-level intelligence library;
step S4, importing the crawled data into a multi-modal data fusion sensing system to generate an A_i-level generalized search keyword library, and repeating step S3 until i is greater than the number of iterations defined by the user, at which point the loop exits; the multi-modal data fusion sensing system includes a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem, wherein
the multi-modal data semanticization subsystem is used for rapidly performing recognition and conversion processing on video data, picture data, pdf data, and audio data respectively, and generating a processed database A_{i-1};
the multi-modal semantic data fusion subsystem is used for constructing a scoring system according to the characteristics of the transportation industry, evaluating the validity of the retrieved content A_{i-1} obtained by the multi-modal data semanticization subsystem, generating a validity score for each piece of intelligence data, sorting the intelligence according to the scores obtained, displaying the top-ranked words to the user for screening, generating a new generalized search keyword library A_{i+1} from the user's screening result, jumping to step S2, and substituting A_{i+1} for A_0, so that the results of the first retrieval are used, through data fusion and score-optimized keyword extraction, to enlarge the retrieval database;
the validity evaluation of the retrieved content by the multi-modal semantic data fusion subsystem comprises:
(1) Keyword recurrence score p_k: evaluated from the number of times the keywords appear in the text data;
(2) Timeliness score p_t: the value of the retrieved information is evaluated in terms of timeliness;
(3) Authority score p_a: the value of the e-th text under the user's requirements is judged in terms of the authority of the text source;
the final validity score of the e-th piece of information in A_{i-1} is
score = p_k + p_t + p_a;
Step S5, for the finally obtained A_i-level intelligence knowledge base, generating a classified intelligence knowledge base according to the source organizations, and obtaining the development trend of the field the user is concerned with from the timeliness scores and word-frequency statistics.
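The scoring and sorting part of step S4 can be sketched as follows. This is a minimal illustration only: the `ScoredItem` class, the `rank_items` helper, the example URLs, and the numeric values are hypothetical, and the three component scores are assumed to have been computed already.

```python
from dataclasses import dataclass


@dataclass
class ScoredItem:
    """One piece of intelligence with its three component scores (hypothetical container)."""
    url: str
    p_k: float  # keyword recurrence score
    p_t: float  # timeliness score
    p_a: float  # authority score

    @property
    def score(self) -> float:
        # score = p_k + p_t + p_a, as stated in step S4
        return self.p_k + self.p_t + self.p_a


def rank_items(items):
    """Return the items sorted from highest to lowest score."""
    return sorted(items, key=lambda item: item.score, reverse=True)


# Made-up values, purely to show the ordering.
items = [
    ScoredItem("https://example.org/a", 0.2, 0.5, 0.1),
    ScoredItem("https://example.org/b", 0.6, 0.3, 0.4),
]
ranked = rank_items(items)
```

The top of `ranked` is what would be shown to the user for screening before the next keyword library is built.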
Preferably, in step S2, the initial A_0 generalized search keyword library comprises:
{A_0} = Set_{up} ∪ Set_{down} ∪ Set_{syn},
wherein the hypernym set is: Set_{up} = {u_1, u_2, u_3, ..., u_i};
the hyponym set is: Set_{down} = {d_1, d_2, d_3, ..., d_j};
the synonym set is: Set_{syn} = {s_1, s_2, s_3, ..., s_k};
and the cross-augmentation set is expressed as:
Set_{expand} = {u_1 d_1, u_1 d_2, u_1 d_3, ..., u_2 d_1, u_2 d_2, ..., u_3 d_1, ..., u_i d_j, ..., u_i s_k}.
In any of the above schemes, preferably, in step S3, the A_i-level intelligence library comprises a plurality of pieces of intelligence data, each of which comprises: the keywords used in the retrieval, the original link of the web page, the main domain name, the initial online time of the web page, and the intelligence data in the web page.
In any of the above schemes, preferably, (1) for video data, frames are automatically captured from the video frame by frame based on the Python cv2 library, and the captured frames are added to the picture data in the corresponding column; the audio track is separated from the video based on the Python moviepy library and saved to the audio data in the corresponding column;
(2) for picture data and pdf data, the content is recognized by OCR based on the Python Tesseract library, and the generated text is added to the text data in the same column;
(3) for audio data, a speech recognition API is called to convert the audio into text, which is added to the text data in the same column.
In any of the above schemes, preferably, the keyword recurrence scoring by the multi-modal semantic data fusion subsystem comprises the following steps:
let the number of occurrences of the l-th keyword of A_i in the e-th text data be a_l, and let the number of words in that text data be AWN; the keyword recurrence score p_k of the e-th text data is then as follows:
[formula not reproduced in the source text]
In any of the above schemes, preferably, the timeliness scoring by the multi-modal semantic data fusion subsystem comprises the following steps:
first, determining whether any pieces of information are exactly identical and, if so, deleting the duplicates from A_{i-1} and extracting them as a subset; then performing generalized deduplication on the deduplicated A_{i-1}; and finally scoring the timeliness of the information in the deduplicated A_{i-1}.
In any of the above schemes, preferably, the generalized deduplication of the deduplicated A_{i-1} by the multi-modal semantic data fusion subsystem comprises the following steps:
first, setting a window length and an overlap length according to the size of the computer's storage space and the volume of the retrieved text data, and splitting the e-th text into a set Set_{te} containing i_e sentence-segment elements;
then, for each sentence-segment element in Set_{te}, segmenting it into words, extracting word stems, calculating word frequencies, and applying hash coding to obtain the compression code of the sentence-segment element;
and finally, calculating the Hamming distance between the compression codes of the sentence-segment elements in the set, performing similarity calculation, and extracting the duplicates as a subset.
In any of the above schemes, preferably, the timeliness scoring of the information in the deduplicated A_{i-1} by the multi-modal semantic data fusion subsystem comprises the following steps:
calculating the time span coefficient T_{range} of each piece of information in A_{i-1};
extracting the earliest online time of each piece of information from the initial online time of the web page, constructing a vector, performing a normalization operation, and calculating the aging coefficient T_{early};
according to the time span coefficient T_{range} and the aging coefficient T_{early}, calculating the timeliness score p_t as:
p_t = k_{range} T_{range} + k_{early} T_{early};
wherein k_{range} is the weight of the time span coefficient and k_{early} is the weight of the aging coefficient.
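As a rough illustration of the weighted sum above, the sketch below normalizes both vectors and combines them. The text does not say which normalization is used or whether earlier or later first-appearance times should score higher, so min-max scaling, the default weights of 0.5, and the function names are all assumptions made for this example.

```python
from datetime import date


def min_max(values):
    """Min-max normalize a list of numbers to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def timeliness_scores(span_days, earliest_dates, k_range=0.5, k_early=0.5):
    """Compute p_t = k_range * T_range + k_early * T_early for every item."""
    t_range = min_max(span_days)
    # Dates are converted to day ordinals so they can be normalized as numbers.
    t_early = min_max([d.toordinal() for d in earliest_dates])
    return [k_range * r + k_early * e for r, e in zip(t_range, t_early)]


# Made-up inputs: time spans in days and earliest online dates of three items.
spans = [0, 30, 10]
firsts = [date(2023, 1, 1), date(2023, 1, 11), date(2023, 1, 21)]
p_t = timeliness_scores(spans, firsts)
```

Swapping the weights lets the same code favor long-lived items (policy research) or recently appearing ones (technical research), which is the distinction the description draws between the two research directions.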
In any of the above schemes, preferably, for the authority score p_a, the data source categories determined from the main domain name include: news, forum, government, intelligent, and technical.
The traffic information processing and sensing method based on multi-mode data fusion sensing provided by the embodiment of the invention has the following beneficial effects:
(1) By drawing on a mature large language model to associate the hypernyms and hyponyms of the keyword entered by the user, a keyword search library is built algorithmically on that basis, and the search range is expanded around the user's keyword. This can, to some extent, reduce the problem of low-quality collected information caused by the user's insufficient familiarity with the field at the first search.
(2) Information in the transportation field exists in rich forms, while a traditional crawler can only download data and cannot analyze it. To address this, the invention constructs a multi-modal data semanticization system based on speech recognition and OCR technology, which converts multi-modal data into text data so that all types of information can be fully used.
(3) To address the differences between research questions in the transportation field, a multi-modal semantic data fusion sensing system is designed. For the two main directions of policy research and technical research in the transportation field, an algorithm for evaluating the value of information is designed from three angles: topic relevance, timeliness (time span and time priority), and source value. The semanticized intelligence data is screened in depth according to established knowledge of the transportation industry, and trend sensing in the two basic fields is achieved through value ranking, timeliness scoring, and word-frequency statistics.
(4) Based on the industry trends analyzed by the multi-modal semantic data fusion sensing system, natural language processing is used to offer the user recommended keyword classes after the first search, so that the keyword library for the next search is built in a semi-supervised way according to the user's selection. This achieves targeted keyword expansion grounded in knowledge of the transportation industry, greatly widens the search range, reduces the collection of invalid information, and improves search efficiency.
Drawings
Fig. 1 is a flowchart of a traffic information processing and sensing method based on multi-modal data fusion sensing according to an embodiment of the present invention.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the following examples.
As shown in fig. 1, the traffic information processing and sensing method based on multi-mode data fusion sensing in the embodiment of the invention comprises the following steps:
step S1, the user inputs the domain to which the information needs to be acquired, so as to determine the range of information acquisition.
Step S2, calling a general large language model to give the hypernym set Set_{up}, hyponym set Set_{down}, and synonym set Set_{syn} of the field, and generating an initial A_0 generalized search keyword library on this basis.
The general large language model can be any open-source or commercial large language model; the keywords in the initial generalized search keyword library include
{A_0} = Set_{up} ∪ Set_{down} ∪ Set_{syn},
wherein the hypernym set is: Set_{up} = {u_1, u_2, u_3, ..., u_i};
the hyponym set is: Set_{down} = {d_1, d_2, d_3, ..., d_j};
the synonym set is: Set_{syn} = {s_1, s_2, s_3, ..., s_k};
and the cross-augmentation set is expressed as:
Set_{expand} = {u_1 d_1, u_1 d_2, u_1 d_3, ..., u_2 d_1, u_2 d_2, ..., u_3 d_1, ..., u_i d_j, ..., u_i s_k}.
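The set construction above can be sketched in a few lines. This is an illustration under stated assumptions: the three word lists would come from the language model but are hard-coded here, the function name is hypothetical, and joining a hypernym with a hyponym or synonym by a single space is an assumed way of forming the combined search terms.

```python
from itertools import product


def build_keyword_library(set_up, set_down, set_syn):
    """Return (A_0, Set_expand) from hypernym, hyponym, and synonym lists."""
    # A_0 is the union of the three sets.
    a0 = set(set_up) | set(set_down) | set(set_syn)
    # Set_expand pairs every hypernym with every hyponym and every synonym.
    expand = {f"{u} {d}" for u, d in product(set_up, set_down)}
    expand |= {f"{u} {s}" for u, s in product(set_up, set_syn)}
    return a0, expand


# Hard-coded example words standing in for the language model's output.
a0, expand = build_keyword_library(
    set_up=["transportation"],
    set_down=["highway", "bridge"],
    set_syn=["traffic"],
)
```

With one hypernym, two hyponyms, and one synonym, `a0` holds four keywords and `expand` holds three combined terms.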
Step S3, crawling data on the network with a crawler tool according to the A_i-level generalized search keyword library to construct an A_i-level intelligence library.
In this step, the A_i-level intelligence library comprises a plurality of pieces of intelligence data, each of which comprises: the keywords used in the retrieval, the original link of the web page, the main domain name, the initial online time of the web page, and the intelligence data in the web page.
Specifically, the crawler tool can acquire the text in a web page, download pictures, pdfs, audio, video, and similar data, and read the online time of the web page; the acquired data forms the A_i-level intelligence library. The intelligence data in the web page includes text data, picture data, audio and video data, pdf publications, and the like.
Step S4, importing the crawled data into the multi-modal data fusion sensing system to generate an A_i-level generalized search keyword library, and repeating step S3 until i is greater than the number of iterations defined by the user, at which point the loop exits.
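The iteration between steps S3 and S4 can be pictured as a simple loop. The sketch below is only a skeleton under assumptions: `crawl` and `fuse` are hypothetical callables standing in for the crawler tool and the fusion sensing system, and the stubs used in the example do no real crawling or scoring.

```python
def run_iterations(a0, crawl, fuse, max_iter):
    """Alternate crawling (S3) and keyword regeneration (S4) until i > max_iter."""
    keywords = set(a0)
    libraries = []
    i = 0
    while i <= max_iter:
        library = crawl(keywords)   # step S3: build the level-i intelligence library
        libraries.append(library)
        keywords = fuse(library)    # step S4: derive the next keyword library
        i += 1
    return libraries, keywords


def crawl_stub(keywords):
    # Stand-in crawler: pretends each keyword yields one record.
    return sorted(keywords)


def fuse_stub(library):
    # Stand-in fusion step: keeps the old keywords and adds one new one.
    return set(library) | {f"kw{len(library)}"}


libraries, final_keywords = run_iterations({"a"}, crawl_stub, fuse_stub, max_iter=2)
```

With `max_iter=2` the loop body runs for i = 0, 1, 2 and stops once i exceeds the user-defined count.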
In an embodiment of the present invention, the multi-modal data fusion sensing system includes a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem.
The multi-modal data semanticization subsystem is used for rapidly performing recognition and conversion processing on video data, picture data, pdf data, and audio data respectively, to generate a processed database A_{i-1}.
(1) For video data, frames are first captured automatically from the video frame by frame based on the Python cv2 library, and the captured frames are added to the picture data in the corresponding column; the audio track is separated based on the Python moviepy library and saved to the audio data in the corresponding column.
(2) For picture data and pdf data, the content is recognized by OCR based on the Python Tesseract library, and the generated text is added to the text data in the same column.
(3) For audio data, a speech recognition API is called to convert the audio into text, which is added to the text data in the same column.
(4) The processed database is named A_{i-1}.
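The flow of data between the columns in items (1) to (3) can be sketched as a small dispatcher. The record layout and the `semanticize` function are hypothetical, and the three converters are passed in as callables so the sketch stays runnable without external libraries. In practice they might wrap `cv2.VideoCapture` for frame capture, `pytesseract.image_to_string` for OCR, and a speech recognition service; that wiring is an assumption and is not shown here.

```python
def semanticize(record, split_video, ocr, transcribe):
    """Fold every modality of one intelligence record into its text column."""
    images = list(record.get("images", []))
    audio = list(record.get("audio", []))
    texts = list(record.get("text", []))

    # (1) Video: frames join the picture column, the soundtrack joins the audio column.
    for video in record.get("videos", []):
        frames, track = split_video(video)
        images.extend(frames)
        audio.append(track)

    # (2) Pictures (including video frames) and PDFs: OCR output joins the text column.
    for doc in images + list(record.get("pdfs", [])):
        texts.append(ocr(doc))

    # (3) Audio (including video soundtracks): transcripts join the text column.
    for clip in audio:
        texts.append(transcribe(clip))

    return {**record, "images": images, "audio": audio, "text": texts}


# Stub converters that only label their input, for illustration.
record = {"text": ["t0"], "images": ["img1"], "pdfs": ["doc1"],
          "audio": ["aud1"], "videos": ["vid1"]}
processed = semanticize(
    record,
    split_video=lambda v: ([f"{v}-frame0", f"{v}-frame1"], f"{v}-audio"),
    ocr=lambda d: f"ocr({d})",
    transcribe=lambda c: f"asr({c})",
)
```

Processing video first matters: its frames and soundtrack must already sit in the picture and audio columns before OCR and transcription run.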
The multi-modal semantic data fusion subsystem is used for constructing a scoring system according to the characteristics of the transportation industry, evaluating the validity of the retrieved content, generating a validity score for each piece of intelligence data, sorting the intelligence according to the scores obtained, displaying the top-ranked words to the user for screening, generating a new generalized search keyword library A_{i+1} from the user's screening result, and jumping to step S3.
The validity evaluation of the retrieved content by the multi-modal semantic data fusion subsystem comprises the following:
(1) Keyword recurrence score p_k: evaluated from the number of times the keywords appear in the text data.
Let the number of occurrences of the l-th keyword of A_i in the e-th text data be a_l, and let the number of words in that text data be AWN; the keyword recurrence score p_k of the e-th text data is then as follows:
[formula not reproduced in the source text]
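The formula for p_k is not reproduced in this text. As a minimal sketch, assuming p_k is the total number of keyword occurrences (the sum of the a_l) divided by the word count AWN, the score could be computed as below. This ratio is an assumption for illustration, not the formula stated by the patent, and the function name and regular-expression tokenizer are likewise hypothetical.

```python
import re


def keyword_recurrence_score(text, keywords):
    """Assumed p_k: total keyword occurrences divided by the word count AWN."""
    words = re.findall(r"\w+", text.lower())
    awn = len(words)
    if awn == 0:
        return 0.0
    total = 0
    for keyword in keywords:
        tokens = keyword.lower().split()
        n = len(tokens)
        if n == 0:
            continue
        # a_l: count positions where the keyword's tokens match consecutively.
        total += sum(1 for i in range(awn - n + 1) if words[i:i + n] == tokens)
    return total / awn


text = "Highway bridge inspection: the bridge deck and highway lanes."
p_k = keyword_recurrence_score(text, ["bridge", "highway"])
```

The example text has nine words and four keyword hits, so the assumed score is 4/9.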
(2) Timeliness score p_t: the value of the retrieved information is evaluated in terms of timeliness.
First, it is determined whether any pieces of information are exactly identical; if so, the duplicates are deleted from A_{i-1} and extracted as a subset. Then generalized deduplication is performed on the deduplicated A_{i-1}. Finally, the timeliness of the information in the deduplicated A_{i-1} is scored.
The generalized deduplication of the deduplicated A_{i-1} by the multi-modal semantic data fusion subsystem comprises the following steps:
First, chunk segmentation is performed. The sliding window length a and the overlap length b are determined according to the size of the computer's storage space and the volume of the retrieved text data, where a is much greater than b. The e-th text, of length d, is thereby split into a set Set_{te} containing i_e sentence-segment elements.
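The chunk segmentation step can be sketched as a sliding window that advances by a - b positions each time. Whether lengths are measured in characters or words is not stated, so the sketch works on any token list; the function name and the small values of a and b are chosen only for readability.

```python
def sliding_chunks(tokens, a, b):
    """Split tokens into windows of length a that overlap by b tokens."""
    if not 0 <= b < a:
        raise ValueError("expected 0 <= b < a")
    if not tokens:
        return []
    step = a - b
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + a])
        # Stop once the current window has reached the end of the text.
        if start + a >= len(tokens):
            break
        start += step
    return chunks


# A text of length d = 10 with window a = 4 and overlap b = 1.
chunks = sliding_chunks(list(range(10)), a=4, b=1)
```

Here the number of chunks corresponds to i_e for that text; the last chunk may be shorter than a when d does not line up with the step.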
Then, compression encoding is performed. For each sentence-segment element in Set_{te}, the element is segmented into words, word stems are extracted, word frequencies are calculated, and hash coding is applied to obtain the compression code of the sentence-segment element.
Specifically, based on an open-source corpus, the Python NLTK library is used to segment each sentence-segment element in Set_{te} into words and extract word stems, which removes the influence of word forms and of words that carry no meaning. Word frequencies are then calculated, and an ordinary hash transformation is applied to the words in the top 20% by word frequency to obtain the hash code of each word. The hash is a binary code in which each 0 is mapped to -1 and each 1 is kept as 1, e.g. 00101 -> [-1, -1, 1, -1, 1]. The transformed results are stored as row vectors, and the row vectors are summed column by column to obtain the compression code of the sentence-segment element.
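The encoding described here resembles a SimHash-style fingerprint, and a minimal sketch under several assumptions follows. The input is taken to be already tokenized and stemmed (the NLTK step is not reproduced), MD5 truncated to 16 bits stands in for the unspecified hash, the 20% figure is read as the most frequent 20% of distinct words, and the final sign step that turns column sums back into bits is added so that an XOR-based Hamming distance can be applied afterwards.

```python
import hashlib
from collections import Counter


def simhash_vector(words, bits=16, top_fraction=0.2):
    """Column sums of the +1/-1 hash rows of the most frequent words."""
    counts = Counter(words)
    keep = max(1, round(len(counts) * top_fraction))
    column_sums = [0] * bits
    for word, _ in counts.most_common(keep):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big") & ((1 << bits) - 1)  # low `bits` bits
        for pos in range(bits):
            bit = (value >> (bits - 1 - pos)) & 1
            column_sums[pos] += 1 if bit else -1  # 1 stays +1, 0 becomes -1
    return column_sums


def to_fingerprint(column_sums):
    """Assumed final step: positive column sums become 1, all others 0."""
    return [1 if s > 0 else 0 for s in column_sums]


segment = ["road", "road", "bridge", "tunnel", "road", "bridge"]
vector = simhash_vector(segment)
fingerprint = to_fingerprint(vector)
```

Segments that share most of their frequent words end up with fingerprints that differ in few bit positions, which is what the Hamming distance in the next step measures.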
Finally, the Hamming distance between the compression codes of the sentence-segment elements in the set is calculated, similarity is computed from it, and duplicate items are extracted as a subset set.
Specifically, the Hamming distance between the compression codes of each pair of sentence-segment elements in the set is calculated (the two codes are XORed and the number of 1s in the result is counted). The results are sorted in reverse order, and the top c% of combinations are regarded as similar (c is set by the user). Accordingly, the duplicate items are deleted from A_{i-1} and extracted as a subset set.
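A minimal sketch of the pairwise comparison follows. It assumes that the ranking is meant to put the most similar pairs first, that is, ascending Hamming distance, and that codes are equal-length lists of 0/1 bits; the function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(code_a: list[int], code_b: list[int]) -> int:
    """XOR the two codes bit by bit and count the resulting 1s."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have equal length")
    return sum(a ^ b for a, b in zip(code_a, code_b))


def most_similar_pairs(codes: list[list[int]], c_percent: float) -> list[tuple[int, int]]:
    """Return the index pairs in the top c% when ranked by similarity.

    Assumption: 'most similar' means smallest Hamming distance.
    """
    pairs = list(combinations(range(len(codes)), 2))
    pairs.sort(key=lambda p: hamming_distance(codes[p[0]], codes[p[1]]))
    keep = int(len(pairs) * c_percent / 100)
    return pairs[:keep]
```

Comparing every pair is quadratic in the number of segments, which is acceptable for a sketch but would likely need bucketing or indexing for a large crawl.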
The multi-modal semantic data fusion subsystem then scores the timeliness of the information in the deduplicated A_{i-1}. The timeliness of the e-th piece of information in the deduplicated A_{i-1} is evaluated mainly from how long the information persists over time and when it first appeared. For policy research, the wider the time range covered by a policy document in the information, the greater the support that policy provides to the industry, and the higher the timeliness score. For technical research the opposite holds: the larger the time span, the more mature the research content and the weaker its novelty. The evaluation therefore has to be adapted to the user's needs.
Specifically, the timeliness score is computed in the following steps:
(1) Calculate the time span coefficient T_range of each piece of information in A_{i-1}.
The time span of each piece of information in A_{i-1} is calculated from the initial online time of the web page. For information without duplicates, the time span is 0; for information with a similar subset, the time span is the difference in days between the latest and the earliest item in that subset. The time spans of all information items are assembled into a new vector and normalized; the resulting values are the time span coefficients.
(2) Extract the earliest online time of each piece of information from the initial online time of the web page, assemble these times into a vector, and normalize it to obtain the aging coefficient T_early.
The timeliness score p_t is calculated from the time span coefficient T_range and the aging coefficient T_early as:
p_t = k_range · T_range + k_early · T_early;
where k_range is the weight of the time span coefficient and k_early is the weight of the aging coefficient.
For high and new technology information tasks, k_range = 0.2 and k_early = 0.8; for mature technology information tasks, k_range = 0.6 and k_early = 0.4; for policy support information tasks, k_range = 0.7 and k_early = 0.3; for general policy tasks, k_range = 0.6 and k_early = 0.4.
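The timeliness computation can be sketched as below. The weights come from the text; everything else is an assumption: min-max scaling stands in for the unspecified normalization, a later first appearance yields a larger T_early, and each retained item is represented by the list of first-online dates of itself and its similar subset. The names `WEIGHTS`, `min_max` and `timeliness_scores` are hypothetical.

```python
from datetime import date

# (k_range, k_early) per task type, as listed in the text.
WEIGHTS = {
    "high_tech": (0.2, 0.8),
    "mature_tech": (0.6, 0.4),
    "policy_support": (0.7, 0.3),
    "general_policy": (0.6, 0.4),
}


def min_max(values: list[float]) -> list[float]:
    """Min-max normalization (an assumed choice of normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def timeliness_scores(groups: list[list[date]], task: str) -> list[float]:
    """Compute p_t for each retained item.

    groups[n] holds the first-online dates of item n and of the
    items in its similar subset (a single date if it has none).
    """
    k_range, k_early = WEIGHTS[task]
    # Time span in days: 0 when the item has no similar subset.
    spans = [(max(g) - min(g)).days for g in groups]
    earliest = [min(g).toordinal() for g in groups]
    t_range = min_max(spans)
    t_early = min_max(earliest)
    return [k_range * r + k_early * e for r, e in zip(t_range, t_early)]
```

Switching the `task` key changes only the weights, which is how the same data can be scored differently for policy research and for technical research.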
(3) Authority score p_a: the value of the e-th text with respect to the user's needs is judged from the authority of the text source.
In an embodiment of the invention, the data sources for the authority score p_a, classified according to the main domain name, comprise news, forum, government, think tank, and technical sources. It should be noted that the data sources for the authority score p_a are not limited to these examples; other types may be included and configured by the user as needed, and are not described further here.
Specifically, data sources can be classified by main domain name into five types: news, forum, government, think tank, and technical. The value of information from different sources differs markedly between research directions. Research in the transportation industry consists mainly of technical research and policy research, and the scores of the five source types in these research directions are shown in Table 1. In practical application, the institutional source of a piece of data is determined from its main domain name entry and a preset domain name-to-institution mapping library. In Table 1, the classes are: 1, news class p_news; 2, forum class p_blog; 3, government class p_government; 4, think tank class p_thinktank; 5, technical class p_tech.
Table 1. Scoring table for the five source types
Research direction | p_news | p_blog | p_government | p_thinktank | p_tech |
---|---|---|---|---|---|
Technical class | 0.2 | 0.2 | 0.4 | 0.6 | 0.8 |
Policy class | 0.6 | 0.6 | 0.8 | 0.6 | 0.2 |
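A lookup of the authority score from Table 1 might look like the sketch below. The scores are taken from the table; the domain names in `DOMAIN_TO_SOURCE` are placeholders for the preset domain name-to-institution mapping library, and the default score for unknown domains is an assumption.

```python
# Scores from Table 1, keyed by research direction and source type.
AUTHORITY = {
    "technical": {"news": 0.2, "blog": 0.2, "government": 0.4, "thinktank": 0.6, "tech": 0.8},
    "policy": {"news": 0.6, "blog": 0.6, "government": 0.8, "thinktank": 0.6, "tech": 0.2},
}

# Hypothetical mapping library; real entries would be maintained by the user.
DOMAIN_TO_SOURCE = {
    "news.example.com": "news",
    "forum.example.com": "blog",
    "transport.example.gov": "government",
    "policy.example.org": "thinktank",
    "lab.example.edu": "tech",
}


def authority_score(main_domain: str, direction: str, default: float = 0.0) -> float:
    """Return p_a for a main domain name under a research direction.

    Domains missing from the mapping get `default` (an assumed policy).
    """
    source = DOMAIN_TO_SOURCE.get(main_domain)
    if source is None:
        return default
    return AUTHORITY[direction][source]
```

For example, the same government domain scores 0.8 for a policy task and 0.4 for a technical task, reflecting the table's two rows.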
(4) Finally, the validity score of the e-th piece of information in A_{i-1} is:
score = p_k + p_t + p_a.
The pieces of information are sorted by score and the top e × b are retained (b is a user-defined proportion). Phrases are extracted with the Python NLTK library, their word frequencies are calculated and sorted, and the top c phrases are retained (c is a user-defined proportion). The phrases are then classified using the Hamming distances of their source segments obtained during the similarity calculation, and the ten words with the highest frequency in each class are shown to the user as representatives of that subclass, so that the user can quickly screen the keywords. A new generalized search keyword library A_{i+1} is generated from the screened result, and the method jumps to step 2. A_{i+1} replaces A_0; the results of the first retrieval are thus reused, with keywords extracted through data fusion and scoring optimization, to enlarge the retrieval database.
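The scoring and retention step can be sketched as follows, assuming each piece of information is a dictionary that already carries its three component scores. The field names and the function names are hypothetical, and the retained count is rounded down, which the text does not specify.

```python
def validity_score(item: dict) -> float:
    """score = p_k + p_t + p_a for one piece of information."""
    return item["p_k"] + item["p_t"] + item["p_a"]


def keep_top(items: list[dict], b: float) -> list[dict]:
    """Sort by validity score (highest first) and keep the top e * b items,
    where e is the number of items and b is a user-defined proportion."""
    if not 0 <= b <= 1:
        raise ValueError("b must be a proportion between 0 and 1")
    ranked = sorted(items, key=validity_score, reverse=True)
    return ranked[: int(len(ranked) * b)]
```

The retained items would then feed the phrase extraction and the user's keyword screening described in the text.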
Step S5: for the finally obtained A_i-level information knowledge base, a classified information knowledge base is generated according to the institutional sources, and the development trend of the field of interest to the user is obtained from the timeliness scores and the word-frequency statistics.
The traffic information processing and sensing method based on multi-modal data fusion sensing according to the embodiment of the invention addresses the problems that existing traffic information collection systems collect information in the transportation field inefficiently, yield little useful information, and struggle to provide researchers and policy makers in related industries with useful information on industry development trends. It provides an effective screening mechanism that automatically scores and screens the obtained information, improving the efficiency of the related work. It cleans, classifies and refines the data, and it fuses and reuses data of multiple modalities during retrieval, which in turn improves retrieval efficiency and accuracy and allows developments in the transportation field abroad to be tracked in a timely manner.
The traffic information processing and sensing method based on multi-mode data fusion sensing provided by the embodiment of the invention has the following beneficial effects:
(1) By drawing on a mature large language model to associate hypernyms and hyponyms with the keywords entered by the user, a keyword search library is built algorithmically from those hypernyms and hyponyms. The search range is expanded around the user's keywords, which mitigates to some extent the low quality of collected information caused by the user's unfamiliarity with the field at the time of the first search.
(2) Information in the transportation field exists in many forms, and a traditional crawler can only download data without analyzing it. To address this, the invention constructs a multi-modal data semanticization system based on speech recognition and OCR, which converts multi-modal data into text data so that the various types of existing information are fully used.
(3) To account for the different research questions in the transportation field, a multi-modal semantic data fusion perception system is designed around the two main directions of policy research and technical research. An algorithm for evaluating the value of information in the transportation field is designed from three angles: topic relevance, information timeliness (time span and time priority), and the value of the information source. The semanticized information data are screened in depth according to established knowledge of the transportation industry, and trend perception in the two basic fields is achieved through value ranking, timeliness scoring and word-frequency statistics.
(4) Based on the industry trends analyzed by the multi-modal semantic data fusion perception system, recommended keyword classes are offered to the user after the initial search by means of natural language processing, so that the keyword library for the next search is constructed in a semi-supervised manner according to the user's selection. This achieves targeted keyword expansion grounded in knowledge of the transportation industry, greatly widens the search range, reduces the recording of invalid information, and improves search efficiency.
It should be noted that the technical solution of the invention involves a number of parameters, and its beneficial effects and notable progress result from considering the interaction among these parameters as a whole. In addition, the value ranges of all parameters in the technical solution were obtained through a large number of tests. For each parameter and for the combinations of parameters, the inventors recorded a large amount of test data; for reasons of space, the specific test data are not disclosed here.
Those skilled in the art will appreciate that the traffic information processing and sensing method based on multi-modal data fusion sensing of the present invention includes any combination of the parts described in the summary, the detailed description and the drawings; for reasons of space and for brevity, these combinations are not described one by one. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. A traffic information processing and sensing method based on multi-mode data fusion sensing is characterized by comprising the following steps:
step S1, inputting the field to which the information needs to be acquired by a user so as to determine the range of acquiring the information;
step S2, calling a general large language model to give a hypernym set Set_up, a hyponym set Set_down and a synonym set Set_syn for the field, and generating an initial generalized search keyword library A_0 based thereon;
step S3, crawling data from the network with a crawler tool according to the A_i-level generalized search keyword library, to construct an A_i-level information library;
step S4, importing the crawled data into a multi-modal data fusion sensing system to generate the A_i-level generalized search keyword library, and repeating step S3 until i is greater than the number of iterations defined by the user, whereupon the loop is exited; the multi-modal data fusion sensing system comprises a multi-modal data semanticization subsystem and a multi-modal semantic data fusion subsystem, wherein
the multi-modal data semanticization subsystem is used for rapidly performing recognition and conversion processing on video data, picture data, pdf data and audio data respectively, and generating a processed database A_{i-1};
the multi-modal semantic data fusion subsystem is used for constructing a scoring system according to the characteristics of the transportation industry, performing validity evaluation on the retrieved content A_{i-1} obtained by the multi-modal data semanticization subsystem, generating a validity score for each piece of information data, sorting and processing the information according to the obtained scores, displaying the top-ranked words to the user for screening, generating a new generalized search keyword library A_{i+1} from the user's screened result, jumping to step S2, and replacing A_0 with A_{i+1}, so that the results of the first retrieval are reused and keywords are extracted through data fusion and scoring optimization to enlarge the retrieval database;
the validity evaluation performed by the multi-modal semantic data fusion subsystem on the retrieved content comprises:
(1) keyword recurrence score p_k: evaluated from the number of times the keywords appear in the text data;
(2) timeliness score p_t: the value of the retrieved information is evaluated in terms of its timeliness;
(3) authority score p_a: the value of the e-th text with respect to the user's needs is judged from the authority of the text source;
finally, the validity score of the e-th piece of information in A_{i-1} is:
score = p_k + p_t + p_a;
step S5, for the finally obtained A_i-level information knowledge base, generating a classified information knowledge base according to the institutional sources, and obtaining the development trend of the field of interest to the user from the timeliness scores and the word-frequency statistics.
2. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S2, the initial generalized search keyword library A_0 comprises:
{A_0} = Set_up ∪ Set_down ∪ Set_syn,
wherein the hypernym set is: Set_up = {u_1, u_2, u_3, ..., u_i};
the hyponym set is: Set_down = {d_1, d_2, d_3, ..., d_j};
the synonym set is: Set_syn = {s_1, s_2, s_3, ..., s_k};
and the cross-augmentation set is expressed as:
Set_expand = {u_1 d_1, u_1 d_2, u_1 d_3, ..., u_2 d_1, u_2 d_2, ..., u_3 d_1, ..., u_i d_j, ..., u_i s_k}.
3. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein in said step S3, the A_i-level information library comprises a plurality of pieces of information data, each piece comprising: the keyword used in the retrieval, the original link of the web page, the main domain name, the initial online time of the web page, and the information data in the web page.
4. The traffic information processing and sensing method based on multi-modal data fusion sensing as claimed in claim 1, wherein,
(1) for video data, images are automatically captured frame by frame based on the Python cv2 library, and the resulting frames are added to the picture data of the corresponding column; the audio is separated from the video based on the python movie library, and the audio data is stored in the corresponding column;
(2) for picture data and pdf data, recognition is performed by an OCR method based on the Python Tesseract library, and the generated text content is added to the text data in the same column;
(3) for audio data, a speech recognition API is called to convert the audio into text, which is added to the text data in the same column.
5. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the multi-modal semantic data fusion subsystem performs keyword recurrence scoring, comprising the steps of:
letting the number of occurrences of the l-th keyword of A_i in the e-th text data be a_l and the number of words of the text data be AWN, the keyword recurrence score p_k of the e-th text data is as follows:
6. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 1, wherein the multi-modal semantic data fusion subsystem performs timeliness scoring, comprising the steps of:
first, checking the pieces of information in A_{i-1} for exact equality, deleting identical items from A_{i-1} as duplicates, and extracting them as a subset set; then, applying generalized deduplication to the deduplicated A_{i-1}; finally, scoring the timeliness of the information in the deduplicated A_{i-1}.
7. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the generalized deduplication performed by the multi-modal semantic data fusion subsystem on the deduplicated A_{i-1} comprises the following steps:
first, setting the window length and the overlap length according to the size of the computer's storage space and the volume of the retrieved text data, and splitting the e-th text into a set Set_te containing i_e sentence-segment elements;
then, tokenizing each sentence-segment element in Set_te into words, performing stemming, calculating word frequencies, and performing hash encoding to obtain the compression code of the sentence-segment element;
finally, calculating the Hamming distance between the compression codes of the sentence-segment elements in the set, computing similarity, and extracting duplicate items as a subset set.
8. The traffic information processing and sensing method based on multi-modal data fusion sensing as set forth in claim 6, wherein the timeliness scoring performed by the multi-modal semantic data fusion subsystem on the information in the deduplicated A_{i-1} comprises the following steps:
calculating the time span coefficient T_range of each piece of information in A_{i-1};
extracting the earliest online time of each piece of information from the initial online time of the web page, assembling these times into a vector, and normalizing it to obtain the aging coefficient T_early;
calculating the timeliness score p_t from the time span coefficient T_range and the aging coefficient T_early as:
p_t = k_range · T_range + k_early · T_early;
where k_range is the weight of the time span coefficient and k_early is the weight of the aging coefficient.
9. The traffic information processing and sensing method based on multi-modal data fusion sensing as claimed in, wherein the data sources for the authority score p_a, classified according to the main domain name, comprise: news, forum, government, think tank, and technical sources.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311376178.1A CN117370932A (en) | 2023-10-23 | 2023-10-23 | Traffic information processing and sensing method based on multi-mode data fusion sensing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311376178.1A CN117370932A (en) | 2023-10-23 | 2023-10-23 | Traffic information processing and sensing method based on multi-mode data fusion sensing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117370932A true CN117370932A (en) | 2024-01-09 |
Family
ID=89407318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311376178.1A Pending CN117370932A (en) | 2023-10-23 | 2023-10-23 | Traffic information processing and sensing method based on multi-mode data fusion sensing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117370932A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118210383A (en) * | 2024-05-21 | 2024-06-18 | 南通亚森信息科技有限公司 | Information input system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||