CN114117174A - Multi-format data screening management system based on big data - Google Patents

Info

Publication number
CN114117174A
Authority
CN
China
Prior art keywords
data
information
module
screening
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111025860.7A
Other languages
Chinese (zh)
Inventor
杨子晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Development Economics (AREA)
  • Software Systems (AREA)
  • Educational Administration (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of communication and social security systems, in particular to a multi-format data screening management system based on big data. The technical scheme is as follows: the system comprises a crawler system server and a screening coding module that controls the crawler system server to crawl required information from a Web database according to preset requirements; a scheduling terminal indexes and retrieves required information from data stored in a database inside the system and integrates it with the information acquired by the crawler system server; and a real-time monitoring system converts the integrated information into image information and text information that can be read directly and feeds it back to the scheduling terminal, which displays it. The invention makes full use of big data: it indexes the required information, captures the effective information, and uses the information that a person has published on network platforms to judge that person's movements, which is helpful for investigating cases; the working process is simplified, data access time is greatly reduced, and space utilization is improved.

Description

Multi-format data screening management system based on big data
Technical Field
The invention relates to the technical field of communication and social security systems, in particular to a multi-format data screening management system based on big data.
Background
Big data, or huge data, refers to data so massive that current mainstream software tools cannot capture, manage, process, and organize it within a reasonable time into information that helps enterprises make better business decisions.
Chinese patent application publication CN113033251A discloses a face recognition screening system based on big data. The system includes: an image acquisition module for acquiring a target face image; a feature extraction module for extracting at least one feature block from the target face image; a person matching module for matching against a population database according to the feature blocks to obtain at least one matched person image whose corresponding features are consistent with the feature blocks; and an information acquisition module for acquiring the matched person's information according to the matched person image.
The above patent uses big data to screen and identify faces, and face screening is currently widespread. However, when no prior information is available, it is difficult to find the movement information of the person sought and the related event information; audio, video, image and other information in big data cannot be used to index related information, so the scope of use is rather limited.
Disclosure of Invention
The invention aims to provide a multi-format data screening management system based on big data that can index relevant information from big data on the network, index the required information and capture the effective information by means of a data matching degree and a retrieval database, determine a person's movements from the information that person has published on network platforms, and simplify the working process.
The technical scheme of the invention is as follows: the multi-format data screening management system based on big data comprises a crawler system server, a screening coding module, a database inside the system, a scheduling terminal and a real-time monitoring system, wherein the screening coding module is used for controlling the crawler system server to crawl required information in a Web database according to a preset requirement;
the scheduling terminal can index and retrieve required information from the data stored in the database inside the system and integrate it with the information acquired by the crawler system server; the real-time monitoring system converts the integrated information into image information and text information that can be read directly and feeds it back to the scheduling terminal, which displays it;
the crawler system server is internally provided with a data communication system, and the data communication system comprises a crawler data acquisition module for acquiring Web database information and a data processing subsystem for integrating the acquired information;
the data communication system also comprises a manual verification module, a data acquisition module, a distributed data mining module and an array storage module; the manual verification module is used for manually processing information data needing safety verification in the crawling process of the crawler data acquisition module, directly or indirectly transmitting the information data to the data acquisition module after the information data passes the verification, and the data acquisition module stores the information data in the array storage module according to the priority of the information data; and the scheduling terminal acquires the information stored by the array storage module through the distributed data mining module.
Preferably, the data processing subsystem comprises a data identification module for identifying the information collected by the crawler data acquisition module, a matching value screening module for screening the information identified by the data identification module according to a matching degree to form a high-to-low priority order, and a data summarization module for receiving the sorted data;
the data processing subsystem further comprises a discarded item temporary storage module and an auxiliary evidence module, and the discarded item temporary storage module temporarily stores the information data with low matching degree for secondary screening;
and the auxiliary evidence module is used for recording the time and position at which the data is crawled and the release time and release position of the crawled information data, and for packaging the recorded data and transmitting it to the data summarization module.
Preferably, the auxiliary evidence module comprises a time node recording unit for recording the time at which the data is crawled and the release time of the crawled information data, and a position node recording unit for recording the position at which the data is crawled and the release position of the crawled information data;
the time node recording unit and the position node recording unit match the recorded time information and position data with the corresponding information from the matching value screening module and then transmit them synchronously to the data summarization module.
Preferably, the real-time monitoring system comprises a scheduling access unit, a trajectory generation module and a real-time data updating module, wherein the real-time data updating module sorts the information data of the 'certain event' or the 'certain person' stored in the array storage module according to time nodes to form a timeline;
the trajectory generation module connects the position nodes corresponding to the sorted time nodes of the 'certain event' or 'certain person' into a continuous track path, and the scheduling access unit transmits the track path and the timeline to the scheduling terminal.
Preferably, the scheduling terminal has a built-in digital remote monitoring system that can call up the picture of the corresponding position at the corresponding time node, and the digital remote monitoring system has a face recognition function;
the scheduling terminal further comprises a positioning tracking system and an audio recording system provided by the operator under the operator communication protocol; under the same protocol, the operator also provides the identity card information, face information and communication record information registered at network access.
Preferably, the screening coding terminal has a built-in screening coding system, the screening coding system comprises a primary index unit, a secondary index unit and a multiple index unit, each index unit comprises a plurality of keyword positions, and the keyword positions are arranged in priority order.
Preferably, the index results of the primary index unit, the secondary index unit and the multiple index unit can be displayed in a superimposed manner, and a secondary retrieval can also be performed on the basis of the index result of the primary index unit.
Preferably, the text data crawled by the crawler data acquisition module includes but is not limited to the TXT/JPG/PPT/Word/PDF/BMP formats, the crawled audio data includes but is not limited to the AIFF/AU/MP3/MIDI/WMA/VQF/AAC/APE formats, and the crawled picture data includes but is not limited to the JPG/TIFF/GIF/TGA/SVG/PSD/DXF formats.
Preferably, the crawling position of the crawler data acquisition module in the Web database includes a Web page URL, a server database, a program, a script, and an access log.
Preferably, the calculation formula of the matching degree is as follows:
$$F = \frac{K_2 S_2 + K_3 S_3 + K_4 S_4 + \cdots + K_N S_N + K_{N+1} S_{N+1} + K_{N+2} S_{N+2}}{K_1} \times 100\%$$
wherein $K_1$ represents the total amount of related information after the first keyword position is entered and indexed, $K_2$ represents the amount of related information after the second keyword position is entered and indexed, and so on, $K_{N+2}$ represents the amount of related information after the $(N+2)$-th keyword position is entered and indexed; $S_2, S_3, S_4, \ldots, S_{N+2}$ are coefficients equal to the constant 1; and $F$ represents the matching degree of the related information after a single index or multiple indexes.
Compared with the prior art, the invention has the following beneficial technical effects:
(1): the invention makes full use of big data, indexes the required information and captures the effective information, and uses the information that a person has published on network platforms to determine that person's movements, which is helpful for investigating cases;
(2): the crawler data acquisition module meets existing requirements and increases the freedom with which data can be crawled for different requirements or application scenarios;
(3): data that requires verification can be crawled with the help of the manual verification module, and the results are stored to form a database of correct verification answers; if the same graphic or numeric verification is encountered again, the corresponding answer is extracted from this database and entered automatically, so that manual verification need not be repeated and the working process is greatly simplified;
(4): when the array storage module accesses data, the relevant disks in the array operate together, so that data access time is greatly reduced and space utilization is improved; the data in the array storage module can be extracted rapidly by the distributed data mining module.
Drawings
FIG. 1 shows a block diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of the data communication system of FIG. 1;
FIG. 3 is a block diagram of the data processing subsystem of FIG. 2;
FIG. 4 is a schematic diagram of the modules of the screening coding terminal of the present invention;
FIG. 5 is a block diagram of a scheduling terminal according to the present invention;
FIG. 6 is a decision flow diagram of the crawler data acquisition module of the present invention;
FIG. 7 is a trend chart of the number and matching degree of keywords in the present invention;
reference numerals: 100. Web database; 200. screening coding terminal; 210. screening coding system; 300. crawler system server; 310. data communication system; 311. crawler data acquisition module; 312. data processing subsystem; 3121. data identification module; 3122. matching value screening module; 3123. discarded item temporary storage module; 3124. data summarization module; 3125. auxiliary evidence module; 313. manual verification module; 314. data acquisition module; 315. distributed data mining module; 316. array storage module; 400. system internal database; 500. scheduling terminal; 510. digital remote monitoring system; 520. positioning tracking system; 530. audio recording system; 600. real-time monitoring system; 610. scheduling access unit; 620. trajectory generation module; 630. real-time data updating module.
Detailed Description
The technical solution of the present invention is further explained with reference to the accompanying drawings and specific embodiments.
Example one
As shown in fig. 1, the multi-format data screening management system based on big data according to the present invention includes a crawler system server 300 and a screening coding terminal 200 for controlling the crawler system server 300 to crawl required information in a Web database 100 according to preset requirements, where the preset requirements may be event keywords, place names, person names, pictures, videos, audio and other items that are related to the requirement and that the big data server can carry;
the system further comprises a system internal database 400, a scheduling terminal 500 and a real-time monitoring system 600;
the scheduling terminal 500 can index and retrieve the required information (including name, sex, date of birth, identification number, place of birth, residence, family members, driving licenses, hotel registration records, records of travel using the identification card, records of misconduct, etc.) from the data stored in the system internal database 400 and integrate it with the information acquired by the crawler system server 300; the real-time monitoring system 600 converts the integrated information into image information and text information that can be read directly and feeds it back to the scheduling terminal 500, which displays it;
the scheduling terminal 500 is a computer or other equipment capable of implementing the above functions;
as shown in fig. 2, the crawler system server 300 is provided with a data communication system 310, and the data communication system 310 includes a crawler data acquisition module 311 for acquiring information of the Web database 100 and a data processing subsystem 312 for performing integration processing on the acquired information;
the crawler data acquisition module 311 includes a Web crawler, which may specifically be a general Web crawler, a focused Web crawler, an incremental Web crawler or a Deep Web crawler, chosen according to the requirement or application scenario.
For example: when the required data needs to be crawled from the whole Web database 100, a general Web crawler is used, with either a depth-first strategy or a breadth-first strategy; the choice depends on whether the required information can be indexed simply;
for another example: when specified information data is needed, a focused Web crawler is used, which crawls only the targeted information data; because few pages are stored, they are updated quickly, and the requirement of a particular group for specific information can be met;
the amount of information accessible to a Deep Web crawler is hundreds of times that of the Surface Web. The Deep Web crawler architecture comprises six basic function modules (a crawl controller, a parser, a form analyzer, a form processor, a response analyzer and an LVS controller) and two internal crawler data structures (a URL list and an LVS table), where LVS (Label Value Set) denotes the set of labels and values used as the data source for filling in forms;
Deep Web page content can be crawled by the Deep Web crawler, which widens the crawling range.
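The depth-first and breadth-first strategies mentioned above for the general Web crawler differ only in which pending page is visited next. The following is a minimal sketch over a hypothetical in-memory link graph; the names `LINKS` and `crawl_order` and the page names are assumptions made for illustration and do not come from the patent.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (illustrative only).
LINKS = {
    "home": ["news", "forum"],
    "news": ["article1", "article2"],
    "forum": ["thread1"],
    "article1": [],
    "article2": [],
    "thread1": [],
}

def crawl_order(start, links, strategy="breadth"):
    """Return the order in which pages are visited under the given strategy."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Breadth-first takes the oldest pending page; depth-first the newest.
        page = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        children = links.get(page, [])
        if strategy == "depth":
            # Reverse so that the first link on a page is explored first.
            children = list(reversed(children))
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

print(crawl_order("home", LINKS, "breadth"))
print(crawl_order("home", LINKS, "depth"))
```

Breadth-first visits every page one link away before going deeper, whereas depth-first follows one branch to its end before backtracking.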
In addition, the Fish Search algorithm can be used: the query words entered by the user are taken as the topic (namely the keywords), and any page containing the query words is regarded as related to the topic, which further accelerates the crawling process; its limitation is that the degree of match between a page and the topic cannot be evaluated.
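The binary relevance test described for Fish Search can be illustrated with a deliberately simplified sketch. It assumes a hypothetical in-memory page set (`PAGES`) and a depth budget (`depth`); it is not the full published algorithm and not the crawler of the invention. Links found on relevant pages receive the full budget, links found on irrelevant pages receive one less, and branches whose budget is exhausted are dropped.

```python
from collections import deque

# Hypothetical pages: url -> (text, outgoing links). Illustrative data only.
PAGES = {
    "a": ("event report about the harbor", ["b", "c"]),
    "b": ("weather today", ["d"]),
    "c": ("harbor event follow-up", ["e"]),
    "d": ("sports results", ["f"]),
    "e": ("cooking tips", []),
    "f": ("harbor schedule", []),
}

def is_relevant(text, query_words):
    """Binary relevance: on-topic if the page contains any query word."""
    return any(word in text for word in query_words)

def fish_search(start, pages, query_words, depth=1):
    """Simplified Fish Search over an in-memory page set."""
    queue = deque([(start, depth)])
    seen = {start}
    hits = []
    while queue:
        url, budget = queue.popleft()
        text, links = pages[url]
        relevant = is_relevant(text, query_words)
        if relevant:
            hits.append(url)
        # Relevant pages renew the budget; irrelevant pages use it up.
        child_budget = depth if relevant else budget - 1
        if child_budget < 0:
            continue  # this branch has run out of budget
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, child_budget))
    return hits

print(fish_search("a", PAGES, ["harbor"]))
```

Because `is_relevant` returns only true or false, the sketch also shows the limitation noted in the text: it cannot say how well a page matches the topic.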
The data communication system 310 further comprises a manual verification module 313, a data acquisition module 314, a distributed data mining module 315 and an array storage module 316; the manual verification module 313 manually handles information data that requires security verification during crawling by the crawler data acquisition module 311; after verification is passed, the information data is transmitted directly or indirectly to the data acquisition module 314, which stores it in the array storage module 316 according to its priority;
during data crawling, the crawler data acquisition module 311 simply crawls information according to its program and has no verification capability of its own, whereas sites use graphic verification, numeric verification, text verification and the like to prevent malicious crawling, so that their information is not lost or maliciously captured;
when verification is required, the data processing subsystem 312 passes the verification challenge to the manual verification module 313, which handles it manually and stores the result to form a database of correct verification answers; if the same graphic or numeric verification is encountered again, the corresponding answer is extracted from this database and entered automatically, so that manual verification need not be repeated and the working process is greatly simplified.
For example, suppose the numeric verification "2 × 7 = U" appears during a crawl. The crawler data acquisition module 311 passes it to the manual verification module 313; once U = 14 is entered manually, verification is passed, crawling continues, and "2 × 7 = 14" is recorded. If "2 × 7 = U" appears again in a later crawl, the corresponding answer is extracted from the database of correct verification answers and entered automatically, so that manual verification need not be repeated.
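The reuse of manually verified answers in this example can be sketched as a small lookup store. The class name `VerificationAnswerStore`, the `fake_operator` stand-in for the human step, and the ASCII challenge string are assumptions for illustration; the sketch also assumes that every manually supplied answer passes verification before it is stored.

```python
class VerificationAnswerStore:
    """Remembers answers to verification challenges that were solved manually."""

    def __init__(self, ask_human):
        self._answers = {}           # challenge text -> verified answer
        self._ask_human = ask_human  # callable standing in for the manual step
        self.manual_calls = 0

    def solve(self, challenge):
        # Reuse a stored answer when the same challenge was seen before.
        if challenge in self._answers:
            return self._answers[challenge]
        self.manual_calls += 1
        answer = self._ask_human(challenge)
        self._answers[challenge] = answer
        return answer


def fake_operator(challenge):
    """Stand-in for a human operator; only knows the example from the text."""
    return {"2*7=U": "14"}[challenge]


store = VerificationAnswerStore(fake_operator)
first = store.solve("2*7=U")   # handled manually
second = store.solve("2*7=U")  # answered from the store
print(first, second, store.manual_calls)
```

Keying the store on the exact challenge text only helps when the identical challenge recurs, which matches the "same graphic verification or digital verification" condition in the text.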
FIG. 6 schematically shows the verification flow during data crawling;
the specific steps are as follows:
first, the crawler data acquisition module 311 accesses a web page URL, server or device terminal; it then opens the inspection view of the web page (that is, views the HTML code), finds the data to be extracted in the HTML code, and parses the corresponding data with the Python program of the crawler data acquisition module; next comes the decision step, which judges whether the data to be crawled requires verification: if not, the data is captured directly; if so, the data processing subsystem 312 passes the verification challenge to the manual verification module 313 for manual handling, the verification information is entered once it has been processed, and after verification is passed it is stored so that the next verification of the same type can be screened and answered automatically; after verification is passed, the data is captured directly;
finally, the collected data is packaged and sent to the data processing subsystem 312 or the data acquisition module 314, completing one full cycle of data capture and verification information storage.
A complete answer library is built up through early-stage manual processing, and when the same kind of verification information is encountered during later crawling, the corresponding answer is extracted from the library and entered automatically.
The scheduling terminal 500 can obtain the information stored in the array storage module 316 through the distributed data mining module 315;
the array storage module 316 consists of a large number of large-capacity storage units arranged in a matrix of N rows × M columns and used as a single disk; data is stored in segments on different disks, and when data is accessed the relevant disks in the array operate together, so that data access time is greatly reduced and space utilization is improved;
with current data mining (extraction) methods, extracting a data set is slow; the distributed data mining module 315 can extract data from the array storage module 316 rapidly, and because the data in the array storage module 316 is arranged in order, extraction can run as multiple processes, which further accelerates retrieval and extraction.
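The segmented storage across several disks described for the array storage module resembles data striping. The sketch below is only an illustration under that assumption: the functions `stripe` and `read_back`, the segment size and the number of disks are hypothetical, and Python lists stand in for disks.

```python
def stripe(data, num_disks, segment_size):
    """Split data into fixed-size segments and deal them round-robin across disks."""
    disks = [[] for _ in range(num_disks)]
    for i in range(0, len(data), segment_size):
        segment = data[i:i + segment_size]
        disks[(i // segment_size) % num_disks].append(segment)
    return disks


def read_back(disks):
    """Reassemble the data by reading one segment from each disk in turn."""
    out = []
    longest = max(len(disk) for disk in disks)
    for row in range(longest):
        for disk in disks:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", num_disks=3, segment_size=2)
print(disks)
print(read_back(disks))
```

Since consecutive segments sit on different disks, a real array can read them in parallel, which is the source of the reduced access time claimed in the text.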
As shown in fig. 3, the data processing subsystem 312 includes a data identification module 3121 for identifying the information collected by the crawler data acquisition module 311, a matching value screening module 3122 for screening the information identified by the data identification module 3121 according to the matching degree to form a priority order from high to low, and a data summarization module 3124 for receiving the sorted data.
The calculation formula of the matching degree is as follows:
$$F = \frac{K_2 S_2 + K_3 S_3 + K_4 S_4 + \cdots + K_N S_N + K_{N+1} S_{N+1} + K_{N+2} S_{N+2}}{K_1} \times 100\%$$
wherein $K_1$ represents the total amount of related information after the first keyword position is entered and indexed, $K_2$ represents the amount of related information after the second keyword position is entered and indexed, and so on, $K_{N+2}$ represents the amount of related information after the $(N+2)$-th keyword position is entered and indexed; $S_2, S_3, S_4, \ldots, S_{N+2}$ are coefficients equal to the constant 1; and $F$ represents the matching degree of the related information after a single index or multiple indexes.
This formula allows the matching degree to be calculated and sorted quickly, and overcomes the limitation that a focused web crawler using the Fish Search algorithm cannot evaluate the degree of match between a page and the topic;
for ease of understanding, consider the following example: entering the keyword at the first keyword position and indexing returns 100 items of information; after the keyword at the second keyword position is entered, a second index is run on those 100 items and returns 40 related items, so by the formula F = [(40 × 1)/100] × 100% = 40%, that is, the matching degree is 40%;
after the keyword at the third keyword position is entered and 30 related items are returned, the formula gives F = [(40 × 1 + 30 × 1)/100] × 100% = 70%, that is, the matching degree is 70%. Note that in this embodiment the matching degree F is not capped at 100%, and the formula may be changed and set according to actual conditions.
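The worked figures above can be checked with a few lines of code. This is a minimal sketch of the formula as stated in the text; the function name `matching_degree` and the list layout of the counts are assumptions for illustration.

```python
def matching_degree(counts, weights=None):
    """Compute F = (K2*S2 + ... + Kn*Sn) / K1 * 100 from per-keyword result counts.

    counts[0] is K1 (results after the first keyword position);
    counts[1:] are K2, K3, ...; weights are S2, S3, ..., each 1 per the text.
    """
    k1, rest = counts[0], counts[1:]
    if k1 <= 0:
        raise ValueError("K1 must be positive")
    if weights is None:
        weights = [1] * len(rest)
    total = sum(k * s for k, s in zip(rest, weights))
    return 100 * total / k1


print(matching_degree([100, 40]))      # 40.0
print(matching_degree([100, 40, 30]))  # 70.0
```

Because the numerator is a sum over all later keyword positions, the result can exceed 100 once enough positions are added, which agrees with the remark that F is not capped at 100%.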
As shown in fig. 7, fig. 7 shows a trend relationship between the number of keyword indexes and the matching degree, and as can be seen from fig. 7 and the calculation formula of the matching degree, the number of keyword indexes and the number of indexed related information data are positively correlated with the matching degree.
The data processing subsystem 312 further includes a discarded item temporary storage module 3123 and an auxiliary evidence module 3125. The discarded item temporary storage module 3123 temporarily retains the information data with a low matching degree for secondary screening; this information data is not transmitted to the data summarization module 3124, which reduces the response time of the data acquisition module 314;
the auxiliary evidence module 3125 records the time and position at which the data is crawled and the release time and release position of the crawled information data, and packages the recorded data and transmits it to the data summarization module 3124.
The auxiliary evidence module 3125 includes a time node recording unit, which records the time at which the data is crawled and the release time of the crawled information data, and a position node recording unit, which records the position at which the data is crawled and the release position of the crawled information data;
the time node recording unit and the position node recording unit match the recorded time information and position data with the corresponding information from the matching value screening module 3122 and then transmit them synchronously to the data summarization module 3124, so that each item of information or data is paired with its time and position, which makes classification by the data acquisition module 314 easier.
The time node recording unit records the release time and the crawl time of a given item; these can serve as supporting evidence and as the basis for generating the activity track and timeline of a certain event or a certain person.
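The ordering by matching degree and the temporary retention of low-matching items described in this embodiment can be sketched as follows. The source does not state where the cut-off between "kept" and "held back" lies, so the `threshold` parameter, the function name `screen_by_matching` and the sample items are assumptions.

```python
def screen_by_matching(items, threshold):
    """Split (name, matching_degree) pairs into a ranked list and a held-back list."""
    ranked = sorted(
        (item for item in items if item[1] >= threshold),
        key=lambda item: item[1],
        reverse=True,  # highest matching degree first
    )
    # Low-matching items are kept aside for a possible second screening.
    held_back = [item for item in items if item[1] < threshold]
    return ranked, held_back


items = [("post A", 40.0), ("post B", 70.0), ("post C", 5.0)]
ranked, held_back = screen_by_matching(items, threshold=10.0)
print(ranked)
print(held_back)
```

Only the `ranked` list would be passed on for summarization; `held_back` plays the role of the discarded item temporary storage.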
Example two
As shown in fig. 1 and fig. 2, in the multi-format data screening management system based on big data of this embodiment, on the basis of the first embodiment, the real-time monitoring system 600 includes a scheduling access unit 610, a trajectory generation module 620 and a real-time data updating module 630, where the real-time data updating module 630 sorts the information data of the "certain event" or "certain person" stored in the array storage module 316 by time node to form a timeline;
for example, a fugitive "Wu" publishes a piece of information through a network platform at time node A from position node B, publishes information again through a network platform at time node C from position node E, and so on; after this information is crawled, a timeline is formed from time node A to time node C and a track path is formed from position node B to position node E, so that Wu's movements can be obtained rapidly, which is helpful for investigating the case.
The trajectory generation module 620 forms a continuous track path with the position nodes corresponding to the sequenced "certain event" or "certain person" time nodes, and the scheduling access unit 610 completes the transmission process of the track path and the time line to the scheduling terminal 500.
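The timeline and track path described above amount to sorting a subject's records by time and reading off the positions in that order. The following is a minimal sketch; the `Record` type, the function `build_timeline_and_track` and the dates are hypothetical and chosen only to mirror the example of time nodes A and C and position nodes B and E.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    subject: str            # the "certain person" or "certain event"
    published_at: datetime  # time node
    location: str           # position node


def build_timeline_and_track(records, subject):
    """Sort one subject's records by time; return (timeline, track path)."""
    selected = sorted(
        (r for r in records if r.subject == subject),
        key=lambda r: r.published_at,
    )
    timeline = [r.published_at for r in selected]
    track = [r.location for r in selected]
    return timeline, track


# Hypothetical records, deliberately stored out of time order.
records = [
    Record("Wu", datetime(2021, 5, 2, 9, 0), "position node E"),
    Record("Wu", datetime(2021, 5, 1, 9, 0), "position node B"),
    Record("other", datetime(2021, 5, 1, 12, 0), "position node Z"),
]
timeline, track = build_timeline_and_track(records, "Wu")
print(track)
```

The track path here is simply the ordered list of positions; drawing it on a map or interpolating between nodes is outside the scope of the sketch.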
As shown in fig. 5, the scheduling terminal 500 has a built-in digital remote monitoring system 510 that can call up the picture of the corresponding position at the corresponding time node, and the digital remote monitoring system 510 has a face recognition function; the digital remote monitoring system 510 is the existing "sky eye system", a monitoring network formed by high-definition cameras installed throughout each city; it can be used to search quickly for people along the track path.
The scheduling terminal 500 further includes a positioning tracking system 520 and an audio recording system 530 provided by the operator under the operator communication protocol; under the same protocol, the operator also provides the identity card information, face information and communication record information registered at network access. Because the system itself cannot directly locate a person suspected of misconduct or obtain that person's communication records, the operator can, where this is reasonable and lawful, look up the person's position and call information through its base stations.
EXAMPLE III
As shown in fig. 4, this embodiment of the multi-format data screening management system based on big data differs from the first or second embodiment in that the screening coding terminal 200 has a built-in screening coding system 210, where the screening coding system 210 includes a primary index unit, a secondary index unit, and a multi-index unit, each index unit includes a plurality of keyword bits, and the keyword bits are arranged in priority order;
the index results of the primary index unit, the secondary index unit and the multi-index unit can be displayed in a superimposed manner, or secondary retrieval can be performed on the basis of the index result of the primary index unit;
Example one: the index results of the primary index unit, the secondary index unit and the multi-index unit are displayed in a superimposed manner. The primary index unit searches for the keywords "Wu", "Wu-related event A" and "Wu-related event occurrence scene B", and the secondary index unit searches for the keywords "X province" and "Y city", so that the information related to each keyword is displayed in a superimposed manner (that is, the information related to "Wu", "Wu-related event A", "Wu-related event occurrence scene B", "X province" and "Y city" appears independently and may repeat).
Example two: the primary index unit searches for the keywords "Wu", "Wu-related event A" and "Wu-related event occurrence scene B", and the secondary index unit then searches for "X province" and "Y city" on the basis of the primary index result, so that only the unified information related to "Wu", "Wu-related event A", "Wu-related event occurrence scene B", "X province" and "Y city" is displayed, which gives higher accuracy.
The keywords of the index may be sorted in priority order, for example "Wu", "Wu-related event A" and "Wu-related event occurrence scene B". If "Wu-related event A" is taken as the first priority, that priority may be locked; that is, information unrelated to "Wu-related event A" does not appear in the indexed results even if it contains "Wu" or "Wu-related event occurrence scene B".
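The difference between superimposed display, secondary retrieval, and a locked first priority can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Record` type, the tag-based matching rule (a record matches an index unit if it carries at least one of that unit's keywords), and the function names are not taken from the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Record:
    """One stored item with the keywords it relates to (hypothetical layout)."""
    rid: int
    tags: FrozenSet[str]


def search(records: Iterable[Record], keywords: Iterable[str]) -> List[Record]:
    """Result of one index unit: records related to at least one keyword."""
    wanted = set(keywords)
    return [r for r in records if r.tags & wanted]


def superimposed(records, primary_kw, secondary_kw) -> List[Record]:
    """Example one: each unit searches independently; results are shown together, repeats kept."""
    return search(records, primary_kw) + search(records, secondary_kw)


def secondary_retrieval(records, primary_kw, secondary_kw) -> List[Record]:
    """Example two: the secondary unit searches only within the primary result."""
    return search(search(records, primary_kw), secondary_kw)


def priority_locked(records, keywords, locked_kw: str) -> List[Record]:
    """Locked first priority: drop every hit that is unrelated to the locked keyword."""
    return [r for r in search(records, keywords) if locked_kw in r.tags]


if __name__ == "__main__":
    records = [
        Record(1, frozenset({"Wu", "event A", "X province"})),
        Record(2, frozenset({"Wu", "Y city"})),
        Record(3, frozenset({"scene B"})),
        Record(4, frozenset({"X province"})),
    ]
    primary = ["Wu", "event A", "scene B"]
    secondary = ["X province", "Y city"]
    print([r.rid for r in superimposed(records, primary, secondary)])
    print([r.rid for r in secondary_retrieval(records, primary, secondary)])
    print([r.rid for r in priority_locked(records, primary, "event A")])
```

In this sketch the superimposed mode returns more items, including repeats, while secondary retrieval and priority locking both narrow the primary result, which matches the higher accuracy described for Example two.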
The text data crawled by the crawler data acquisition module 311 includes but is not limited to the TXT/JPG/PPT/Word/PDF/BMP formats, the audio data crawled includes but is not limited to the AIFF/AU/MP3/MIDI/WMA/VQF/AAC/APE formats, and the picture data crawled includes but is not limited to the JPG/TIFF/GIF/TGA/SVG/PSD/DXF formats;
a person's voice is unique, so if relevant audio data is crawled it can be compared with call data, and text data and picture data can likewise speed up solving the case.
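One way to route crawled files by format is a lookup on the file extension, sketched below under stated assumptions. The category sets follow the format lists given above; the mapping of "Word" to `doc`/`docx` and of "MIDI" to `mid`/`midi`, and the function name `categories`, are assumptions. Because JPG appears in both the text list and the picture list, the lookup returns every category that names the extension.

```python
from pathlib import Path
from typing import List

# Extension sets transcribed from the format lists in the description.
# "doc"/"docx" for Word and "mid" for MIDI are assumed spellings.
FORMATS = {
    "text": {"txt", "jpg", "ppt", "doc", "docx", "pdf", "bmp"},
    "audio": {"aiff", "au", "mp3", "mid", "midi", "wma", "vqf", "aac", "ape"},
    "picture": {"jpg", "tiff", "gif", "tga", "svg", "psd", "dxf"},
}


def categories(filename: str) -> List[str]:
    """Return, in alphabetical order, every category whose list names this extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return sorted(name for name, exts in FORMATS.items() if ext in exts)


if __name__ == "__main__":
    for name in ("call_recording.MP3", "scene.jpg", "notes.txt", "setup.exe"):
        print(name, categories(name))
```

Since the lists are described as non-exhaustive ("includes but is not limited to"), an empty result only means the extension is not in this sketch's table.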
The locations in the Web database 100 crawled by the crawler data acquisition module 311 include Web page URLs, server databases, programs, scripts, and access logs. Crawling the access log reveals a person's operations: for example, when a Web page is browsed, the Web server writes the operation to a file as one record line, and the corresponding information can be obtained by analyzing that file.
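Analyzing an access log line by line might look like the sketch below. The patent does not specify a log format, so the widely used Common Log Format is assumed here; the regular expression, the field names, and the sample line are illustrative only.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout (Common Log Format):
# host ident user [time] "method path protocol" status size
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one record line; return None if it does not match the assumed format."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Convert the timestamp (e.g. 02/Sep/2021:10:15:30 +0800) to a timezone-aware datetime.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec


if __name__ == "__main__":
    sample = ('203.0.113.7 - - [02/Sep/2021:10:15:30 +0800] '
              '"GET /news/post-1.html HTTP/1.1" 200 5120')
    print(parse_line(sample))
```

Each parsed record gives who requested which page and when, which is the kind of per-operation information the paragraph above describes obtaining from the log file.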
It should be noted that the terms "information", "information data" and "data" used above refer to the information data crawled by the crawler data acquisition module 311, or to information or data related to the keyword information entered through the screening coding terminal 200. This information or data may exist in the form of pictures, videos, texts or links, and may be obtained directly or indirectly. The terms "crawling" and "crawled" refer to the means by which the crawler data acquisition module 311 acquires the corresponding information data. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific situation.
The above embodiments are merely some preferred embodiments of the present invention, and those skilled in the art can make various alternative modifications and combinations of the above embodiments based on the technical solution of the present invention and the related teaching of the above embodiments.

Claims (10)

1. A multi-format data screening management system based on big data, characterized in that: the system comprises a crawler system server, a screening coding module for controlling the crawler system server to crawl required information in a Web database according to a preset requirement, an internal database of the system, a scheduling terminal and a real-time monitoring system;
the scheduling terminal can index and retrieve required information from the data stored in the internal database of the system and integrate it with the information acquired by the crawler system server; the real-time monitoring system forms the integrated information into image information and text information that can be taken in directly by sight and feeds them back to the scheduling terminal, which displays them;
the crawler system server is internally provided with a data communication system, and the data communication system comprises a crawler data acquisition module for acquiring Web database information and a data processing subsystem for integrating the acquired information;
the data communication system also comprises a manual verification module, a data acquisition module, a distributed data mining module and an array storage module; the manual verification module is used for manually processing information data needing safety verification in the crawling process of the crawler data acquisition module, directly or indirectly transmitting the information data to the data acquisition module after the information data passes the verification, and the data acquisition module stores the information data in the array storage module according to the priority of the information data; and the scheduling terminal acquires the information stored by the array storage module through the distributed data mining module.
2. The big data based multi-format data screening management system as claimed in claim 1, wherein the data processing subsystem comprises a data recognition module for recognizing the information collected by the crawler data collection module, a matching value screening module for screening the information recognized by the data recognition module according to a matching degree to form a high-to-low priority order, and a data summarization module for receiving the sorted data;
the data processing subsystem further comprises a discarded item temporary storage module and an evidence-assisting module, and the discarded item temporary storage module temporarily stores the information data with a low matching degree for secondary screening;
and the evidence-assisting module is used for recording the time and the position at which the data is crawled and the release time and the release position of the crawled information data, and for packaging the recorded data and transmitting it to the data summarizing module.
3. The big-data-based multi-format data screening management system according to claim 2, wherein the evidence-assisting module includes a time node recording unit that records a time when the crawled data is entered and a time when the crawled information data is released, and the evidence-assisting module further includes a location node recording unit that records a location when the crawled data is entered and a position when the crawled information data is released;
and the time node recording unit and the position node recording unit match the recorded time information and position data with the corresponding information of the matching value screening module and then synchronously transmit the time information and position data to the data summarizing module.
4. The big data-based multi-format data screening management system as claimed in claim 3, wherein the real-time monitoring system comprises a scheduling access unit, a trajectory generation module and a real-time data update module, and the real-time data update module sorts the information data of the "certain event" or the "certain person" stored in the array storage module according to time nodes to form a timeline;
the track generation module forms a continuous track path with position nodes corresponding to the sequenced time nodes of the 'certain event' or the 'certain person', and the scheduling access unit completes the transmission process of the track path and the time line to the scheduling terminal.
5. The big-data-based multi-format data screening management system as claimed in claim 4, wherein the scheduling terminal is internally provided with a digital remote monitoring system which can call up the picture of the corresponding position at the corresponding time node, and the digital remote monitoring system has a face recognition function; the scheduling terminal also comprises a positioning tracking system and an audio recording system which are provided by an operator according to the communication protocol of the operator, and the operator also provides, according to the communication protocol of the operator, the identity card information, face information and communication record information registered at network access.
6. The big-data-based multi-format data screening management system as claimed in any one of claims 1 to 5, wherein the screening coding terminal is provided with a built-in screening coding system, the screening coding system comprises a primary indexing unit, a secondary indexing unit and a plurality of indexing units, each indexing unit comprises a plurality of keyword bits, and the keyword bits are distributed according to a priority order.
7. The big data-based multi-format data screening management system as claimed in claim 6, wherein the indexing results of the primary indexing unit, the secondary indexing unit and the multiple indexing units can be displayed in a superimposed manner, and the secondary retrieval can be performed on the basis of the indexing result of the primary indexing unit.
8. The big data based multi-format data screening management system according to any one of claims 1-5, wherein the text data crawled by the crawler data acquisition module includes but is not limited to the TXT/JPG/PPT/Word/PDF/BMP formats, the audio data crawled includes but is not limited to the AIFF/AU/MP3/MIDI/WMA/VQF/AAC/APE formats, and the picture data crawled includes but is not limited to the JPG/TIFF/GIF/TGA/SVG/PSD/DXF formats.
9. The big data based multi-format data screening management system as claimed in claim 8, wherein the locations in the Web database crawled by the crawler data acquisition module include Web page URLs, server databases, programs, scripts, and access logs.
10. The big data based multi-format data screening management system according to claim 7, wherein the calculation formula of the matching degree is:
$$F = \left[\frac{K_2 S_2 + K_3 S_3 + K_4 S_4 + \cdots + K_N S_N + K_{N+1} S_{N+1} + K_{N+2} S_{N+2}}{K_1}\right] \times 100\%;$$
wherein $K_1$ represents the total number of items of relevant information after the first keyword bit is entered and indexed, $K_2$ represents the number of items of relevant information after the second keyword is entered and indexed, and so on, and $K_{N+2}$ represents the number of items of relevant information after the (N + 2)-th keyword is entered and indexed; $S_2$, $S_3$, and $S_4$.
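As a purely illustrative sketch, the matching degree F of claim 10 can be evaluated as a weighted sum divided by the first count. The text above breaks off before the S terms are defined, so treating each S as a numeric coefficient paired with the K of the same index is an assumption, as are the function name and the sample numbers.

```python
from typing import Sequence


def matching_degree(k: Sequence[float], s: Sequence[float]) -> float:
    """Return F in percent.

    k = [K1, K2, ..., K(N+2)]: result counts after each keyword is indexed.
    s = [S2, S3, ..., S(N+2)]: coefficients paired with K2 onward
        (their meaning is assumed; the source text is truncated here).
    """
    if len(k) != len(s) + 1:
        raise ValueError("expected one S coefficient for every K except K1")
    if k[0] == 0:
        raise ValueError("K1 must be non-zero")
    weighted = sum(ki * si for ki, si in zip(k[1:], s))
    return 100.0 * weighted / k[0]


if __name__ == "__main__":
    # Hypothetical numbers: K1 = 200, K2 = 50, K3 = 30; S2 = 1.0, S3 = 0.5
    print(matching_degree([200, 50, 30], [1.0, 0.5]))
```

With these hypothetical inputs the weighted sum is 50 x 1.0 + 30 x 0.5 = 65, and 65 / 200 gives a matching degree of 32.5%.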
CN202111025860.7A 2021-09-02 2021-09-02 Multi-format data screening management system based on big data Pending CN114117174A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111025860.7A CN114117174A (en) 2021-09-02 2021-09-02 Multi-format data screening management system based on big data

Publications (1)

Publication Number Publication Date
CN114117174A true CN114117174A (en) 2022-03-01

Family

ID=80441215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111025860.7A Pending CN114117174A (en) 2021-09-02 2021-09-02 Multi-format data screening management system based on big data

Country Status (1)

Country Link
CN (1) CN114117174A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591578A (en) * 2024-01-18 2024-02-23 山东科技大学 Data mining system and mining method based on big data
CN117591578B (en) * 2024-01-18 2024-04-09 山东科技大学 Data mining system and mining method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20220301