CN113392286B - Big data information acquisition system - Google Patents

Publication number
CN113392286B
Authority
CN
China
Prior art keywords
information
characters
database
character string
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110653306.7A
Other languages
Chinese (zh)
Other versions
CN113392286A (en
Inventor
邢家辉
黄毓桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hongbo Information Technology Co ltd
Original Assignee
Shenzhen Hongbo Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Hongbo Information Technology Co ltd
Priority to CN202110653306.7A
Publication of CN113392286A
Application granted
Publication of CN113392286B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G06F16/9038 Presentation of query results
    • G06F16/906 Clustering; Classification

Abstract

The invention relates to a big data information acquisition system comprising: a first database, which stores data collected from external data sources, the data being transferred into the first database through a first transmission channel at a first transmission speed V10; an acquisition module, connected to the first database through a second transmission channel and used for screening, from the first database, the data that matches an input key character string, the acquisition module comprising an establishing unit that establishes the second transmission channel and a screening unit that transmits the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20; and a central control module connected to the first database, the acquisition module and a display module. By adjusting the first transmission speed and the second transmission speed, the transmission speeds are matched to the volume of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.

Description

Big data information acquisition system
Technical Field
The invention relates to the technical field of data processing, in particular to a big data information acquisition system.
Background
Data is a representation of facts, concepts or instructions that can be processed by manual or automated means. The basic purpose of data processing is to extract and derive data that is valuable and meaningful to particular people from large volumes of data that may be disordered and hard to interpret.
An enterprise generates and must process large and complex volumes of data every day. Whether the acquired data is comprehensive and whether it is processed correctly both influence enterprise decisions; incomplete or improperly processed data can lead to serious consequences or even irrecoverable losses.
In the prior art, how to obtain useful data from massive data sets has become a concern for large enterprises. Related data can be obtained by searching with a search engine, but it is impossible to judge whether the data collected in this way is comprehensive.
Disclosure of Invention
Therefore, the invention provides a big data information acquisition system that can solve the problem of incomplete information acquisition.
To achieve the above object, the present invention provides a big data information acquisition system, including:
a first database, which stores data collected from external data sources, the data being transferred into the first database through a first transmission channel at a first transmission speed V10;
an acquisition module, connected to the first database through a second transmission channel and used for screening, from the first database, the data that matches an input key character string, the acquisition module comprising an establishing unit and a screening unit, wherein the establishing unit establishes the second transmission channel and the screening unit transmits the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20;
a display module, which displays the screened data by category so that the data matching the key character string is presented visually;
a central control module, connected to the first database, the acquisition module and the display module respectively;
when data in the first database is screened, the structure of the input key character string is adjusted according to the amount of data stored in the first database: if the amount of data in the first database is less than a first quantity n1, the structure of the key character string is reduced, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is increased;
if the amount of data in the first database is greater than or equal to the first quantity n1 and less than a second quantity n2, the structure of the key character string, the first transmission speed and the second transmission speed do not need to be adjusted;
if the amount of data in the first database is greater than or equal to the second quantity n2, the structure of the key character string is increased, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is reduced.
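As an illustration only, the three threshold cases can be sketched in Python. The names `Adjustment` and `adjust_for_data_amount` and the string labels are hypothetical and do not appear in the patent; the text states only the direction of each adjustment, not its magnitude, so the sketch returns directions rather than new speed values.

```python
from dataclasses import dataclass


@dataclass
class Adjustment:
    """Direction of each adjustment (labels are illustrative assumptions)."""
    key_structure: str  # "reduce", "keep" or "increase"
    first_speed: str    # change applied to the first transmission speed V10
    second_speed: str   # change applied to the second transmission speed V20


def adjust_for_data_amount(amount: int, n1: int, n2: int) -> Adjustment:
    """Map the amount of data in the first database to the three cases."""
    if n1 >= n2:
        raise ValueError("expected first quantity n1 < second quantity n2")
    if amount < n1:
        # Little data: reduce the key string structure, raise both speeds.
        return Adjustment("reduce", "increase", "increase")
    if amount < n2:
        # n1 <= amount < n2: nothing needs to be adjusted.
        return Adjustment("keep", "keep", "keep")
    # amount >= n2: increase the key string structure, raise V10, lower V20.
    return Adjustment("increase", "increase", "decrease")
```

The check `n1 < n2` is an assumption implied by the middle case, which would otherwise be empty.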
Further, when data collected from external data sources is stored in the first database, the names of the databases to be monitored are set in the first database. If data is added to or removed from a database corresponding to one of these names, a mirror adjustment is also performed in the first database, so that the content of the first database follows the data changes in the monitored database in real time. When the mirror adjustment is performed, if the first database and the monitored database belong to different networks, a heterogeneous network transmission channel is established;
if the first database and the monitored database belong to the same local area network, a communication network transmission channel is established.
Further, a detection period matrix T (T1, T2, T3) is arranged in the first database, where T1 denotes a first detection period, T2 denotes a second detection period, T3 denotes a third detection period, and T1 > T2 > T3. A plurality of names of databases to be monitored are set, and the detection period is chosen according to the update frequency of the database corresponding to each name: if the update frequency belongs to a first frequency f1, the first detection period T1 is used to monitor whether that database has been updated;
if the update frequency belongs to a second frequency f2, the second detection period T2 is used to monitor whether that database has been updated;
if the update frequency belongs to a third frequency f3, the third detection period T3 is used to monitor whether that database has been updated, where f1 < f2 < f3.
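A minimal sketch of this selection rule follows, assuming that f1, f2 and f3 act as three frequency classes. The enum `UpdateFrequency`, the function `detection_period` and the use of seconds as the unit are illustrative assumptions, not part of the patent.

```python
from enum import Enum


class UpdateFrequency(Enum):
    """Hypothetical frequency classes, ordered f1 < f2 < f3."""
    F1 = 1  # lowest update frequency
    F2 = 2
    F3 = 3  # highest update frequency


def detection_period(freq: UpdateFrequency, periods: tuple) -> float:
    """Pick T1, T2 or T3 from the detection period matrix T (T1, T2, T3)."""
    t1, t2, t3 = periods
    if not (t1 > t2 > t3):
        raise ValueError("expected T1 > T2 > T3")
    # Rarely updated databases are polled with the longest period and
    # frequently updated ones with the shortest.
    return {
        UpdateFrequency.F1: t1,
        UpdateFrequency.F2: t2,
        UpdateFrequency.F3: t3,
    }[freq]
```

For example, with periods of 3600, 600 and 60 seconds, a database in the highest frequency class would be checked every 60 seconds.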
Further, the first database includes a plurality of pieces of information N1, N2, N3 … Nn whose lengths are L1, L2, L3 … Ln respectively, and the length of the key character string is taken as the standard character string length ln. When determining whether a piece of information in the first database contains the key character string, the length Li of each piece of information is compared with the standard character string length ln; if Li < ln, the information does not contain the key character string and does not need to be collected;
if Li ≥ ln, the pieces of information that meet the length requirement form a first information base matrix M (M1, M2 … Mk), where k < n. When the first information base is judged, n characters are selected starting from the 1st character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
n characters are selected starting from the 2nd character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
and n characters are selected starting from the k-th character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.
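The length filter and the window-by-window comparison can be sketched as follows. The patent does not define how the coincidence rate is computed; this sketch assumes it is the fraction of positions at which the window and the standard character string hold the same character, and it slides over every possible start position. The function names are hypothetical.

```python
def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions where window and standard agree (assumed definition)."""
    matches = sum(a == b for a, b in zip(window, standard))
    return matches / len(standard)


def contains_key_string(info: str, standard: str, threshold: float = 0.9) -> bool:
    """Return True if any window of info coincides with standard above threshold."""
    n = len(standard)
    # Length filter: information shorter than the standard string is skipped.
    if n == 0 or len(info) < n:
        return False
    # Take n characters starting at the 1st, 2nd, ... character in turn.
    return any(
        coincidence_rate(info[start:start + n], standard) > threshold
        for start in range(len(info) - n + 1)
    )
```

Under this definition an 11-character key such as "acquisition" tolerates one differing character (10/11 ≈ 0.91), while a 4-character key tolerates none.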
Further, determining whether the first information in the first information base contains the key character string also comprises: selecting n characters from back to front, starting from the last character of the first information Mi, and comparing them with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
selecting n characters from back to front, starting from the second-to-last character of the first information Mi, and comparing them with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
and selecting n characters from back to front, starting from the k-th-to-last character of the first information Mi, and comparing them with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.
Further, in the comparison process, if the first information Mi is compared k times and 0.2 × k of the comparison results indicate that the first information Mi includes the key character string, it is determined that the first information Mi does not include the characters of the standard character string.
Further, when the n characters selected from the first information Mi are compared with the characters of the standard character string, if the coincidence rate is less than or equal to 90%, the first character position at which the two differ is found, n characters are reselected starting from that first differing position, and the reselected characters are compared with the characters of the standard character string. If the coincidence rate is then higher than 90%, the first information contains the key character string; if it is still less than or equal to 90%, further judgment is required.
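One possible reading of this re-comparison step is sketched below. It assumes the coincidence rate is the fraction of matching positions, and it returns `None` for the "further judgment required" outcome; the function names and the return convention are hypothetical.

```python
from typing import Optional


def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions where window and standard agree (assumed definition)."""
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def first_difference(window: str, standard: str) -> Optional[int]:
    """Index of the first position where the two strings differ, if any."""
    for i, (a, b) in enumerate(zip(window, standard)):
        if a != b:
            return i
    return None


def recheck_from_first_difference(
    info: str, start: int, standard: str, threshold: float = 0.9
) -> Optional[bool]:
    """Return True if the key string is found, None if further judgment is needed."""
    n = len(standard)
    window = info[start:start + n]
    if len(window) == n and coincidence_rate(window, standard) > threshold:
        return True
    diff = first_difference(window, standard)
    if diff is None:
        return None  # window ran past the end of info without a mismatch
    # Reselect n characters starting at the first differing position.
    retry = info[start + diff:start + diff + n]
    if len(retry) == n and coincidence_rate(retry, standard) > threshold:
        return True
    return None
```

With `info = "acqacquisition"` and the key "acquisition", the first window differs at index 3, and the window reselected from index 3 matches exactly.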
Further, when it is necessary to judge further whether the first information contains the key character string, an approximate information base is used. The approximate information base contains a plurality of approximate character strings Y1, Y2 … Yn of the key character string, each being a character string similar to the key character string. The first information is judged again against the approximate information base to determine whether it contains a field similar to the key character string; if so, the first information is deemed to contain the key character string, and if not, the first information is determined not to contain the key character string.
Further, for the comparison, a conversion code is added to each approximate character string, so that the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n. Starting from the 1st character of the first information Mi, n characters are selected and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string;
n characters are selected starting from the 2nd character of the first information Mi and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string;
n characters are selected starting from the k-th character of the first information Mi and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string;
n characters are selected from back to front, starting from the last character of the first information Mi, and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string;
n characters are selected from back to front, starting from the second-to-last character of the first information Mi, and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string;
and n characters are selected from back to front, starting from the k-th-to-last character of the first information Mi, and compared with each of the approximate character strings Y11, Y12 … Y1n; if the coincidence rate is greater than 90%, the first information Mi contains information similar to the key character string and is treated as containing the key character string.
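The secondary screening against the approximate information base might look like the sketch below. The patent does not specify the conversion code, so that step is omitted here; the sketch also assumes that each window has the same length as the approximate string it is compared with and that the coincidence rate is the fraction of matching positions. All names are hypothetical.

```python
from typing import Iterable, Optional


def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions where window and standard agree (assumed definition)."""
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def matches_any_approximation(
    info: str, approximations: Iterable[str], threshold: float = 0.9
) -> Optional[str]:
    """Return the first approximate string found in info, or None."""
    for approx in approximations:
        n = len(approx)
        if n == 0 or len(info) < n:
            continue  # this approximate string cannot fit inside info
        # Covering every start position also covers the windows that the
        # text describes as selected from back to front.
        for start in range(len(info) - n + 1):
            if coincidence_rate(info[start:start + n], approx) > threshold:
                return approx
    return None
```

Returning the matched approximate string rather than a plain boolean is a design choice of this sketch; it lets a caller record which variant of the key character string triggered the collection.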
Compared with the prior art, the invention has the following advantages. A first quantity and a second quantity are set, the amount of data in the first database is compared with them, and the structure of the key character string input to the acquisition module, the first transmission speed and the second transmission speed are adjusted according to the comparison result. The acquisition module is thus adjusted to the amount of data in the database when it acquires data, which makes the acquired data more accurate. By adjusting the first and second transmission speeds, the transmission speeds are matched to the volume of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.
In particular, the period of the mirror adjustment is determined by the update frequency of the monitored database whose name is set in the first database. If the update frequency is high and the update period short, the mirror adjustment period is correspondingly short, so that a mirror backup is made promptly when the monitored database is updated and the first database is updated in time; if the update frequency is low and the update period long, the mirror adjustment period is long, which reduces the number of backups and the transmission load on the first transmission channel.
In particular, by establishing the detection period matrix T (T1, T2, T3) and selecting the detection period according to the update frequency of the database corresponding to each monitored name, the first database is built and supplemented automatically. This improves the richness and proactivity of the first database, keeps it continuously updated, and ensures the comprehensiveness of data acquisition.
In particular, the pieces of information in the first database are searched on the basis of the key character string, so that the data matching the key character string is acquired and data acquisition efficiency is improved.
In particular, judging whether the first information contains the key character string allows the first information to be assessed accurately and improves the accuracy of data acquisition; the repeated verification makes the judgment precise and improves information acquisition efficiency.
In particular, whether the first information contains the key character string is judged multiple times, and the first information in the first information base is assessed comprehensively from the number of comparison results that indicate the key character string is contained, which improves the accuracy of the judgment and, in turn, of data acquisition.
In particular, while determining whether the standard character string is contained, the position at which a difference occurs is used as the starting point for reselection and re-comparison, which improves comparison accuracy, prevents errors and omissions in information acquisition, and improves its accuracy.
In particular, character strings similar to the key character string are set and an approximate information base is established from them, so that information that does not match the key character string exactly undergoes a secondary screening. This changes the data structure of the keywords, and the comparison against similar character strings provides a second correction of the acquisition result, improving both the accuracy and the comprehensiveness of data acquisition.
Drawings
Fig. 1 is a schematic structural diagram of a big data information acquisition system according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Referring to Fig. 1, a big data information acquisition system according to an embodiment of the present invention includes: a first database 100, which stores data collected from external data sources, the data being transferred into the first database through a first transmission channel at a first transmission speed V10;
an acquisition module 200, connected to the first database through a second transmission channel and used for screening, from the first database, the data that matches an input key character string, the acquisition module comprising an establishing unit 201 and a screening unit 202, wherein the establishing unit establishes the second transmission channel and the screening unit transmits the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20;
a display module 300, which displays the screened data by category so that the data matching the key character string is presented visually;
a central control module 400, connected to the first database, the acquisition module and the display module respectively;
when data in the first database is screened, the structure of the input key character string is adjusted according to the amount of data stored in the first database: if the amount of data in the first database is less than a first quantity n1, the structure of the key character string is reduced, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is increased;
if the amount of data in the first database is greater than or equal to the first quantity n1 and less than a second quantity n2, the structure of the key character string, the first transmission speed and the second transmission speed do not need to be adjusted;
if the amount of data in the first database is greater than or equal to the second quantity n2, the structure of the key character string is increased, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is reduced.
Specifically, the big data information acquisition system provided by the embodiment of the invention can be applied to the collection of enterprise innovation information. The first database comprises a patent information database and a policy information database; each entry in the patent information database may be a data link code, and each policy entry may be the URL of the web page on which the policy is published. In practice, when data in the first database is to be collected, the user inputs a key character string, which is compared with the corresponding content in the first database to determine whether the data contains the key character string or a character string similar to it. This decides whether data collection based on the key character string is carried out and keeps data acquisition efficient.
Specifically, a first quantity and a second quantity are set, the amount of data in the first database is compared with them, and the structure of the key character string input to the acquisition module, the first transmission speed and the second transmission speed are adjusted according to the comparison result. The acquisition module is thus adjusted to the amount of data in the database when it acquires data, which makes the acquired data more accurate. By adjusting the first and second transmission speeds, the transmission speeds are matched to the volume of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.
Specifically, when data collected from external data sources is stored in the first database, the names of the databases to be monitored are set in the first database. If data is added to or removed from a database corresponding to one of these names, a mirror adjustment is also performed in the first database, so that the content of the first database follows the data changes in the monitored database in real time. When the mirror adjustment is performed, if the first database and the monitored database belong to different networks, a heterogeneous network transmission channel is established;
if the first database and the monitored database belong to the same local area network, a communication network transmission channel is established.
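A toy sketch of the mirror adjustment and channel choice follows. Representing each database as a set of records and each network as an identifier string is purely an assumption for illustration, as are the function names and channel labels.

```python
def select_channel(first_db_network: str, monitored_db_network: str) -> str:
    """Choose the channel type from the networks the two databases belong to."""
    if first_db_network == monitored_db_network:
        return "communication_network_channel"
    return "heterogeneous_network_channel"


def mirror_adjust(first_db: set, added: set, removed: set) -> set:
    """Apply the monitored database's additions and removals to the mirror."""
    return (first_db | added) - removed
```

For instance, `mirror_adjust({"a", "b"}, {"c"}, {"a"})` leaves the mirror holding `"b"` and `"c"`.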
Specifically, in the embodiment of the present invention the period of the mirror adjustment is determined by the update frequency of the monitored database whose name is set in the first database. If the update frequency is high and the update period short, the mirror adjustment period is correspondingly short, so that a mirror backup is made promptly when the monitored database is updated and the first database is updated in time; if the update frequency is low and the update period long, the mirror adjustment period is long, which reduces the number of backups and the transmission load on the first transmission channel.
Specifically, a detection period matrix T (T1, T2, T3) is arranged in the first database, where T1 denotes a first detection period, T2 denotes a second detection period, T3 denotes a third detection period, and T1 > T2 > T3. A plurality of names of databases to be monitored are set, and the detection period is chosen according to the update frequency of the database corresponding to each name: if the update frequency belongs to a first frequency f1, the first detection period T1 is used to monitor whether that database has been updated;
if the update frequency belongs to a second frequency f2, the second detection period T2 is used to monitor whether that database has been updated;
if the update frequency belongs to a third frequency f3, the third detection period T3 is used to monitor whether that database has been updated, where f1 < f2 < f3.
Specifically, by establishing the detection period matrix T (T1, T2, T3) and selecting the detection period according to the update frequency of the database corresponding to each monitored name, the embodiment of the invention builds and supplements the first database automatically. This improves the richness and proactivity of the first database, keeps it continuously updated, and ensures the comprehensiveness of data acquisition.
Specifically, the first database includes a plurality of pieces of information N1, N2, N3 … Nn whose lengths are L1, L2, L3 … Ln respectively, and the length of the key character string is taken as the standard character string length ln. When determining whether a piece of information in the first database contains the key character string, the length Li of each piece of information is compared with the standard character string length ln; if Li < ln, the information does not contain the key character string and does not need to be collected;
if Li ≥ ln, the pieces of information that meet the length requirement form a first information base matrix M (M1, M2 … Mk), where k < n. When the first information base is judged, n characters are selected starting from the 1st character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
n characters are selected starting from the 2nd character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;
and n characters are selected starting from the k-th character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.
Specifically, according to the embodiment of the invention, the plurality of pieces of information in the first database are searched based on the key character strings, so that the data matched with the key character strings are acquired, and the data acquisition efficiency is improved.
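The length filter and the front-to-back comparison described above can be sketched in Python as follows. The text does not define how the coincidence rate is computed; this sketch assumes it is the fraction of positions at which the selected characters equal the characters of the standard character string, and that the number of selected characters equals the standard character string length. The function names are hypothetical.

```python
def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions holding the same character (assumed definition)."""
    if not standard or len(window) != len(standard):
        return 0.0
    matches = sum(a == b for a, b in zip(window, standard))
    return matches / len(standard)


def filter_by_length(records: list[str], standard: str) -> list[str]:
    """Keep only the records that are long enough to contain the key string."""
    return [r for r in records if len(r) >= len(standard)]


def contains_key_forward(record: str, standard: str,
                         threshold: float = 0.9) -> bool:
    """Slide a window from the 1st, 2nd, ... character and compare each one."""
    n = len(standard)
    for start in range(len(record) - n + 1):
        if coincidence_rate(record[start:start + n], standard) > threshold:
            return True
    return False
```

With an 11-character key such as `information`, a single differing character still gives a rate of 10/11, which is above 90%, so a record containing `infornation` would be accepted under this assumed definition.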
Specifically, when determining whether the first information in the first information base contains the key character string, the method further includes: selecting n characters from back to front starting from the last character of the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate between the n characters and the standard character string is greater than 90%, indicating that the first information Mi contains the key character string;
selecting n characters from back to front starting from the second-to-last character of the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate between the n characters and the standard character string is greater than 90%, indicating that the first information Mi contains the key character string;
and selecting n characters from back to front starting from the kth-to-last character of the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate between the n characters and the standard character string is greater than 90%, indicating that the first information Mi contains the key character string.
Specifically, the embodiment of the invention judges whether the first information contains the key character string, realizes accurate judgment of the first information, improves the accuracy of data acquisition, realizes accurate judgment of data acquisition by adopting a multi-time verification mode, and improves the information acquisition efficiency.
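A back-to-front pass of this kind might look like the Python sketch below. It assumes that each selection ends at the last, second-to-last, and subsequent characters and is compared in its natural reading order, and that the coincidence rate is the fraction of equal positions; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import Iterator


def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions holding the same character (assumed definition)."""
    if not standard or len(window) != len(standard):
        return 0.0
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def backward_windows(record: str, n: int) -> Iterator[str]:
    """Yield n-character windows ending at the last, second-to-last, ... character."""
    if n <= 0:
        return
    for end in range(len(record), n - 1, -1):
        yield record[end - n:end]


def contains_key_backward(record: str, standard: str,
                          threshold: float = 0.9) -> bool:
    """Second verification pass that walks the record from back to front."""
    return any(coincidence_rate(w, standard) > threshold
               for w in backward_windows(record, len(standard)))
```

Running this pass in addition to the front-to-back pass is one way to realize the multiple verification mentioned above.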
Specifically, in the comparison process, if the first information Mi is compared k times and 0.2 × k of the comparison results indicate that the first information Mi contains the key character string, it is determined that the first information Mi does not contain the characters of the standard character string.
Specifically, according to the embodiment of the invention, whether the first information contains the key character string is judged for multiple times, and if the first information contains the key character string in the comparison result of a certain number of times, the first information in the first information base is comprehensively judged, so that the judgment accuracy is improved, and the data acquisition accuracy is further improved.
Specifically, when the n characters selected from the first information Mi are compared with the characters of the standard character string and the coincidence rate is less than or equal to 90%, the first character position at which they differ is found, n characters are reselected starting from that first differing position, and the reselected n characters are compared with the characters of the standard character string. If the coincidence rate is then higher than 90%, the first information contains the key character string; if it is still less than or equal to 90%, further judgment is required.
Specifically, in the process of determining whether the standard character string is included, the positions with the difference are selected for re-selection, so that re-comparison is realized, the comparison accuracy is improved, the information acquisition is prevented from being mistaken and missed, and the information acquisition accuracy is improved.
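The re-selection step can be sketched in Python as follows. The sketch assumes the coincidence rate is the fraction of equal positions and that the reselected characters start at the first differing position of the original selection; the return labels and function names are hypothetical.

```python
def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions holding the same character (assumed definition)."""
    if not standard or len(window) != len(standard):
        return 0.0
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def recheck_from_first_difference(record: str, start: int, standard: str,
                                  threshold: float = 0.9) -> str:
    """Compare one window; on failure, retry from the first differing position."""
    n = len(standard)
    window = record[start:start + n]
    if coincidence_rate(window, standard) > threshold:
        return "contains"
    # Offset of the first position where the window and the standard differ.
    diff = next((i for i, (a, b) in enumerate(zip(window, standard)) if a != b),
                len(window))
    retry = record[start + diff:start + diff + n]
    if coincidence_rate(retry, standard) > threshold:
        return "contains"
    # The text defers this case to a further judgment step.
    return "needs_further_check"
```

For the record `infinformation` and the key `information`, the first selection differs at its fourth character, and the reselected characters starting there match the key exactly.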
Specifically, when it is necessary to further determine whether the first information contains the key character string, an approximate information base is used. The approximate information base contains a plurality of approximate character strings of the key character string, namely Y1, Y2 … Yn, each of which is a character string similar or close to the key character string. The first information is further judged against the approximate information base to determine whether it contains a field similar or close to the key character string; if so, the first information is regarded as containing the key character string, and if not, it is determined that the first information does not contain the key character string.
Specifically, the embodiment of the invention sets the similar character string of the key character string, establishes the approximate information base similar to the key character string, realizes secondary screening of information which does not accord with the key character string, changes the data structure of the key word, improves the accuracy of information acquisition, utilizes the approximate character string to perform comparison, realizes secondary correction of the result of data acquisition, and improves the accuracy and comprehensiveness of data acquisition.
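The secondary screening against the approximate information base can be sketched in Python as below. The sketch assumes each approximate character string is compared against selections of its own length and that the coincidence rate is the fraction of equal positions; the example strings and function names are hypothetical, and the conversion code applied to the approximate strings is not modeled because the text does not specify it.

```python
from typing import Iterable


def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions holding the same character (assumed definition)."""
    if not standard or len(window) != len(standard):
        return 0.0
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def contains_via_approximations(record: str, approximations: Iterable[str],
                                threshold: float = 0.9) -> bool:
    """Return True if any window of the record matches any approximate string."""
    for approx in approximations:
        n = len(approx)
        if n == 0 or len(record) < n:
            continue  # this approximate string cannot fit in the record
        for start in range(len(record) - n + 1):
            if coincidence_rate(record[start:start + n], approx) > threshold:
                return True
    return False
```

A record that fails the comparison with the key character string itself would be passed to this check, and collected only if it returns `True`.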
Specifically, during the comparison a conversion code is added to each approximate character string, so that the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n. n characters are selected starting from the 1st character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
n characters are selected starting from the 2nd character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively, and the same criterion is applied;
n characters are selected starting from the kth character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively, and the same criterion is applied;
n characters are selected from back to front starting from the last character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively, and the same criterion is applied;
n characters are selected from back to front starting from the second-to-last character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively, and the same criterion is applied;
and n characters are selected from back to front starting from the kth-to-last character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string.
Specifically, in the embodiment of the present invention, the plurality of pieces of first information in the first information base are compared one by one. When a piece of first information does not contain the key character string itself, the approximate character strings and their converted forms are used to check whether it contains a character string similar or close to the key character string. If it does, the first information is regarded as containing the key character string and needs to be collected. This makes the judgment of the information accurate and thereby improves both the accuracy and the efficiency of data collection.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A big data information acquisition system, comprising:
a first database for storing data collected from external data into the first database through a first transmission channel at a transmission speed V10;
the acquisition module is connected with the first database through a second transmission channel and used for screening data matched with the key character string from the first database according to the input key character string, the acquisition module comprises an establishing unit and a screening unit, the establishing unit is used for establishing the second transmission channel, and the screening unit is used for transmitting the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20;
the display module is used for displaying the screened data in a classified manner so as to visually display the data matched with the key character strings;
the central control module is respectively connected with the first database, the acquisition module and the display module;
when data in the first database is screened, the structure of the input key character string is adjusted according to the data quantity stored in the first database, if the data quantity in the first database is less than the first quantity n1, the structure of the key character string is reduced, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is increased;
if the second number n2 > the data amount in the first database ≥ the first number n1, the structure of the key character string, the first transmission speed and the second transmission speed do not need to be adjusted;
if the data amount in the first database is larger than or equal to the second number n2, the structure of the key character string is increased, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is reduced.
2. The big data information acquisition system according to claim 1, wherein
when data collected from external data is stored in a first database, the database name of a database to be monitored is arranged in the first database, if data is added or reduced in the database corresponding to the database name to be monitored, mirror image adjustment is also carried out in the first database, so that data content in the first database corresponds to data change in the database corresponding to the database name to be monitored in real time, and when mirror image adjustment is carried out, if the first database and the database corresponding to the database name to be monitored belong to different networks, a heterogeneous network transmission channel is established;
and if the first database and the database corresponding to the name of the database to be monitored belong to the same local area network, establishing a communication network transmission channel.
3. The big data information acquisition system according to claim 2, wherein
a detection period matrix T (T1, T2 and T3) is arranged in a first database, wherein T1 represents a first detection period, T2 represents a second detection period, T3 represents a third detection period, T1> T2> T3, a plurality of names of databases to be monitored are arranged, and according to the update frequency of the database corresponding to the name of the database to be monitored, if the update frequency of the database corresponding to the name of the database to be monitored belongs to a first frequency f1, the first detection period T1 is adopted to monitor whether the database is updated or not;
if the updating frequency of the database corresponding to the name of the database to be monitored belongs to the second frequency f2, monitoring whether the database is updated or not by adopting a second detection period T2;
if the update frequency of the database corresponding to the name of the database to be monitored belongs to the third frequency f3, a third detection period T3 is adopted to monitor whether the database corresponding to the name of the database to be monitored is updated, wherein f1 < f2 < f3.
4. The big data information acquisition system according to claim 3, wherein
the first database comprises a plurality of pieces of information N1, N2 and N3 … Nn, the length of each piece of information is L1, L2 and L3 … Ln respectively, the length of a key character string is set to be a standard character string length lx, when whether the information in the first database contains the key character string or not is determined, the length of each piece of information is compared with the length of the standard character string respectively, if Li < the length lx of the standard character string, the information does not contain the key character string, and the information does not need to be collected;
if Li is larger than or equal to the length lx of the standard character string, establishing a first information base matrix M (M1, M2 … Mk) for the information meeting the length requirement, wherein k is smaller than n; when the first information base is judged, n characters are selected starting from the 1st character in the first information Mi, the n characters are compared with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is larger than 90%, the first information Mi contains the key character string;
selecting n characters starting from the 2nd character in the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is more than 90%, indicating that the first information Mi contains the key character string;
and selecting n characters starting from the kth character in the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is more than 90%, indicating that the first information Mi contains the key character string.
5. The big data information acquisition system according to claim 4, wherein
when determining whether the first information in the first information base contains the key character string, the method further comprises the following steps: selecting n characters from back to front starting from the last character in the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is more than 90%, indicating that the first information Mi contains the key character string;
selecting n characters from back to front starting from the second-to-last character in the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is more than 90%, indicating that the first information Mi contains the key character string;
and selecting n characters from back to front starting from the kth-to-last character in the first information Mi, comparing the n characters with the characters of the standard character string, and if the character coincidence rate of the n characters with the standard character string is more than 90%, indicating that the first information Mi contains the key character string.
6. The big data information acquisition system according to claim 5, wherein in the comparison process, if the first information Mi is compared k times and 0.2 × k of the comparison results indicate that the first information Mi contains the key character string, it is determined that the first information Mi does not contain the characters of the standard character string.
7. The big data information acquisition system according to claim 6, wherein when the n characters selected from the first information Mi are compared with the characters of the standard character string and the coincidence rate of the n characters with the characters of the standard character string is less than or equal to 90%, the first character position with a difference is found, n characters are reselected starting from the first difference position and compared with the characters of the standard character string; if the coincidence rate is higher than 90%, the first information contains the key character string, and if the coincidence rate is less than or equal to 90%, further determination is required.
8. The big data information acquisition system according to claim 7, wherein
when it needs to be further judged whether the first information contains the key character string, an approximate information base is used, the approximate information base containing a plurality of approximate character strings of the key character string, namely Y1, Y2 … Yn, wherein the approximate character strings are character strings similar or close to the key character string; the first information is further judged according to the approximate information base to determine whether the first information contains a field similar or close to the key character string; if yes, the first information contains the key character string, and if not, it is determined that the first information does not contain the key character string.
9. The big data information acquisition system according to claim 8, wherein
when the comparison is carried out, a conversion code is added to each approximate character string, the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n, n characters are selected starting from the 1st character in the first information Mi, the n characters are compared with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
selecting n characters starting from the 2nd character in the first information Mi, comparing the n characters with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, indicating that the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
selecting n characters starting from the kth character in the first information Mi, comparing the n characters with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, indicating that the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
selecting n characters from back to front starting from the last character in the first information Mi, comparing the n characters with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, indicating that the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
selecting n characters from back to front starting from the second-to-last character in the first information Mi, comparing the n characters with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, indicating that the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string;
and selecting n characters from back to front starting from the kth-to-last character in the first information Mi, comparing the n characters with the approximate character strings Y11, Y12 … Y1n respectively, and if the coincidence rate of the n characters with the approximate character strings is more than 90%, indicating that the first information Mi contains information similar or close to the key character string and is regarded as containing the key character string.
CN202110653306.7A 2021-06-11 2021-06-11 Big data information acquisition system Active CN113392286B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110653306.7A CN113392286B (en) 2021-06-11 2021-06-11 Big data information acquisition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110653306.7A CN113392286B (en) 2021-06-11 2021-06-11 Big data information acquisition system

Publications (2)

Publication Number Publication Date
CN113392286A CN113392286A (en) 2021-09-14
CN113392286B true CN113392286B (en) 2022-02-11

Family

ID=77620573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110653306.7A Active CN113392286B (en) 2021-06-11 2021-06-11 Big data information acquisition system

Country Status (1)

Country Link
CN (1) CN113392286B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114146388B (en) * 2022-02-07 2022-05-03 北京新赛点体育投资股份有限公司 Data processing system and method based on big data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10615848B1 (en) * 2018-09-28 2020-04-07 The Boeing Company Predictive analytics for broadband over power line data

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10438288B1 (en) * 2014-06-06 2019-10-08 Marstone, Inc. Multidimensional asset management tag pivot apparatuses, methods and systems
CN104661275A (en) * 2014-12-11 2015-05-27 北京邮电大学 Method for transmitting data in opportunity network
US10169434B1 (en) * 2016-01-31 2019-01-01 Splunk Inc. Tokenized HTTP event collector
CN205561907U (en) * 2016-04-05 2016-09-07 辽宁卓异装备制造股份有限公司 City underground pipe network monitoring devices
CN106131017A (en) * 2016-07-14 2016-11-16 何钟柱 Cloud computing information security visualization system based on trust computing
CN107122222A (en) * 2017-04-20 2017-09-01 深圳大普微电子科技有限公司 The search system and method for a kind of character string
CN110741637A (en) * 2017-04-28 2020-01-31 阿斯卡瓦公司 Performing multi-dimensional search and content-associative retrieval by lossless reduction of data using a base data filter and performing lossless reduction of data that has been losslessly reduced using the base data filter
US10599885B2 (en) * 2017-05-10 2020-03-24 Oracle International Corporation Utilizing discourse structure of noisy user-generated content for chatbot learning
CN206906295U (en) * 2017-06-02 2018-01-19 西安达效软件有限公司 A kind of big data information gathering and processing system
CN107368517A (en) * 2017-06-02 2017-11-21 上海恺英网络科技有限公司 A kind of method and apparatus of high amount of traffic inquiry
CN107222370A (en) * 2017-07-11 2017-09-29 王焱华 A kind of big data plateform system
CN108234347A (en) * 2017-12-29 2018-06-29 北京神州绿盟信息安全科技股份有限公司 A kind of method, apparatus, the network equipment and storage medium for extracting feature string
WO2020087239A1 (en) * 2018-10-30 2020-05-07 北京比特大陆科技有限公司 Big data computing acceleration system
CN112532755A (en) * 2021-02-18 2021-03-19 广州汇图计算机信息技术有限公司 Interest list pushing system based on heterogeneous information network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Data collection architecture for Big Data - a framework for a research";Wout Hofman;《research gate》;20150830;第1-9页 *
"SQL注入行为实时在线智能检测技术研究";李铭 等;《湖南大学学报(自然科学版)》;20200831;第31-41页 *

Also Published As

Publication number Publication date
CN113392286A (en) 2021-09-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant