CN113392286B

CN113392286B - Big data information acquisition system

Info

Publication number: CN113392286B
Application number: CN202110653306.7A
Authority: CN
Inventors: 邢家辉; 黄毓桦
Original assignee: Shenzhen Hongbo Information Technology Co ltd
Current assignee: Shenzhen Hongbo Information Technology Co ltd
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2022-02-11
Anticipated expiration: 2041-06-11
Also published as: CN113392286A

Abstract

The invention relates to a big data information acquisition system comprising: a first database for storing data collected from external data, the data being transferred into the first database through a first transmission channel at a first transmission speed V10; an acquisition module connected to the first database through a second transmission channel and used for screening, from the first database, data matching an input key character string, the acquisition module comprising an establishing unit for establishing the second transmission channel and a screening unit for transmitting the screened data to the acquisition module through the second transmission channel at a second transmission speed V20; and a central control module connected to the first database, the acquisition module and a display module, respectively. By adjusting the first transmission speed and the second transmission speed, the transmission speeds are matched to the amount of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.

Description

Big data information acquisition system

Technical Field

The invention relates to the technical field of data processing, in particular to a big data information acquisition system.

Background

Data is a representation of facts, concepts or instructions that can be processed by manual or automated means. The basic purpose of data processing is to extract and derive data that is valuable and meaningful to particular people from large amounts of data that may be disordered and hard to interpret.

For enterprises, the data generated and processed every day is voluminous and complex. Whether the acquired data is comprehensive and whether it is processed correctly both affect enterprise decisions; if the data is incomplete or processed improperly, the enterprise may suffer serious consequences or even irrecoverable losses.

In the prior art, how to acquire useful data from massive data has become a concern of every large enterprise. Related data can be obtained through a search engine, but it cannot be determined whether the data has been collected comprehensively.

Disclosure of Invention

Therefore, the invention provides a big data information acquisition system which can solve the problem of incomplete information acquisition.

In order to achieve the above object, the present invention provides a big data information collecting system, including:

a first database for storing data collected from external data, the data being transferred into the first database through a first transmission channel at a first transmission speed V10;

the acquisition module is connected with the first database through a second transmission channel and used for screening data matched with the key character string from the first database according to the input key character string, the acquisition module comprises an establishing unit and a screening unit, the establishing unit is used for establishing the second transmission channel, and the screening unit is used for transmitting the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20;

the display module is used for displaying the screened data in a classified manner so as to visually display the data matched with the key character strings;

the central control module is respectively connected with the first database, the acquisition module and the display module;

when data in the first database is screened, the structure of the input key character string is adjusted according to the data quantity stored in the first database, if the data quantity in the first database is less than the first quantity n1, the structure of the key character string is reduced, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is increased;

if the data amount in the first database is greater than or equal to the first number n1 and less than the second number n2 (n1 ≤ data amount < n2), the structure of the key character string, the first transmission speed and the second transmission speed do not need to be adjusted;

if the data amount in the first database is larger than or equal to the second number n2, the structure of the key character string is increased, the first transmission speed of the first transmission channel is increased, and the second transmission speed of the second transmission channel is reduced.
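The three data-amount ranges can be illustrated with a minimal sketch. The text does not say how the "structure" of the key character string is reduced or increased, nor by how much the speeds change, so the sketch stands in a key length for the structure and uses a hypothetical `step` and `factor`; the middle range is read as n1 ≤ data amount < n2. All names here are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    key_length: int  # assumed stand-in for the "structure" of the key character string
    v1: float        # first transmission speed (V10 before adjustment)
    v2: float        # second transmission speed (V20 before adjustment)


def adjust(settings: Settings, data_amount: int, n1: int, n2: int,
           step: int = 1, factor: float = 1.5) -> Settings:
    """Return adjusted settings for the three data-amount ranges.

    `step` and `factor` are hypothetical; the source gives no magnitudes.
    """
    if n1 >= n2:
        raise ValueError("expected n1 < n2")
    if data_amount < n1:
        # Small database: reduce the key string, raise both speeds.
        return Settings(max(1, settings.key_length - step),
                        settings.v1 * factor, settings.v2 * factor)
    if data_amount < n2:
        # n1 <= data amount < n2: nothing is adjusted.
        return settings
    # Large database: enlarge the key string, raise V1, lower V2.
    return Settings(settings.key_length + step,
                    settings.v1 * factor, settings.v2 / factor)
```

Returning a new `Settings` value rather than mutating the input keeps the pre-adjustment speeds V10 and V20 available to the caller.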

Further, when data collected from external data is stored in a first database, the database name of the database to be monitored is set in the first database, if data is added or reduced in the database corresponding to the database name to be monitored, mirror image adjustment is also performed in the first database, so that data content in the first database corresponds to data change in the database corresponding to the database name to be monitored in real time, and when mirror image adjustment is performed, if the first database and the database corresponding to the database name to be monitored belong to different networks, a heterogeneous network transmission channel is established;

and if the first database and the database corresponding to the name of the database to be monitored belong to the same local area network, establishing a communication network transmission channel.

Further, a detection period matrix T (T1, T2, T3) is arranged in the first database, where T1 denotes a first detection period, T2 denotes a second detection period, T3 denotes a third detection period, and T1> T2> T3, a plurality of names of databases to be monitored are arranged, and according to the update frequency of the database corresponding to the name of the database to be monitored, if the update frequency of the database corresponding to the name of the database to be monitored belongs to the first frequency f1, the first detection period T1 is used to monitor whether the database corresponding to the name of the database to be monitored is updated;

if the updating frequency of the database corresponding to the name of the database to be monitored belongs to the second frequency f2, monitoring whether the database is updated or not by adopting a second detection period T2;

if the update frequency of the database corresponding to the name of the database to be monitored belongs to the third frequency f3, a third detection period T3 is adopted to monitor whether the database corresponding to the name of the database to be monitored is updated, wherein f1 < f2 < f3.
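The mapping from update frequency to detection period can be sketched as follows. The text only says that an update frequency "belongs to" f1, f2 or f3, so this sketch treats them as three bands separated by two hypothetical boundaries (`f_bounds`); the boundaries and the function name are assumptions, not values from the source.

```python
from typing import Tuple


def pick_detection_period(update_frequency: float,
                          f_bounds: Tuple[float, float],
                          periods: Tuple[float, float, float]) -> float:
    """Map an update frequency to one of the detection periods (T1, T2, T3).

    Lower update frequencies get the longer period, matching
    f1 < f2 < f3 and T1 > T2 > T3.
    """
    t1, t2, t3 = periods
    if not t1 > t2 > t3:
        raise ValueError("expected T1 > T2 > T3")
    low, high = f_bounds
    if low >= high:
        raise ValueError("expected increasing frequency boundaries")
    if update_frequency < low:    # band f1: rarely updated
        return t1
    if update_frequency < high:   # band f2
        return t2
    return t3                     # band f3: frequently updated
```

For example, with boundaries of 1 and 10 updates per day and periods of 3600, 600 and 60 seconds, a database updated 5 times per day would be polled every 600 seconds.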

Further, the first database includes a plurality of pieces of information N1, N2, N3 … Nn, whose lengths are L1, L2, L3 … Ln respectively, and the length of the key character string is taken as the standard character string length ln. When determining whether a piece of information in the first database includes the key character string, the length Li of that piece of information is compared with the standard character string length ln; if Li < ln, the information cannot include the key character string and does not need to be collected;

if Li ≥ ln, a first information base matrix M (M1, M2 … Mk), where k is smaller than n, is established from the pieces of information that meet the length requirement. When judging the first information base, n characters are selected starting from the 1st character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;

n characters are selected starting from the 2nd character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;

and n characters are selected starting from the kth character of the first information Mi and compared with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.
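The length pre-filter and the window-by-window comparison can be sketched as below. The text does not define how the character coincidence rate is computed; this sketch assumes a position-by-position comparison between the selected window and the standard character string, with the window length equal to the length of the standard character string. The function names are illustrative only.

```python
def coincidence_rate(window: str, standard: str) -> float:
    """Fraction of positions at which window and standard hold the same character."""
    matches = sum(a == b for a, b in zip(window, standard))
    return matches / len(standard)


def contains_key(info: str, standard: str, threshold: float = 0.9) -> bool:
    """Slide a window of len(standard) over info, starting at the 1st, 2nd, ... character.

    Returns True as soon as one window's coincidence rate exceeds the threshold.
    """
    n = len(standard)
    if n == 0 or len(info) < n:
        return False  # length pre-filter: information shorter than the key cannot contain it
    return any(coincidence_rate(info[i:i + n], standard) > threshold
               for i in range(len(info) - n + 1))
```

With an 11-character key such as "acquisition", a window that differs in one position scores 10/11 (about 0.91) and is accepted, while two differing positions give 9/11 (about 0.82) and are rejected.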

Further, determining whether the first information in the first information base contains the key character string further comprises: selecting n characters from back to front, starting from the last character of the first information Mi, and comparing the n characters with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;

and selecting n characters from back to front, starting from the second-to-last character of the first information Mi, and comparing the n characters with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.

Further, in the comparison process, if the first information Mi is compared k times and 0.2 × k of the comparison results indicate that the first information Mi includes the key character string, it is determined that the first information Mi does not include the characters of the standard character string.

Further, when the n characters selected from the first information Mi are compared with the characters of the standard character string and the coincidence rate is less than or equal to 90%, the first character position that differs is found, n characters are reselected starting from that position, and the reselected characters are compared with the characters of the standard character string; if the coincidence rate is higher than 90%, the first information includes the key character string, and if the coincidence rate is less than or equal to 90%, further judgment is required.
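One possible reading of the reselection step is sketched below: when a window fails the 90% test, the window is restarted at the first position where it differs from the standard character string and compared once more. The position-by-position coincidence rate and the three-way result strings are assumptions made for illustration.

```python
def match_rate(window: str, standard: str) -> float:
    """Position-by-position coincidence rate (assumed definition)."""
    return sum(a == b for a, b in zip(window, standard)) / len(standard)


def recheck(info: str, start: int, standard: str, threshold: float = 0.9) -> str:
    """Compare the window at `start`; on failure, retry from the first differing position."""
    if not standard:
        raise ValueError("standard character string must not be empty")
    n = len(standard)
    window = info[start:start + n]
    if match_rate(window, standard) > threshold:
        return "contains"
    # First position where the window and the standard string differ;
    # a window that ran out of characters is treated as differing at its end.
    diff = next((i for i, (a, b) in enumerate(zip(window, standard)) if a != b),
                len(window))
    retry = info[start + diff:start + diff + n]
    if match_rate(retry, standard) > threshold:
        return "contains"
    return "further_check"  # hand over to the approximate-string comparison
```

For instance, `recheck("ddatabase", 0, "database")` fails on the first window, finds the first difference at index 1, and succeeds on the reselected window "database".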

Further, when it is necessary to further determine whether the first information includes the key character string, an approximate information base is used, which includes a plurality of approximate character strings Y1, Y2 … Yn of the key character string, the approximate character strings being character strings similar or closely related to the key character string. The first information is further judged against the approximate information base to determine whether it includes a field similar or closely related to the key character string; if so, the first information is determined to include the key character string, and if not, the first information is determined not to include the key character string.

Further, during comparison, a conversion code is added to each approximate character string, so that the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n. n characters are selected starting from the 1st character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string;

n characters are selected starting from the 2nd character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string;

n characters are selected starting from the kth character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string;

n characters are selected from back to front, starting from the last character of the first information Mi, and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string;

and n characters are selected starting from the kth character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string.
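The secondary screening against the approximate character strings can be sketched as follows. The text does not specify what the conversion code is, so the sketch exposes it as a hypothetical `convert` callable that defaults to case folding and, as a further assumption, applies it to the information as well as to each approximate character string. Each approximate character string is compared with windows of its own length.

```python
from typing import Callable, Iterable


def window_matches(info: str, pattern: str, threshold: float) -> bool:
    """True if any window of len(pattern) in info exceeds the coincidence threshold."""
    n = len(pattern)
    if n == 0 or len(info) < n:
        return False
    return any(
        sum(a == b for a, b in zip(info[i:i + n], pattern)) / n > threshold
        for i in range(len(info) - n + 1)
    )


def contains_approximate(info: str, approximations: Iterable[str],
                         convert: Callable[[str], str] = str.casefold,
                         threshold: float = 0.9) -> bool:
    """Treat info as containing the key if any converted approximate string matches.

    `convert` is a placeholder for the conversion code, which the source leaves undefined.
    """
    converted_info = convert(info)
    return any(window_matches(converted_info, convert(y), threshold)
               for y in approximations)
```

Usage: `contains_approximate("New Patent Filing rules", ["patent application", "patent filing"])` accepts the information because the second approximate string matches after conversion.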

Compared with the prior art, the invention has the following advantages. A first quantity and a second quantity are set, the data amount in the first database is compared with them, and the structure of the key character string input to the acquisition module, the first transmission speed and the second transmission speed are adjusted according to the comparison result. The acquisition module is thus adjusted to the data amount in the database when acquiring data, so that the acquired data is more accurate. Adjusting the first transmission speed and the second transmission speed matches the transmission speeds to the amount of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.

In particular, the period of mirror adjustment is determined by the update frequency of the database corresponding to the monitored database name set in the first database. If the update frequency is high and the update period is short, the mirror adjustment period is correspondingly short, so that mirror backup is performed promptly when the database is updated and the first database is kept up to date; if the update frequency is low and the update period is long, the mirror adjustment period is long, which reduces the number of backups and the transmission load on the first transmission channel.

In particular, by establishing the detection period matrix T (T1, T2, T3) and selecting the detection period according to the update frequency of the database corresponding to the name of the database to be monitored, the first database is built and supplemented automatically, which improves its richness and proactivity, ensures that it is continuously updated, and ensures comprehensive data acquisition.

In particular, searching the pieces of information in the first database by the key character string yields the data matching the key character string and improves data acquisition efficiency.

In particular, judging whether the first information contains the key character string allows the first information to be assessed accurately; verifying the result multiple times further improves the accuracy of data acquisition and the efficiency of information acquisition.

In particular, whether the first information contains the key character string is judged multiple times, and the first information in the first information base is judged comprehensively according to how many of the comparison results indicate that it contains the key character string, which improves the judgment accuracy and, in turn, the accuracy of data acquisition.

In particular, when determining whether the standard character string is contained, characters are reselected from the position where a difference occurs and compared again, which improves comparison accuracy, prevents errors and omissions in information collection, and improves its accuracy.

In particular, similar character strings of the key character string are set and an approximate information base is established from them, so that information that does not match the key character string undergoes a secondary screening. This varies the data structure of the key words, and comparing against the approximate character strings provides a secondary correction of the acquisition results, improving the accuracy and comprehensiveness of data acquisition.

Drawings

Fig. 1 is a schematic structural diagram of a big data information acquisition system according to an embodiment of the present invention.

Detailed Description

In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.

It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.

Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, fixed, detachable, or integral; it may be mechanical or electrical; and elements may be connected directly, indirectly through intervening media, or internally between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.

Referring to Fig. 1, a big data information acquisition system according to an embodiment of the present invention includes: a first database 100 for storing data collected from external data, the data being transferred into the first database through a first transmission channel at a first transmission speed V10;

the acquisition module 200 is connected with the first database through a second transmission channel and is used for screening data matched with the key character string from the first database according to the input key character string, the acquisition module comprises an establishing unit 201 and a screening unit 202, the establishing unit is used for establishing the second transmission channel, and the screening unit is used for transmitting the data screened from the first database to the acquisition module through the second transmission channel at a second transmission speed V20;

the display module 300 is used for displaying the screened data in a classified manner so as to visually display the data matched with the key character strings;

the central control module 400 is respectively connected with the first database, the acquisition module and the display module.

Specifically, the big data information acquisition system provided by the embodiment of the invention can be applied to collecting enterprise innovation information. The first database comprises a patent information database and a policy information database; each item of patent information may be stored as a data link code, and each item of policy information may be stored as a link to the web page where the policy is published. In practice, when data in the first database is collected, the user inputs a key character string, which is compared with the corresponding content in the first database to determine whether the data contains the key character string or a character string similar or closely related to it; this determines whether data collection based on the key character string is performed and keeps data collection efficient.

Specifically, a first quantity and a second quantity are set, the data amount in the first database is compared with them, and the structure of the key character string input to the acquisition module, the first transmission speed and the second transmission speed are adjusted according to the comparison result. The acquisition module is thus adjusted to the data amount in the database when acquiring data, so that the acquired data is more accurate, and adjusting the first transmission speed and the second transmission speed matches the transmission speeds to the amount of data transferred during storage and during screening, which achieves efficient data transmission and improves data acquisition efficiency.

Specifically, when data collected from external data is stored in a first database, the first database is internally provided with a database name of a database to be monitored, if data is added or reduced in the database corresponding to the database name to be monitored, mirror image adjustment is also performed in the first database, so that data content in the first database corresponds to data change in the database corresponding to the database name to be monitored in real time, and when mirror image adjustment is performed, if the first database and the database corresponding to the database name to be monitored belong to different networks, a heterogeneous network transmission channel is established;

Specifically, the embodiment of the present invention determines the period of performing mirror image adjustment by the update frequency of the database corresponding to the monitored database name in the first database, and if the update frequency in the database is high and the update period is short, the period of time for performing mirror image adjustment is correspondingly short, so as to perform mirror image backup in time when the database is updated, thereby realizing the timely update of the first database; if the updating frequency is low, the updating period is long, the period time length for carrying out mirror image adjustment is long, the backup times are reduced, and the transmission pressure of the first transmission channel is reduced.

Specifically, a detection period matrix T (T1, T2, T3) is arranged in the first database, where T1 denotes a first detection period, T2 denotes a second detection period, T3 denotes a third detection period, and T1> T2> T3, a plurality of names of databases to be monitored are arranged, and according to the update frequency of the database corresponding to the name of the database to be monitored, if the update frequency of the database corresponding to the name of the database to be monitored belongs to the first frequency f1, the first detection period T1 is used to monitor whether the database corresponding to the name of the database to be monitored is updated;

Specifically, according to the embodiment of the invention, the detection period matrix T (T1, T2, T3) is established, and the corresponding detection period is selected according to the update frequency of the database corresponding to the name of the database to be monitored, so that the automatic construction of the first database is realized, the data is automatically supplemented, the richness and the initiative of the first database are improved, the continuous update of the first database is ensured, and the comprehensiveness of data acquisition is ensured.

Specifically, the first database includes a plurality of pieces of information N1, N2, N3 … Nn, whose lengths are L1, L2, L3 … Ln respectively, and the length of the key character string is taken as the standard character string length ln. When determining whether a piece of information in the first database includes the key character string, the length Li of that piece of information is compared with the standard character string length ln; if Li < ln, the information cannot include the key character string and does not need to be collected.

Specifically, according to the embodiment of the invention, the plurality of pieces of information in the first database are searched based on the key character strings, so that the data matched with the key character strings are acquired, and the data acquisition efficiency is improved.

Specifically, determining whether the first information in the first information base contains the key character string further includes: selecting n characters from back to front, starting from the last character of the first information Mi, and comparing the n characters with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string.

Specifically, the embodiment of the invention judges whether the first information contains the key character string, realizes accurate judgment of the first information, improves the accuracy of data acquisition, realizes accurate judgment of data acquisition by adopting a multi-time verification mode, and improves the information acquisition efficiency.

Specifically, in the comparison process, if the first information Mi is compared k times and 0.2 × k of the comparison results indicate that the first information Mi includes the key character string, it is determined that the first information Mi does not include the characters of the standard character string.

Specifically, according to the embodiment of the invention, whether the first information contains the key character string is judged for multiple times, and if the first information contains the key character string in the comparison result of a certain number of times, the first information in the first information base is comprehensively judged, so that the judgment accuracy is improved, and the data acquisition accuracy is further improved.

Specifically, when the n characters selected from the first information Mi are compared with the characters of the standard character string and the coincidence rate is less than or equal to 90%, the first character position that differs is found, n characters are reselected starting from that position, and the reselected characters are compared with the characters of the standard character string; if the coincidence rate is higher than 90%, the first information includes the key character string, and if the coincidence rate is less than or equal to 90%, further judgment is required.

Specifically, in the process of determining whether the standard character string is included, the positions with the difference are selected for re-selection, so that re-comparison is realized, the comparison accuracy is improved, the information acquisition is prevented from being mistaken and missed, and the information acquisition accuracy is improved.

Specifically, when it is necessary to further determine whether the first information includes the key character string, an approximate information base is used, which includes a plurality of approximate character strings Y1, Y2 … Yn of the key character string, the approximate character strings being character strings similar or closely related to the key character string. The first information is further judged against the approximate information base to determine whether it includes a field similar or closely related to the key character string; if so, the first information is determined to include the key character string, and if not, the first information is determined not to include the key character string.

Specifically, the embodiment of the invention sets similar character strings of the key character string and establishes an approximate information base from them, so that information that does not match the key character string undergoes a secondary screening. This varies the data structure of the key words, and comparing against the approximate character strings provides a secondary correction of the acquisition results, improving the accuracy and comprehensiveness of data acquisition.

Specifically, during comparison, a conversion code is added to each approximate character string, so that the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n. n characters are selected starting from the 1st character of the first information Mi and compared with the approximate character strings Y11, Y12 … Y1n respectively; if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or closely related to the key character string and is treated as containing the key character string.

Specifically, in the embodiment of the present invention, the pieces of first information in the first information base are compared one by one. When a piece of first information does not contain the key character string itself, the approximate character strings and their converted forms are used to relate the first information to character strings similar or closely related to the key character string; if the first information contains such a character string, it is treated as containing the key character string and needs to be collected. This makes the judgment of the information accurate, which improves both the accuracy and the efficiency of data acquisition.

So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but those skilled in the art will readily understand that the scope of the present invention is not limited to these specific embodiments. Those skilled in the art can make equivalent changes to or substitutions of the related technical features without departing from the principle of the invention, and the technical solutions resulting from such changes or substitutions fall within the protection scope of the invention.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A big data information acquisition system, comprising:

2. The big data information acquisition system according to claim 1, wherein

when data collected from external data is stored in the first database, the database name of a database to be monitored is set in the first database; if data is added to or removed from the database corresponding to that database name, a mirror adjustment is also made in the first database, so that the data content of the first database follows, in real time, the data changes in the database corresponding to the database name to be monitored; and, when the mirror adjustment is made, if the first database and the database corresponding to the database name to be monitored belong to different networks, a heterogeneous network transmission channel is established;

3. The big data information acquisition system according to claim 2, wherein

a detection period matrix T (T1, T2, T3) is set in the first database, wherein T1 represents a first detection period, T2 represents a second detection period, T3 represents a third detection period, and T1 > T2 > T3; a plurality of names of databases to be monitored are set, and the detection period is selected according to the update frequency of the database corresponding to each name: if the update frequency of the database corresponding to the name of the database to be monitored belongs to a first frequency f1, the first detection period T1 is used to monitor whether that database has been updated;

4. The big data information acquisition system according to claim 3, wherein

the first database comprises a plurality of pieces of information N1, N2, N3 … Nn, whose lengths are L1, L2, L3 … Ln respectively, and the length of the key character string is set as a standard character string length lx; when determining whether the information in the first database contains the key character string, the length of each piece of information is compared with the standard character string length; if Li < lx, the information does not contain the key character string and does not need to be collected;

if Li ≥ lx, a first information base matrix M (M1, M2 … Mk) is established from the information meeting the length requirement, wherein k < n; when the first information base is examined, n characters are selected starting from the 1st character of the first information Mi and compared with the characters of the standard character string, and if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;

5. The big data information acquisition system according to claim 4, wherein

determining whether the first information in the first information base contains the key character string further comprises: selecting n characters from the first information Mi from back to front, starting from its last character, and comparing the n characters with the characters of the standard character string; if the character coincidence rate between the n characters and the standard character string is greater than 90%, the first information Mi contains the key character string;

6. The big data information acquisition system according to claim 5, wherein, in the comparison process, if the first information Mi is compared k times and the result of 0.2 × k of those comparisons is that the first information Mi contains the key character string, it is determined that the first information Mi does not contain the characters of the standard character string.

7. The big data information acquisition system according to claim 6, wherein, when n characters selected from the first information Mi are compared with the characters of the standard character string and the coincidence rate is less than or equal to 90%, the first character position at which they differ is located, n characters are reselected starting from that position and compared with the characters of the standard character string; if the coincidence rate is then greater than 90%, the first information contains the key character string, and if the coincidence rate is less than or equal to 90%, further determination is required.

8. The big data information acquisition system according to claim 7, wherein

when it must be further determined whether the first information contains the key character string, an approximate information base is used, the approximate information base containing a plurality of approximate character strings Y1, Y2 … Yn of the key character string, each approximate character string being a character string similar or close to the key character string; the first information is further examined against the approximate information base to determine whether it contains a field similar or close to the key character string; if so, the first information is determined to contain the key character string, and if not, the first information is determined not to contain the key character string.

9. The big data information acquisition system according to claim 8, wherein

during the comparison, a conversion code is added to each approximate character string, so that the approximate character strings Y1, Y2 … Yn in the approximate information base are updated to Y11, Y12 … Y1n; n characters are selected starting from the 1st character of the first information Mi and compared with each of the approximate character strings Y11, Y12 … Y1n; and if the coincidence rate between the n characters and an approximate character string is greater than 90%, the first information Mi contains information similar or close to the key character string and is treated as containing the key character string;