CN110515895B - Method and system for carrying out associated storage on data files in big data storage system

Info

Publication number: CN110515895B
Application number: CN201910810845.XA
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Beijing Yanshan Electronic Equipment Factory
Current assignee: Beijing Yanshan Electronic Equipment Factory
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2023-06-23
Anticipated expiration: 2039-08-30
Also published as: CN110515895A

Abstract

The invention discloses a method and a system for associative storage of data files in a big data storage system. The method comprises: logically dividing the storage area of the big data storage system; storing each new data file in the base storage area as a base data file; forming a business topic set for the base data file from the business topics whose association degree with the base data file is greater than or equal to a topic association degree threshold, and moving the base data file to the temporary storage area as a temporary data file; determining the strongly and/or weakly associated data file set associated with each business topic of the temporary data file and generating a topic matching set for the temporary data file; updating the topic matching set of the associated temporary data file with the determined matching level; and moving a specific temporary data file to the strongly or weakly associated data file set corresponding to the business topic and file set type to which it belongs.

Description

Method and system for carrying out associated storage on data files in big data storage system

Technical Field

The present invention relates to the field of big data storage, and more particularly, to a method and system for associative storage of data files in a big data storage system.

Background

Currently, as data volumes and the demand for data analysis keep growing, big data storage systems have become basic infrastructure for many users. Such systems can typically store large amounts of data and thereby provide storage support for data processing. However, existing big data storage systems have no effective way to store data files according to their relevance; they simply store the data supplied by the user as the user requests. This makes data storage inefficient.

Disclosure of Invention

According to one aspect of the present invention, there is provided a method of associative storage of data files in a large data storage system, the method comprising:

determining a plurality of business topics related to a plurality of data businesses for carrying out data analysis on data files in a big data storage system;

semantic extraction is performed on each of the plurality of business topics to determine a plurality of tag items associated with each business topic;

Logically dividing a storage area of a big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;

storing each new data file received by the big data storage system through the interface device in the basic storage area and recording the initial storage time of each new data file, generating summary data for each new data file and using each new data file stored in the basic storage area as a basic data file;

forming a business topic set of the basic data file by the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold, and moving the basic data file to the temporary storage area and using the basic data file as a temporary data file;

determining a strong and/or weak association data file set associated with each business topic associated with the temporary data file in the strong and/or weak association storage area and generating a topic matching set of the temporary data file;

copying the temporary data files into strong and/or weak associated data file sets of corresponding service topics in the topic matching set and marking the temporary data files as recommended files, and acquiring feedback information of service request equipment on the recommended files in the strong and/or weak associated data file sets of the corresponding service topics;

determining the temporary data file associated with received feedback information, processing the matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updating the topic matching set of the associated temporary data file with the determined matching level;

and determining the business topic and file set type to which a specific temporary data file belongs once its storage time in the temporary storage area reaches the time threshold, and moving the specific temporary data file to the strongly or weakly associated data file set, in the strong or weak association storage area, that corresponds to that business topic and file set type.

The big data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic;

the semantic extraction of each of the plurality of business topics to determine a plurality of tag items associated with each business topic includes:

semantic extraction is carried out on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and the plurality of keywords capable of describing each business topic are determined to be a plurality of tag items associated with each business topic;

The method further comprises, before logically dividing the storage area of the big data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area:

determining whether the storage area of the big data storage system has already been logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area; if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, performing no processing.

Generating summary data for each new data file includes: summarizing the file content of each new data file to generate the summary data of that data file.

According to another aspect of the present invention, there is provided a system for associative storage of data files in a big data storage system, the system comprising:

the analysis equipment is used for determining a plurality of business topics related to a plurality of data businesses for carrying out data analysis on the data files in the big data storage system;

extracting device, which performs semantic extraction on each service topic in the plurality of service topics to determine a plurality of tag items associated with each service topic;

Dividing equipment for logically dividing a storage area of the big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;

recording means for storing each new data file received by the large data storage system through the interface means in the base storage area and recording a start storage time of each new data file, generating summary data for each new data file and using each new data file stored in the base storage area as a base data file;

the computing equipment forms a business topic set of the basic data file by the business topic with the association degree with the basic data file being greater than or equal to the topic association degree threshold value, and moves the basic data file to the temporary storage area and is used as a temporary data file;

processing means for determining a set of strong and/or weak associated data files associated with each business topic associated with the temporary data file in the strong and/or weak associated storage area and generating a topic matching set of temporary data files;

the marking device copies the temporary data file into a strong and/or weak associated data file set of a corresponding service theme in the theme matching set and marks the temporary data file as a recommended file, and obtains feedback information of the service request device on the recommended file in the strong and/or weak associated data file set of the corresponding service theme;

Updating equipment, which determines a temporary data file associated with the received feedback information, processes the matching grade in the theme matching set of the associated temporary data file based on the received feedback information, and updates the theme matching set of the associated temporary data file by utilizing the determined matching grade;

and the mobile equipment determines the business topic and file set type to which a specific temporary data file belongs once its storage time in the temporary storage area reaches the time threshold, and moves the specific temporary data file into the strongly or weakly associated data file set, in the strong or weak association storage area, that corresponds to that business topic and file set type.

Before the logical division, it is determined whether the storage area of the big data storage system has already been logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area; if not, the storage area of the big data storage system is logically divided into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, no processing is performed.

Generating summary data for each new data file includes: the file content of each new data file is summarized to generate summary data for each new data file.

Drawings

FIG. 1 is a flow chart of a method of associative storage of data files in a large data storage system in accordance with the present invention;

FIG. 2 is a schematic diagram of the storage area of the big data storage system of the present invention; and

FIG. 3 is a schematic diagram of a system for associative storage of data files in a large data storage system according to the present invention.

Detailed Description

FIG. 1 is a flow chart of a method 100 of storing data files in association in a large data storage system in accordance with the present invention.

In step 101, a plurality of business topics relating to a plurality of data services that perform data analysis on data files within the big data storage system are determined.

Specifically, the description information of each of a plurality of data services that performed data analysis on data files in the big data storage system within a first predetermined time interval is obtained. The description information of each data service is parsed to determine the data topic of that data service, yielding a plurality of data topics. Identical data topics among the plurality of data topics are merged and counted to determine the count of each data topic, and the data topics whose count exceeds a count threshold are selected as business topics, thereby determining the plurality of business topics.

The first predetermined time interval is 5, 10, 20, 30, 50, 80, 100, or 200 natural days. It is a period whose end date is the natural day immediately before the natural day containing the current time and whose start date is an earlier natural day; it comprises a plurality of natural days, at least 5 natural days, or an integer number of natural days. The big data storage system stores a plurality of data files and divides the stored data files into a plurality of data file sets, each data file set having an associated or attributed business topic.
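By way of non-limiting illustration, the boundaries of the first predetermined time interval might be computed as sketched below. The function name `analysis_window` and the choice to count both the start and end dates as part of the interval are assumptions made for this example; the description only fixes the end date as the previous natural day.

```python
from datetime import date, timedelta
from typing import Tuple


def analysis_window(today: date, num_days: int) -> Tuple[date, date]:
    """Return (start_date, end_date), both inclusive, of the analysis window.

    Assumption: the window holds exactly `num_days` natural days and ends
    on the natural day before `today`.
    """
    if num_days < 1:
        raise ValueError("num_days must be a positive integer")
    end_date = today - timedelta(days=1)                 # previous natural day
    start_date = end_date - timedelta(days=num_days - 1)
    return start_date, end_date
```

For a current date of 2019-08-30 and a 5-day interval, this sketch yields the window from 2019-08-25 through 2019-08-29.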

The data service is a service initiated by the service request device for data analysis for a plurality of data files in the big data storage system. Each data service has description information, and the description information of the data service includes: an identifier of the service request device, a network address of the service request device, a topic name of the data service, and a service domain of the data service.

Parsing the description information of each data service to determine the data topic of each data service includes: parsing the description information of each data service to obtain the topic name and service domain of that data service, and combining the topic name and the service domain to determine the data topic of the data service.

At least one data topic occurs more than once among the plurality of data topics. Merging and counting identical data topics among the plurality of data topics to determine the count of each data topic comprises: merging and counting identical data topics to determine the count of each distinct data topic. The count threshold is 2, 3, 5, 8, 10, 15, or 20.
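As a minimal sketch of step 101, the merging and counting of identical data topics could look as follows. The class `ServiceDescription`, the `domain/name` string used to combine the service domain with the topic name, and the function names are all hypothetical; the description does not prescribe how the two fields are combined.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ServiceDescription:
    """Description information of one data service (field names assumed)."""
    device_id: str
    network_address: str
    topic_name: str
    service_domain: str


def data_topic(desc: ServiceDescription) -> str:
    # Assumption: the data topic is the service domain joined to the topic name.
    return f"{desc.service_domain}/{desc.topic_name}"


def select_business_topics(descriptions: Iterable[ServiceDescription],
                           count_threshold: int) -> List[str]:
    """Keep the data topics whose count exceeds the count threshold."""
    counts = Counter(data_topic(d) for d in descriptions)
    return sorted(topic for topic, n in counts.items() if n > count_threshold)
```

Here "exceeds" is read as strictly greater than the count threshold, so a data topic seen exactly `count_threshold` times is not selected.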

In step 102, semantic extraction is performed on each of the plurality of business topics to determine a plurality of tag items associated with each business topic. This includes: performing semantic extraction on each of the plurality of business topics to obtain a plurality of keywords capable of describing that business topic, and determining those keywords to be the plurality of tag items associated with the business topic.
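The description does not specify a particular semantic extraction technique. Purely as a stand-in, the sketch below derives tag items by plain keyword splitting with a small stop-word list; the function `extract_tag_items` and the contents of `STOPWORDS` are assumptions for illustration and are not the extraction method of the invention.

```python
import re
from typing import List

# Assumed stop-word list; a real system would use a fuller one.
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "on", "the", "to"})


def extract_tag_items(business_topic: str) -> List[str]:
    """Split a business topic into lowercase keywords, dropping stop words
    and duplicates while keeping first-seen order."""
    tags: List[str] = []
    for word in re.findall(r"[a-z0-9]+", business_topic.lower()):
        if word not in STOPWORDS and word not in tags:
            tags.append(word)
    return tags
```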

In step 103, the storage area of the big data storage system is logically divided into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area.

When the storage area of the big data storage system is logically divided into a base storage area, a strong association storage area, a weak association storage area and a temporary storage area: the base storage area is used for storing new data files received by the big data storage system through an interface device; the strong association storage area is used for storing a plurality of existing strongly associated data file sets, each having a strong association degree with any one business topic; the weak association storage area is used for storing a plurality of existing weakly associated data file sets, each having a weak association degree with any one business topic; and the temporary storage area is used for temporarily storing data files undergoing the association test.

The method further comprises, before logically dividing the storage area of the big data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area: determining whether the storage area of the big data storage system has already been logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area; if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, performing no processing.

The big data storage system receives new data files from a data source device through the interface device. Among the plurality of data file sets in the big data storage system, a data file set having a strong association degree with any one business topic is taken as a strongly associated data file set and stored in the strong association storage area. A data file set having a weak association degree with any one business topic is taken as a weakly associated data file set and stored in the weak association storage area.

Every data file in a strongly associated data file set that has a strong association degree with a first business topic itself has a strong association degree with the first business topic, where a strong association degree is an association degree greater than the file association degree threshold. Every data file in a weakly associated data file set that has a weak association degree with a first business topic itself has a weak association degree with the first business topic, where a weak association degree is an association degree less than or equal to the file association degree threshold. The file association degree threshold is 50%, 60%, 70% or 80%.
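The distinction between strong and weak association degrees can be illustrated with the short sketch below. The function name `file_set_type`, the representation of association degrees as fractions between 0 and 1, and the default threshold of 0.6 (one of the listed values) are assumptions for this example.

```python
def file_set_type(association_degree: float, file_threshold: float = 0.6) -> str:
    """Classify an association degree as 'strong' or 'weak'.

    Strong means strictly greater than the file association degree
    threshold; weak means less than or equal to it.
    """
    if not 0.0 <= association_degree <= 1.0:
        raise ValueError("association_degree must lie in [0, 1]")
    return "strong" if association_degree > file_threshold else "weak"
```

An association degree exactly equal to the threshold therefore falls on the weak side.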

In step 104, each new data file received by the big data storage system through the interface device is stored in the base storage area and the start storage time of each new data file is recorded; summary data is generated for each new data file, and each new data file stored in the base storage area is used as a base data file. Generating summary data for each new data file includes: summarizing the file content of each new data file to generate the summary data of that data file.
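Step 104 might be sketched as follows, under stated assumptions: the `DataFile` record, the in-memory list standing in for the base storage area, and the `summarize` helper (which merely truncates whitespace-normalized content) are hypothetical. The description does not say how the file content is summarized.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class DataFile:
    name: str
    content: str
    summary: str
    start_storage_time: datetime
    area: str = "base"  # assumed label for the base storage area


def summarize(content: str, max_chars: int = 80) -> str:
    # Stand-in summary: collapse whitespace and keep the leading characters.
    return " ".join(content.split())[:max_chars]


def ingest(name: str, content: str, now: datetime,
           base_area: List[DataFile]) -> DataFile:
    """Store a new file in the base area, recording its start storage time."""
    data_file = DataFile(name, content, summarize(content), now)
    base_area.append(data_file)
    return data_file
```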

In step 105, the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold are combined into a business topic set of the basic data file, and the basic data file is moved to the temporary storage area and used as a temporary data file.

Specifically, when the accumulated storage time of the base data file in the base storage area reaches a second predetermined time interval, association degree calculation is performed on the summary data of the base data file and the plurality of tag items of each business topic to determine the association degree between the base data file and each business topic. The business topics whose association degree with the base data file is greater than or equal to the topic association degree threshold form the business topic set of the base data file. When the business topic set is not empty, the base data file is moved to the temporary storage area, and the base data file moved to the temporary storage area is used as a temporary data file.

The second predetermined time interval is 10 hours, 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, or 720 hours.
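A minimal sketch of the check that triggers the association degree calculation is given below. The function name `is_due` and the use of wall-clock timestamps to measure the accumulated storage time are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_due(start_storage_time: datetime, now: datetime,
           interval_hours: int) -> bool:
    """True once the file has been stored for at least the given interval."""
    return now - start_storage_time >= timedelta(hours=interval_hours)
```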

Performing association degree calculation on the summary data of the base data file and the plurality of tag items of each business topic to determine the association degree between the base data file and each business topic comprises:

determining, based on semantic matching, keyword matching or text matching, the association degree between the summary data of the base data file and each of the plurality of tag items of each business topic, and taking the average of these association degrees over the tag items of a business topic as the association degree between the base data file and that business topic.

The topic association degree threshold is 50%, 60%, 70%, 80% or 90%. When the business topic set of the base data file is empty, the start storage time of the base data file is reset to the current time.
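The averaging described above can be illustrated with the following sketch, which uses the simplest of the three listed options, keyword matching. The per-tag score of 1.0 or 0.0, the function names, and the default threshold of 0.6 are assumptions for this example; semantic or text matching would replace `tag_association`.

```python
from statistics import mean
from typing import Dict, List, Set


def tag_association(summary: str, tag: str) -> float:
    # Assumed keyword matching: full score if the tag occurs in the summary.
    return 1.0 if tag.lower() in summary.lower() else 0.0


def topic_association(summary: str, tags: List[str]) -> float:
    """Average the per-tag association degrees for one business topic."""
    if not tags:
        return 0.0
    return mean(tag_association(summary, tag) for tag in tags)


def business_topic_set(summary: str, topic_tags: Dict[str, List[str]],
                       topic_threshold: float = 0.6) -> Set[str]:
    """Topics whose association degree is >= the topic threshold."""
    return {topic for topic, tags in topic_tags.items()
            if topic_association(summary, tags) >= topic_threshold}
```

If the returned set is empty, the base data file stays in the base storage area and its start storage time is reset, as stated above; otherwise it becomes a temporary data file.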

In step 106, a set of strong and/or weak associated data files associated with each business topic associated with the temporary data file is determined in the strong and/or weak associated storage area and a topic match set for the temporary data file is generated.

Specifically, the business topic set of each temporary data file moved into the temporary storage area is parsed to determine at least one business topic associated with the temporary data file. The strongly associated data file set and/or weakly associated data file set associated with each of the at least one business topic is determined in the strong association storage area and/or the weak association storage area, and a topic matching set of the temporary data file is generated. The topic matching set comprises at least one matching triplet, each in the format <business topic, file set type, matching level>, wherein the file set type comprises: strong association and weak association; and the initial value of the matching level is 0.

The set of business topics for each temporary data file moved into the temporary storage area includes at least one business topic. Determining a set of strongly-associated data files and/or a set of weakly-associated data files associated with each of the at least one business topic in the strongly-associated storage area and/or the weakly-associated storage area and generating a topic-matched set of temporary data files comprises:

retrieving in the strong association storage area and/or the weak association storage area based on each of the at least one business topic to determine a strong association data file set and/or a weak association data file set associated with each of the at least one business topic and generating a topic matching set of temporary data files.

Each business topic has one or no strongly associated data file set; each business topic has one or no weakly associated data file set. The same business topic may have a strongly associated data file set and/or a weakly associated data file set.
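A possible way to build the topic matching set of step 106 is sketched below. Representing each matching triplet as a dictionary entry keyed by (business topic, file set type) with the matching level as its value, the labels "strong" and "weak", and the function name are assumptions for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# One entry per matching triplet: (business topic, file set type) -> level.
MatchingSet = Dict[Tuple[str, str], int]


def build_topic_matching_set(topic_set: Iterable[str],
                             strong_topics: Set[str],
                             weak_topics: Set[str]) -> MatchingSet:
    """Create one triplet, at level 0, per existing file set of each topic.

    `strong_topics` / `weak_topics` are the topics that currently have a
    strongly / weakly associated data file set in the respective area.
    """
    matching_set: MatchingSet = {}
    for topic in topic_set:
        if topic in strong_topics:
            matching_set[(topic, "strong")] = 0
        if topic in weak_topics:
            matching_set[(topic, "weak")] = 0
    return matching_set
```

A topic with both a strong and a weak file set contributes two triplets, and a topic with neither contributes none.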

In step 107, the temporary data file is copied to the strong and/or weak associated data file set of the corresponding service theme in the theme matching set and marked as a recommended file, and feedback information of the service request device on the recommended file in the strong and/or weak associated data file set of the corresponding service theme is obtained.

The temporary data file is copied into the strongly associated data file set and/or weakly associated data file set of the business topic in each matching triplet of the topic matching set and marked as a recommended file, and a correspondence between each recommended file and the temporary data file is established. When it is determined that a data service from the service request device uses the strongly associated data file set and/or weakly associated data file set of the corresponding business topic in the strong and/or weak association storage area, feedback information of the service request device on the recommended files in that data file set is obtained.

Copying the temporary data file into the strongly associated data file set and/or weakly associated data file set of the business topic in each matching triplet of the topic matching set and marking it as a recommended file comprises: determining each matching triplet in the topic matching set of the temporary data file; determining each target data file set of the temporary data file based on the business topic and file set type in each matching triplet; and copying the temporary data file into each target data file set and marking the copy as a recommended file, wherein each target data file set may be a strongly associated data file set or a weakly associated data file set.

When a data service of the service request device uses a strongly associated data file set and/or weakly associated data file set in the big data storage system, the service request device determines feedback information for each recommended file in the data file sets involved in the data service, wherein the feedback information is one of: irrelevant, uncertain, or relevant.
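The copying and marking of step 107 might be sketched as follows. File sets are modeled as in-memory lists of file identifiers, and the identifier scheme for recommended copies, the `source_of` correspondence table, and the function name `recommend` are all hypothetical.

```python
from typing import Dict, List, Tuple

FileSetKey = Tuple[str, str]  # (business topic, file set type)


def recommend(temp_file_id: str,
              matching_set: Dict[FileSetKey, int],
              file_sets: Dict[FileSetKey, List[str]],
              source_of: Dict[str, str]) -> List[str]:
    """Copy a temporary file into every target file set as a recommended file.

    `source_of` records which temporary file each recommended copy came
    from, so that later feedback can be traced back to it.
    """
    created: List[str] = []
    for topic, file_set_type in matching_set:
        copy_id = f"{temp_file_id}#rec:{topic}:{file_set_type}"  # assumed naming
        file_sets.setdefault((topic, file_set_type), []).append(copy_id)
        source_of[copy_id] = temp_file_id
        created.append(copy_id)
    return created
```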

At step 108, a temporary data file associated with the received feedback information is determined, and a matching level in a topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level.

Specifically, a temporary data file associated with the received feedback information is determined based on a correspondence between the recommended file and the temporary data file, and a matching level of a matching triplet in a topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level (or the processed matching level).

The feedback information is irrelevant, uncertain, or relevant. Processing the matching level of the matching triplet in the topic matching set of the associated temporary data file based on the received feedback information comprises: when the feedback information is irrelevant, subtracting 1 from the matching level in the corresponding matching triplet in the topic matching set of the associated temporary data file; when the feedback information is uncertain, keeping that matching level unchanged; when the feedback information is relevant, adding 1 to that matching level.

Updating the topic matching set of the associated temporary data file with the determined matching level comprises: updating at least one matching triplet in the topic matching set of the associated temporary data file with the determined matching level.
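The matching level update of step 108 can be illustrated as below. The dictionary representation of the topic matching set, keyed by (business topic, file set type), and the lowercase feedback labels are assumptions for this sketch.

```python
from typing import Dict, Tuple

# Change applied to the matching level for each kind of feedback.
FEEDBACK_DELTA = {"irrelevant": -1, "uncertain": 0, "relevant": 1}


def apply_feedback(matching_set: Dict[Tuple[str, str], int],
                   topic: str, file_set_type: str, feedback: str) -> int:
    """Update one matching triplet in place and return its new level."""
    if feedback not in FEEDBACK_DELTA:
        raise ValueError(f"unknown feedback: {feedback!r}")
    key = (topic, file_set_type)
    matching_set[key] += FEEDBACK_DELTA[feedback]
    return matching_set[key]
```

Because levels start at 0 and irrelevant feedback subtracts 1, a matching level may become negative in this sketch.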

In step 109, the business topic and file set type to which a specific temporary data file belongs are determined once its storage time in the temporary storage area reaches the time threshold, and the specific temporary data file is moved to the strongly or weakly associated data file set, in the strong or weak association storage area, that corresponds to that business topic and file set type.

Specifically, when the time for which the specific temporary data file has been stored in the temporary storage area reaches the time threshold, the business topic and file set type to which the specific temporary data file belongs are determined according to its topic matching set, and the specific temporary data file is moved to the strongly associated data file set or weakly associated data file set, in the strong association storage area or the weak association storage area, that corresponds to that business topic and file set type.

The time threshold is 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, 720 hours, or 1000 hours. Determining the business topic and file set type to which the specific temporary data file belongs according to its topic matching set comprises: determining the matching level in each matching triplet in the topic matching set of the specific temporary data file, and taking the business topic and file set type in the matching triplet with the highest matching level as the business topic and file set type to which the specific temporary data file belongs; when at least two matching triplets in the topic matching set share the highest matching level, one of them is selected at random, and its business topic and file set type are taken as those to which the specific temporary data file belongs. Each strongly associated data file set has an associated or attributed business topic; each strongly associated data file set has one business topic. Each weakly associated data file set has an associated or attributed business topic; each weakly associated data file set has one business topic.
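The selection rule of step 109 might be sketched as follows. The dictionary representation of the topic matching set, the function name `choose_target`, and the optional `rng` parameter (added only so that the random tie-break can be made reproducible) are assumptions for illustration.

```python
import random
from typing import Dict, Tuple


def choose_target(matching_set: Dict[Tuple[str, str], int],
                  rng=random) -> Tuple[str, str]:
    """Return the (business topic, file set type) with the highest level.

    Ties for the highest matching level are broken at random.
    """
    if not matching_set:
        raise ValueError("topic matching set is empty")
    best = max(matching_set.values())
    candidates = sorted(key for key, level in matching_set.items()
                        if level == best)
    return rng.choice(candidates)
```

The temporary data file would then be moved into the file set identified by the returned pair.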

FIG. 2 is a schematic diagram of a storage area of the big data storage system of the present invention. The storage area 200 includes: a base storage area 201, a temporary storage area 202, a strongly-associated storage area 203, and a weakly-associated storage area 204. The base storage area 201 is used to store new data files received by the large data storage system via the interface device. The strong-association storage area 203 is used for storing a plurality of existing strong-association data file sets with strong association degree with any one business theme. The weak association storage area 204 is used for storing a plurality of existing weak association data file sets with weak association degree with any one business theme. The temporary storage area 202 is used for temporarily storing data files subjected to the association test.

FIG. 3 is a schematic diagram of a system 300 for associative storage of data files in a large data storage system according to the present invention. The system 300 includes: parsing device 301, extracting device 302, dividing device 303, recording device 304, computing device 305, processing device 306, marking device 307, updating device 308, and mobile device 309.

The parsing device 301 determines a plurality of business topics relating to a plurality of data services that perform data analysis on data files within the big data storage system.

The extraction device 302 performs semantic extraction on each of the plurality of business topics to determine a plurality of tag items associated with each business topic. This includes: performing semantic extraction on each of the plurality of business topics to obtain a plurality of keywords capable of describing that business topic, and determining those keywords to be the plurality of tag items associated with the business topic.

The dividing device 303 logically divides the storage area of the large data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area.

The recording device 304 stores each new data file received by the big data storage system through the interface device in the base storage area and records the start storage time of each new data file, generates summary data for each new data file, and uses each new data file stored in the base storage area as a base data file. Generating summary data for each new data file includes: summarizing the file content of each new data file to generate the summary data of that data file.

The computing device 305 forms the business topic set of the base data file from the business topics whose association degree with the base data file is greater than or equal to the topic association degree threshold, moves the base data file to the temporary storage area, and uses it as a temporary data file.

Processing device 306 determines a set of strong and/or weak associated data files associated with each business topic associated with the temporary data file in the strong and/or weak associated storage area and generates a set of topic matches for the temporary data file.

The marking device 307 copies the temporary data file to the strong and/or weak associated data file set of the corresponding service theme in the theme matching set and marks the temporary data file as a recommended file, and obtains feedback information of the service request device on the recommended file in the strong and/or weak associated data file set of the corresponding service theme.

The update device 308 determines the temporary data file associated with the received feedback information, processes the matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updates the topic matching set of the associated temporary data file with the determined matching level.

Specifically, a temporary data file associated with the received feedback information is determined based on the correspondence between the recommended file and the temporary data file, and the matching level of the matching triplet in the topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level (processed matching level).
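
The feedback step can be sketched as follows. This passage does not specify how feedback changes the matching level, so the rule used here (positive feedback adds 1, negative feedback subtracts 1) is an assumption, as are the function name `apply_feedback` and the dictionary layouts.

```python
def apply_feedback(match_sets, recommendations, feedback):
    """Update one matching level from one piece of feedback.

    match_sets:      temp_file_id -> {(topic, file_set_type): matching_level}
    recommendations: recommended_file_id -> (temp_file_id, topic, file_set_type)
    feedback:        (recommended_file_id, is_positive)
    Returns the id of the temporary data file that was updated.
    """
    rec_id, is_positive = feedback
    # Resolve the temporary data file via its recommended-file correspondence.
    temp_id, topic, kind = recommendations[rec_id]
    delta = 1 if is_positive else -1  # assumed scoring rule
    match_sets[temp_id][(topic, kind)] += delta
    return temp_id


# Illustrative data only.
match_sets = {"tmp-1": {("billing", "strong"): 0, ("billing", "weak"): 0}}
recommendations = {
    "rec-a": ("tmp-1", "billing", "strong"),
    "rec-b": ("tmp-1", "billing", "weak"),
}
updated = apply_feedback(match_sets, recommendations, ("rec-a", True))
apply_feedback(match_sets, recommendations, ("rec-b", False))
```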

The moving device 309 determines the business topic and the file set type to which a specific temporary data file belongs, the specific temporary data file being one stored in the temporary storage area whose storage time has reached the time threshold, and moves the specific temporary data file to the strongly or weakly associated data file set, in the strongly or weakly associated storage area, that corresponds to the assigned business topic and file set type.
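
A minimal sketch of the final move is shown below. It assumes that the business topic and file set type to which a file belongs are taken from the matching triplet with the highest matching level; this passage does not state the selection rule, so that choice, the name `plan_moves`, and the data layout are all hypothetical.

```python
def plan_moves(temp_files, match_sets, now, time_threshold):
    """Decide where each expired temporary data file should be moved.

    temp_files: temp_file_id -> time at which storage started
    match_sets: temp_file_id -> {(topic, file_set_type): matching_level}
    Returns temp_file_id -> (topic, file_set_type) for files to move.
    """
    moves = {}
    for file_id, start_time in temp_files.items():
        if now - start_time < time_threshold:
            continue  # storage time has not yet reached the threshold
        levels = match_sets.get(file_id)
        if not levels:
            continue  # no matching triplets, nothing to decide
        # Assumed rule: the highest matching level determines the destination.
        moves[file_id] = max(levels, key=levels.get)
    return moves


# Illustrative data only.
temp_files = {"f1": 0, "f2": 90}
match_sets = {
    "f1": {("billing", "strong"): 2, ("billing", "weak"): -1},
    "f2": {("logistics", "weak"): 5},
}
moves = plan_moves(temp_files, match_sets, now=100, time_threshold=50)
```

Here `f1` has been stored for 100 time units and is assigned to the strongly associated data file set of the `billing` topic, while `f2` stays in the temporary storage area.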

Claims

1. A method of associative storage of data files in a large data storage system, the method comprising:

determining, in the strongly and/or weakly associated storage area, a strongly and/or weakly associated data file set associated with each business topic associated with the temporary data file, and generating a topic matching set of the temporary data file; wherein a data file set, among a plurality of data file sets in the big data storage system, that has a strong degree of association with any one business topic is taken as a strongly associated data file set and is stored in the strongly associated storage area; a data file set, among the plurality of data file sets in the big data storage system, that has a weak degree of association with any one business topic is taken as a weakly associated data file set and is stored in the weakly associated storage area; the topic matching set comprises at least one matching triplet, and each matching triplet has the format <business topic, file set type, matching level>, wherein the file set type comprises strong association and weak association, and the initial value of the matching level is 0;

2. The method of claim 1, wherein the large data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or assigned business topic.

3. The method of claim 1, wherein semantically extracting each of the plurality of business topics to determine a plurality of tag items associated with each business topic comprises:

performing semantic extraction on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and determining the plurality of keywords capable of describing each business topic as the plurality of tag items associated with each business topic.

4. The method of claim 1, further comprising, prior to logically dividing the storage area of the big data storage system into the base storage area, the strongly-associated storage area, the weakly-associated storage area, and the temporary storage area:

5. The method of any of claims 1-4, wherein generating summary data for each new data file comprises: summarizing the file content of each new data file to generate the summary data for each new data file.

6. A system for associative storage of data files in a large data storage system, the system comprising:

processing means for determining, in the strongly and/or weakly associated storage area, a strongly and/or weakly associated data file set associated with each business topic associated with the temporary data file, and generating a topic matching set of the temporary data file; wherein a data file set, among a plurality of data file sets in the big data storage system, that has a strong degree of association with any one business topic is taken as a strongly associated data file set and is stored in the strongly associated storage area; a data file set, among the plurality of data file sets in the big data storage system, that has a weak degree of association with any one business topic is taken as a weakly associated data file set and is stored in the weakly associated storage area; the topic matching set comprises at least one matching triplet, and each matching triplet has the format <business topic, file set type, matching level>, wherein the file set type comprises strong association and weak association, and the initial value of the matching level is 0;

7. The system of claim 1, wherein the large data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or assigned business topic.

8. The system of claim 1, the semantic extraction of each of the plurality of business topics to determine a plurality of tag items associated with each business topic comprising:

9. The system of claim 1, further comprising, prior to logically dividing the storage area of the big data storage system into the base storage area, the strongly-associated storage area, the weakly-associated storage area, and the temporary storage area:

10. The system of claim 1, wherein generating summary data for each new data file comprises: summarizing the file content of each new data file to generate the summary data for each new data file.