CN110515895B - Method and system for carrying out associated storage on data files in big data storage system - Google Patents

Method and system for carrying out associated storage on data files in big data storage system

Info

Publication number
CN110515895B
Authority
CN
China
Prior art keywords
data file
storage area
data
association
temporary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910810845.XA
Other languages
Chinese (zh)
Other versions
CN110515895A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing yanshan electronic equipment factory
Original Assignee
Beijing yanshan electronic equipment factory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing yanshan electronic equipment factory
Priority to CN201910810845.XA
Publication of CN110515895A
Application granted
Publication of CN110515895B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating

Abstract

The invention discloses a method and a system for carrying out associated storage on data files in a big data storage system, wherein the method comprises the following steps: logically dividing a storage area of a big data storage system; storing each new data file in the base storage area and serving as a base data file; forming a business topic set of the basic data file by the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold, and moving the basic data file to the temporary storage area and using the basic data file as a temporary data file; determining a set of strongly and/or weakly associated data files associated with each business topic associated with the temporary data file and generating a topic-matched set of temporary data files; updating the topic matching set of the associated temporary data file with the determined matching level; the particular temporary data file is moved to a strongly or weakly associated data file set corresponding to the assigned business topic and file set type.

Description

Method and system for carrying out associated storage on data files in big data storage system
Technical Field
The present invention relates to the field of big data storage, and more particularly, to a method and system for associative storage of data files in a big data storage system.
Background
Currently, with growing data volumes and growing demand for data analysis, big data storage systems have become the infrastructure behind many user needs. Big data storage systems can typically store large amounts of data and thereby provide storage support for data processing. However, existing big data storage systems have no effective way to store data files according to their relevance; instead, the data provided by the user is stored directly as the user requests. This approach makes data storage inefficient.
Disclosure of Invention
According to one aspect of the present invention, there is provided a method of associative storage of data files in a large data storage system, the method comprising:
determining a plurality of business topics related to a plurality of data businesses for carrying out data analysis on data files in a big data storage system;
performing semantic extraction on each of the plurality of business topics to determine a plurality of tag items associated with each business topic;
logically dividing a storage area of the big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;
storing each new data file received by the big data storage system through the interface device in the basic storage area and recording the initial storage time of each new data file, generating summary data for each new data file and using each new data file stored in the basic storage area as a basic data file;
forming a business topic set of the basic data file by the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold, and moving the basic data file to the temporary storage area and using the basic data file as a temporary data file;
determining a strong and/or weak association data file set associated with each business topic associated with the temporary data file in the strong and/or weak association storage area and generating a topic matching set of the temporary data file;
copying the temporary data file into the strongly and/or weakly associated data file sets of the corresponding business topics in the topic matching set and marking each copy as a recommended file, and acquiring feedback information from the service request device on the recommended files in the strongly and/or weakly associated data file sets of the corresponding business topics;
determining the temporary data file associated with the received feedback information, processing a matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updating the topic matching set of the associated temporary data file with the determined matching level;
and, when the time for which a specific temporary data file has been stored in the temporary storage area reaches a time threshold, determining the business topic and the file set type to which the specific temporary data file belongs, and moving the specific temporary data file to the strongly or weakly associated data file set, in the strong-association or weak-association storage area, that corresponds to that business topic and file set type.
The big data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic;
the semantic extraction of each of the plurality of business topics to determine a plurality of tag items associated with each business topic includes:
semantic extraction is carried out on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and the plurality of keywords capable of describing each business topic are determined to be a plurality of tag items associated with each business topic;
The method further comprises, before logically dividing the storage area of the big data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area:
determining whether a storage area of the big data storage system is logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area, and if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if yes, no processing is performed.
Generating summary data for each new data file includes: summarizing the file content of each new data file to generate summary data of each new data file.
According to another aspect of the present invention, there is provided a system for associative storage of data files in a big data storage system, the system comprising:
a parsing device for determining a plurality of business topics related to a plurality of data businesses that perform data analysis on data files in the big data storage system;
an extracting device for performing semantic extraction on each of the plurality of business topics to determine a plurality of tag items associated with each business topic;
a dividing device for logically dividing a storage area of the big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;
a recording device for storing each new data file received by the big data storage system through the interface device in the basic storage area and recording the start storage time of each new data file, generating summary data for each new data file, and using each new data file stored in the basic storage area as a basic data file;
a computing device for forming a business topic set of the basic data file from the business topics whose association degree with the basic data file is greater than or equal to the topic association degree threshold, and moving the basic data file to the temporary storage area as a temporary data file;
a processing device for determining, in the strong-association and/or weak-association storage area, a strongly and/or weakly associated data file set associated with each business topic associated with the temporary data file, and generating a topic matching set of the temporary data file;
a marking device for copying the temporary data file into the strongly and/or weakly associated data file sets of the corresponding business topics in the topic matching set and marking each copy as a recommended file, and acquiring feedback information from the service request device on the recommended files in the strongly and/or weakly associated data file sets of the corresponding business topics;
an updating device for determining the temporary data file associated with the received feedback information, processing a matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updating the topic matching set of the associated temporary data file with the determined matching level;
and a mobile device for determining, when the time for which a specific temporary data file has been stored in the temporary storage area reaches a time threshold, the business topic and the file set type to which the specific temporary data file belongs, and moving the specific temporary data file to the strongly or weakly associated data file set, in the strong-association or weak-association storage area, that corresponds to that business topic and file set type.
The big data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic;
the semantic extraction of each of the plurality of business topics to determine a plurality of tag items associated with each business topic includes:
semantic extraction is carried out on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and the plurality of keywords capable of describing each business topic are determined to be a plurality of tag items associated with each business topic;
The method further comprises, before logically dividing the storage area of the big data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area:
determining whether a storage area of the big data storage system is logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area, and if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, no processing is performed.
Generating summary data for each new data file includes: the file content of each new data file is summarized to generate summary data for each new data file.
Drawings
FIG. 1 is a flow chart of a method of associative storage of data files in a large data storage system in accordance with the present invention;
FIG. 2 is a schematic diagram of the storage area of the big data storage system of the present invention; and
FIG. 3 is a schematic diagram of a system for associative storage of data files in a large data storage system according to the present invention.
Detailed Description
FIG. 1 is a flow chart of a method 100 of storing data files in association in a large data storage system in accordance with the present invention.
In step 101, a plurality of business topics relating to a plurality of data businesses that perform data analysis on data files within a large data storage system are determined.
Specifically, description information is obtained for each of the data services that performed data analysis on data files in the big data storage system within a first predetermined time interval. The description information of each data service is parsed to determine the data topic of that data service, yielding a plurality of data topics. Identical data topics are merged and counted to determine the count of each data topic, and the data topics whose counts exceed a count threshold are selected as business topics, thereby determining the plurality of business topics.
The first predetermined time interval is 5 natural days, 10 natural days, 20 natural days, 30 natural days, 50 natural days, 80 natural days, 100 natural days, or 200 natural days. The first predetermined time interval is a period of time with an end date being a previous natural day to the natural day on which the current time is located and a start date being a natural day that has elapsed, and the first predetermined time interval includes a plurality of natural days, or at least 5 natural days, or includes an integer number of natural days. The large data storage system stores a plurality of data files and divides the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic.
The data service is a service initiated by the service request device for data analysis for a plurality of data files in the big data storage system. Each data service has description information, and the description information of the data service includes: an identifier of the service request device, a network address of the service request device, a topic name of the data service, and a service domain of the data service.
Parsing the description information of each data service to determine a data topic of each data service includes: parsing the description information of each data service to obtain the topic name and the service domain of the data service, and combining the topic name and the service domain to determine the data topic of the data service.
At least one data topic occurs more than once among the plurality of data topics. Merging and counting identical data topics to determine the count of each data topic includes: merging identical data topics among the plurality of data topics and counting the occurrences of each distinct data topic. The count threshold is 2, 3, 5, 8, 10, 15 or 20.
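The merge-and-count selection in step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record field names (`device_id`, `address`, `topic_name`, `service_domain`) and the function name `select_business_topics` are assumptions introduced here, and a data topic is modeled simply as the pair of topic name and service domain.

```python
from collections import Counter

def select_business_topics(descriptions, count_threshold):
    """Return the data topics whose occurrence count exceeds count_threshold."""
    # A data topic combines the topic name and the service domain.
    data_topics = [(d["topic_name"], d["service_domain"]) for d in descriptions]
    counts = Counter(data_topics)  # merges identical data topics and counts them
    return sorted(topic for topic, n in counts.items() if n > count_threshold)

# Hypothetical description records; identifier and address are carried but unused here.
descriptions = [
    {"device_id": "dev-1", "address": "10.0.0.1", "topic_name": "sales", "service_domain": "retail"},
    {"device_id": "dev-2", "address": "10.0.0.2", "topic_name": "sales", "service_domain": "retail"},
    {"device_id": "dev-1", "address": "10.0.0.1", "topic_name": "sales", "service_domain": "retail"},
    {"device_id": "dev-3", "address": "10.0.0.3", "topic_name": "fraud", "service_domain": "finance"},
]
business_topics = select_business_topics(descriptions, count_threshold=2)
print(business_topics)
```

With a count threshold of 2, only the topic seen three times is kept; the topic seen once is discarded.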
At step 102, semantic extraction is performed on each of the plurality of business topics to determine a plurality of tag items associated with each business topic. Performing semantic extraction on each of the plurality of business topics to determine a plurality of tag items associated with each business topic includes: performing semantic extraction on each business topic to obtain a plurality of keywords capable of describing that business topic, and determining those keywords as the plurality of tag items associated with that business topic.
In step 103, the storage area of the big data storage system is logically divided into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area.
When the storage area of the big data storage system is logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area: the basic storage area is used to store new data files received by the big data storage system through an interface device; the strong-association storage area is used to store a plurality of existing strongly associated data file sets, each having a strong association degree with any one business topic; the weak-association storage area is used to store a plurality of existing weakly associated data file sets, each having a weak association degree with any one business topic; and the temporary storage area is used to temporarily store data files undergoing the association test.
Before logically dividing the storage area of the big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area, the method further comprises: determining whether the storage area of the big data storage system is already logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area; if not, logically dividing the storage area of the big data storage system into these four areas; if yes, no processing is performed.
The big data storage system receives new data files from a data source device through an interface device. A data file set, among the plurality of data file sets in the big data storage system, that has a strong association degree with any one business topic is taken as a strongly associated data file set and stored in the strong-association storage area. A data file set, among the plurality of data file sets in the big data storage system, that has a weak association degree with any one business topic is taken as a weakly associated data file set and stored in the weak-association storage area.
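The check-before-divide behavior described for step 103 amounts to an idempotent initialization. The sketch below assumes, purely for illustration, that the storage area is represented as a dictionary keyed by a hypothetical `Area` enum; the source does not say how the logical division is represented, and this sketch also creates any single missing area, which is a slight generalization of the all-or-nothing check in the text.

```python
from enum import Enum

class Area(Enum):
    BASE = "base"
    STRONG = "strong"
    WEAK = "weak"
    TEMPORARY = "temporary"

def ensure_divided(storage):
    """Create any missing logical area; return True if anything was created."""
    missing = [area for area in Area if area not in storage]
    for area in missing:
        storage[area] = {}  # each logical area starts empty
    return bool(missing)

storage = {}
first = ensure_divided(storage)   # divides the storage area
second = ensure_divided(storage)  # already divided, so nothing is done
```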
Every data file in a strongly associated data file set that has a strong association degree with a first business topic itself has a strong association degree with the first business topic, where a strong association degree is an association degree greater than the file association degree threshold. Every data file in a weakly associated data file set that has a weak association degree with the first business topic itself has a weak association degree with the first business topic, where a weak association degree is an association degree less than or equal to the file association degree threshold. The file association degree threshold is 50%, 60%, 70% or 80%.
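The boundary between strong and weak association can be stated in one comparison. In this sketch the function name `file_set_type` and the representation of association degrees as fractions between 0 and 1 are assumptions; the default of 0.6 is just one of the threshold values listed above.

```python
def file_set_type(association_degree, file_threshold=0.6):
    """Classify an association degree as 'strong' or 'weak'."""
    # Strong means strictly greater than the threshold;
    # a degree equal to the threshold counts as weak.
    return "strong" if association_degree > file_threshold else "weak"

examples = {degree: file_set_type(degree) for degree in (0.3, 0.6, 0.61, 0.9)}
```

Note that a degree exactly at the threshold falls on the weak side, as the text specifies.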
In step 104, each new data file received by the big data storage system through the interface device is stored in the basic storage area and the start storage time of each new data file is recorded, summary data is generated for each new data file, and each new data file stored in the basic storage area is used as a basic data file. Generating summary data for each new data file includes: summarizing the file content of each new data file to generate summary data for each new data file.
In step 105, the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold are combined into a business topic set of the basic data file, and the basic data file is moved to the temporary storage area and used as a temporary data file.
Specifically, when the accumulated storage time of the basic data file in the basic storage area reaches a second preset time interval, performing association degree calculation on summary data of the basic data file and a plurality of tag items of each business theme to determine association degree of the basic data file and each business theme, forming a business theme set of the basic data file by the business theme with the association degree of the basic data file being greater than or equal to a theme association degree threshold, and when the business theme set is not empty, moving the basic data file to the temporary storage area and using the basic data file moved to the temporary storage area as a temporary data file.
The second predetermined time interval is 10 hours, 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, or 720 hours.
Performing association calculation on summary data of a basic data file and a plurality of tag items of each business theme to determine association of the basic data file and each business theme comprises:
The association degree between the summary data of the basic data file and each of the tag items of a business topic is determined based on semantic matching, keyword matching or text matching, and the average of these per-tag association degrees is taken as the association degree between the basic data file and that business topic.
The topic relevance threshold is 50%, 60%, 70%, 80% or 90%. When the service theme set of the basic data file is empty, the initial storage time of the basic data file is modified to be the current time.
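The association calculation in step 105 can be sketched as below. The per-tag measure used here, a whole-word keyword match returning 1.0 or 0.0, is an assumption chosen for simplicity; the text permits semantic, keyword or text matching without fixing a formula. The record layout and all function names are likewise hypothetical, and each topic is assumed to have at least one tag item.

```python
from statistics import mean

def tag_association(summary, tag):
    """Hypothetical keyword match: 1.0 if the tag occurs as a word in the summary."""
    return 1.0 if tag.lower() in summary.lower().split() else 0.0

def topic_association(summary, tags):
    """Average the per-tag association degrees for one business topic."""
    return mean(tag_association(summary, tag) for tag in tags)

def evaluate_base_file(record, topic_tags, now, topic_threshold=0.5):
    """Build the business topic set; move the file or reset its start storage time."""
    topics = {topic for topic, tags in topic_tags.items()
              if topic_association(record["summary"], tags) >= topic_threshold}
    if topics:
        record["area"] = "temporary"  # becomes a temporary data file
        record["topics"] = topics
    else:
        record["start_time"] = now    # empty set: restart the waiting period
    return topics

topic_tags = {"retail sales": ["sales", "store", "revenue"],
              "fraud": ["fraud", "chargeback"]}
matched = {"summary": "quarterly store revenue report", "area": "base", "start_time": 0}
unmatched = {"summary": "weather station log", "area": "base", "start_time": 0}
matched_topics = evaluate_base_file(matched, topic_tags, now=100)
unmatched_topics = evaluate_base_file(unmatched, topic_tags, now=100)
```

Here the first summary matches two of the three "retail sales" tags, giving an association degree of about 0.67, which meets the 0.5 threshold; the second matches nothing, so the file stays in the basic storage area with its start storage time reset.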
In step 106, a set of strong and/or weak associated data files associated with each business topic associated with the temporary data file is determined in the strong and/or weak associated storage area and a topic match set for the temporary data file is generated.
Specifically, the business topic set of each temporary data file moved into the temporary storage area is parsed to determine at least one business topic associated with the temporary data file. A strongly associated data file set and/or a weakly associated data file set associated with each of the at least one business topic is determined in the strong-association storage area and/or the weak-association storage area, and a topic matching set of the temporary data file is generated. The topic matching set comprises at least one matching triplet, each in the format <business topic, file set type, matching level>, where the file set type is either strongly associated or weakly associated, and the initial value of the matching level is 0.
The set of business topics for each temporary data file moved into the temporary storage area includes at least one business topic. Determining a set of strongly-associated data files and/or a set of weakly-associated data files associated with each of the at least one business topic in the strongly-associated storage area and/or the weakly-associated storage area and generating a topic-matched set of temporary data files comprises:
retrieving in the strong association storage area and/or the weak association storage area based on each of the at least one business topic to determine a strong association data file set and/or a weak association data file set associated with each of the at least one business topic and generating a topic matching set of temporary data files.
Each business topic has at most one strongly associated data file set and at most one weakly associated data file set. The same business topic may have a strongly associated data file set and/or a weakly associated data file set.
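Generating the topic matching set in step 106 reduces to a lookup per business topic in each of the two areas. In the sketch below, the `MatchTriple` dataclass, the function name and the dictionaries mapping a topic to its file set are illustrative assumptions; only the triplet format and the initial matching level of 0 come from the text.

```python
from dataclasses import dataclass

@dataclass
class MatchTriple:
    topic: str
    set_type: str   # "strong" or "weak"
    level: int = 0  # the matching level starts at 0

def build_topic_matching_set(business_topics, strong_sets, weak_sets):
    """Emit one triplet per (topic, file set type) pair that actually exists."""
    matching = []
    for topic in sorted(business_topics):
        if topic in strong_sets:  # at most one strong set per topic
            matching.append(MatchTriple(topic, "strong"))
        if topic in weak_sets:    # at most one weak set per topic
            matching.append(MatchTriple(topic, "weak"))
    return matching

strong_sets = {"sales": ["a.csv"]}
weak_sets = {"sales": ["b.csv"], "fraud": ["c.csv"]}
matching_set = build_topic_matching_set({"sales", "fraud", "hr"}, strong_sets, weak_sets)
```

A topic with no data file set in either area, such as "hr" here, contributes no triplet.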
In step 107, the temporary data file is copied to the strong and/or weak associated data file set of the corresponding service theme in the theme matching set and marked as a recommended file, and feedback information of the service request device on the recommended file in the strong and/or weak associated data file set of the corresponding service theme is obtained.
The temporary data files are copied to strong association data file sets and/or weak association data file sets of corresponding service topics in each matching triplet in the topic matching set and marked as recommended files, corresponding relations between the recommended files and the temporary data files are established, and when the fact that the data service from the service request equipment uses the strong association data file sets and/or weak association data file sets of the corresponding service topics in the strong association storage area is determined, feedback information of the service request equipment on the recommended files in the strong association data file sets and/or the weak association data file sets of the corresponding service topics is obtained.
Copying the temporary data file into a strong-association data file set and/or a weak-association data file set of a corresponding business topic in each matching triplet in a topic matching set and marking as a recommended file comprises determining each matching triplet in the topic matching set of the temporary data file, determining each target data file set of the temporary data file based on the business topic and the file set type in each matching triplet, copying the temporary data file into each target data file set and marking as a recommended file, wherein each target data file set can be a strong-association data file set or a weak-association data file set.
When the data service of the service request device uses the strong association data file set and/or the weak association data file set in the big data storage system, the service request device determines feedback information of each recommended file in the strong association data file set and/or the weak association data file set related to the data service, wherein the feedback information comprises: uncorrelated, uncertain and correlated.
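The copy-and-mark part of step 107, together with the correspondence between each recommended file and its temporary data file, might look like the following. This is a sketch under stated assumptions: file sets are modeled as dictionaries, triplets as plain tuples, and the `copy_id` naming scheme is invented here only so that each copy has a distinct key.

```python
def recommend(temp_file_id, matching_set, strong_sets, weak_sets, correspondence):
    """Copy a temporary file into every target set named by its matching triplets."""
    for topic, set_type, _level in matching_set:
        target = strong_sets[topic] if set_type == "strong" else weak_sets[topic]
        copy_id = f"{temp_file_id}@{topic}/{set_type}"  # hypothetical naming scheme
        target[copy_id] = {"recommended": True}
        # The correspondence lets feedback on a copy be traced back to the temporary file.
        correspondence[copy_id] = (temp_file_id, topic, set_type)
    return correspondence

strong_sets = {"sales": {}}
weak_sets = {"sales": {}, "fraud": {}}
correspondence = {}
matching_set = [("fraud", "weak", 0), ("sales", "strong", 0)]
recommend("tmp-7", matching_set, strong_sets, weak_sets, correspondence)
```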
At step 108, a temporary data file associated with the received feedback information is determined, and a matching level in a topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level.
Specifically, a temporary data file associated with the received feedback information is determined based on a correspondence between the recommended file and the temporary data file, and a matching level of a matching triplet in a topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level (or the processed matching level).
The feedback information is one of uncorrelated, uncertain, or correlated. Processing the matching level of the matching triplet in the topic matching set of the associated temporary data file based on the received feedback information comprises: when the feedback information is uncorrelated, subtracting 1 from the matching level in the corresponding matching triplet in the topic matching set of the associated temporary data file; when the feedback information is uncertain, keeping the matching level in the corresponding matching triplet unchanged; when the feedback information is correlated, adding 1 to the matching level in the corresponding matching triplet.
Updating the topic matching set of the associated temporary data file with the determined matching level comprises: updating at least one matching triplet in the topic matching set of the associated temporary data file with the determined matching level.
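The feedback rule in step 108 is a three-way adjustment of the matching level. In this minimal sketch the topic matching set is stored as a dictionary from (business topic, file set type) to matching level, which is an assumed representation of the triplets, and the names `FEEDBACK_DELTA` and `apply_feedback` are hypothetical.

```python
# Adjustment applied to the matching level for each kind of feedback.
FEEDBACK_DELTA = {"uncorrelated": -1, "uncertain": 0, "correlated": 1}

def apply_feedback(matching_set, topic, set_type, feedback):
    """Adjust the matching level of the triplet identified by (topic, set_type)."""
    key = (topic, set_type)
    matching_set[key] += FEEDBACK_DELTA[feedback]
    return matching_set[key]

matching_set = {("sales", "strong"): 0, ("fraud", "weak"): 0}
for feedback in ["correlated", "correlated", "uncertain"]:
    apply_feedback(matching_set, "sales", "strong", feedback)
apply_feedback(matching_set, "fraud", "weak", "uncorrelated")
```

After two correlated responses and one uncertain response, the "sales" triplet stands at 2, while a single uncorrelated response leaves the "fraud" triplet at -1.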
In step 109, when the time for which a specific temporary data file has been stored in the temporary storage area reaches the time threshold, the business topic and the file set type to which the specific temporary data file belongs are determined, and the specific temporary data file is moved to the strongly or weakly associated data file set, in the strong-association or weak-association storage area, that corresponds to that business topic and file set type.
Specifically, when the time of storing the specific temporary data file in the temporary storage area reaches a time threshold value, determining the service theme and the file set type to which the specific temporary data file belongs according to the theme matching set of the specific temporary data file, and moving the specific temporary data file to a strong association data file set or a weak association data file set corresponding to the service theme and the file set type to which the specific temporary data file belongs in the strong association storage area or the weak association storage area.
The time threshold is 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, 720 hours, or 1000 hours. Determining the business topic and the file set type to which the specific temporary data file belongs according to the topic matching set of the specific temporary data file comprises: determining the matching level in each matching triplet in the topic matching set of the specific temporary data file, and taking the business topic and the file set type in the matching triplet with the largest matching level as the business topic and the file set type to which the specific temporary data file belongs. When at least two matching triplets in the topic matching set of the specific temporary data file share the largest matching level, one of them is selected at random, and the business topic and the file set type in the randomly selected matching triplet are taken as the business topic and the file set type to which the specific temporary data file belongs. Each strongly associated data file set has one associated (or attributed) business topic, as does each weakly associated data file set.
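Choosing the destination in step 109 is an arg-max over matching levels with a random tie-break. The sketch below models triplets as plain tuples and accepts an optional random source so that the choice can be made reproducible; both are assumptions for illustration, as is the function name.

```python
import random

def choose_destination(matching_set, rng=random):
    """Pick the (topic, set_type) with the largest matching level, breaking ties at random."""
    best = max(level for _topic, _set_type, level in matching_set)
    candidates = [(topic, set_type)
                  for topic, set_type, level in matching_set if level == best]
    return rng.choice(candidates)

unique = [("sales", "strong", 2), ("sales", "weak", -1), ("fraud", "weak", 0)]
tied = [("sales", "strong", 1), ("fraud", "weak", 1), ("hr", "weak", 0)]
destination = choose_destination(unique)
tie_destination = choose_destination(tied, rng=random.Random(0))
```

With a unique maximum the result is deterministic; with a tie the result is one of the tied triplets, chosen by the supplied random source.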
FIG. 2 is a schematic diagram of a storage area of the big data storage system of the present invention. The storage area 200 includes: a base storage area 201, a temporary storage area 202, a strongly-associated storage area 203, and a weakly-associated storage area 204. The base storage area 201 is used to store new data files received by the large data storage system via the interface device. The strong-association storage area 203 is used for storing a plurality of existing strong-association data file sets with strong association degree with any one business theme. The weak association storage area 204 is used for storing a plurality of existing weak association data file sets with weak association degree with any one business theme. The temporary storage area 202 is used for temporarily storing data files subjected to the association test.
FIG. 3 is a schematic diagram of a system 300 for associative storage of data files in a large data storage system according to the present invention. The system 300 includes: parsing device 301, extracting device 302, dividing device 303, recording device 304, computing device 305, processing device 306, marking device 307, updating device 308, and mobile device 309.
The parsing device 301 determines a plurality of business topics involved in a plurality of data services that perform data analysis on data files within the big data storage system.
Specifically, the description information of each of a plurality of data services that performed data analysis on data files in the big data storage system within a first predetermined time interval is obtained; the description information of each data service is parsed to determine the data topic of that data service, thereby obtaining a plurality of data topics; identical data topics among the plurality of data topics are merged and counted to determine the number of occurrences of each data topic; and the data topics whose number of occurrences exceeds a count threshold are selected as business topics, thereby determining the plurality of business topics.
The first predetermined time interval is 5 calendar days, 10 calendar days, 20 calendar days, 30 calendar days, 50 calendar days, 80 calendar days, 100 calendar days, or 200 calendar days. The first predetermined time interval is a period whose end date is the calendar day immediately preceding the calendar day of the current time and whose start date is an earlier calendar day; the interval comprises a whole number of calendar days, for example at least 5. The big data storage system stores a plurality of data files and divides the stored data files into a plurality of data file sets, each data file set having a business topic with which it is associated or to which it belongs.
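As a rough illustration of how such a window could be computed, the sketch below derives the start and end dates from the current date. The function name `first_interval` and the choice to count the end date as part of the interval are assumptions made for this example; the description does not state them.

```python
from datetime import date, timedelta


def first_interval(today, num_days):
    """Return (start, end) for a window of `num_days` whole days ending yesterday.

    Assumption: both start and end dates are counted as part of the window.
    """
    if num_days < 1:
        raise ValueError("num_days must be a positive integer")
    end = today - timedelta(days=1)              # the day before the current day
    start = end - timedelta(days=num_days - 1)   # inclusive start date
    return start, end


# A 5-day window evaluated on 2019-08-30 covers 2019-08-25 through 2019-08-29.
start, end = first_interval(date(2019, 8, 30), 5)
```

Only data services whose analysis ran between `start` and `end` would then contribute description information to the topic statistics.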
The data service is a service initiated by the service request device for data analysis for a plurality of data files in the big data storage system. Each data service has description information, and the description information of the data service includes: an identifier of the service request device, a network address of the service request device, a topic name of the data service, and a service domain of the data service.
Parsing the description information of each data service to determine the data topic of each data service includes: parsing the description information of each data service to obtain the topic name and the service domain of that data service, and combining the topic name and the service domain to determine the data topic of the data service.
At least one data topic occurs more than once among the plurality of data topics. Merging and counting identical data topics among the plurality of data topics to determine the number of occurrences of each data topic includes: merging and counting identical data topics to determine the number of occurrences of each distinct data topic. The count threshold is 2, 3, 5, 8, 10, 15, or 20 occurrences.
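The counting step can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary keys, the `domain:name` way of combining the two fields, and the function names are hypothetical, since the description says only that the topic name and the service domain are combined.

```python
from collections import Counter


def data_topic(description):
    """Combine service domain and topic name into one data topic.

    Assumption: the 'domain:name' string form is illustrative only.
    """
    return f"{description['service_domain']}:{description['topic_name']}"


def select_business_topics(descriptions, count_threshold):
    """Merge identical data topics, count them, keep those above the threshold."""
    counts = Counter(data_topic(d) for d in descriptions)
    # "Exceeds" is read here as strictly greater than the threshold.
    return sorted(topic for topic, n in counts.items() if n > count_threshold)


descriptions = [
    {"device_id": "dev-1", "address": "10.0.0.1", "topic_name": "churn", "service_domain": "telecom"},
    {"device_id": "dev-2", "address": "10.0.0.2", "topic_name": "churn", "service_domain": "telecom"},
    {"device_id": "dev-1", "address": "10.0.0.1", "topic_name": "churn", "service_domain": "telecom"},
    {"device_id": "dev-3", "address": "10.0.0.3", "topic_name": "fraud", "service_domain": "finance"},
]
topics = select_business_topics(descriptions, count_threshold=2)
```

With a count threshold of 2, only the data topic requested three times becomes a business topic; the one requested once is dropped.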
The extraction device 302 performs semantic extraction on each of the plurality of business topics to determine a plurality of tag items associated with each business topic. This includes: performing semantic extraction on each business topic to obtain a plurality of keywords capable of describing that business topic, and taking those keywords as the plurality of tag items associated with that business topic.
The dividing device 303 logically divides the storage area of the large data storage system into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area.
In this logical division, the base storage area is used to store new data files received by the big data storage system through an interface device; the strongly-associated storage area is used to store a plurality of existing strongly associated data file sets, each having a strong degree of association with one of the business topics; the weakly-associated storage area is used to store a plurality of existing weakly associated data file sets, each having a weak degree of association with one of the business topics; and the temporary storage area is used to temporarily store data files undergoing the association test.
Before the storage area of the big data storage system is logically divided into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area, the following is further performed: determining whether the storage area of the big data storage system has already been logically divided into a base storage area, a strongly-associated storage area, a weakly-associated storage area, and a temporary storage area; if not, logically dividing the storage area of the big data storage system into these four areas; if so, performing no processing.
The big data storage system receives new data files from a data source device through the interface device. Among the plurality of data file sets in the big data storage system, a data file set having a strong degree of association with one of the business topics is taken as a strongly associated data file set and is stored in the strongly-associated storage area. A data file set having a weak degree of association with one of the business topics is taken as a weakly associated data file set and is stored in the weakly-associated storage area.
In a strongly associated data file set having a strong degree of association with a first business topic, the degree of association between every data file in the set and the first business topic is a strong degree of association, that is, a degree of association greater than the file association threshold. In a weakly associated data file set having a weak degree of association with the first business topic, the degree of association between every data file in the set and the first business topic is a weak degree of association, that is, a degree of association less than or equal to the file association threshold. The file association threshold is 50%, 60%, 70%, or 80%.
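A minimal sketch of this strong/weak distinction follows. The function name, the string labels `"strong"` and `"weak"`, and the handling of empty or mixed sets (returning `None`) are assumptions for illustration; the description does not say what happens when a set contains files on both sides of the threshold.

```python
from typing import Optional, Sequence


def classify_file_set(associations: Sequence[float],
                      file_threshold: float = 0.6) -> Optional[str]:
    """Classify a data file set from its files' association with one topic.

    associations: per-file degree of association in [0, 1].
    Returns "strong" if every file is above the threshold, "weak" if every
    file is at or below it, and None otherwise (a case the text leaves open).
    """
    if not associations:
        return None
    if all(a > file_threshold for a in associations):
        return "strong"
    if all(a <= file_threshold for a in associations):
        return "weak"
    return None
```

Note that a file whose degree of association equals the threshold exactly counts as weakly associated, because only values strictly greater than the threshold are strong.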
The recording device 304 stores each new data file received by the big data storage system through the interface device in the base storage area and records the start storage time of each new data file, generates summary data for each new data file, and uses each new data file stored in the base storage area as a base data file. Generating summary data for each new data file includes: summarizing the file content of each new data file to generate the summary data for that data file.
The computing device 305 forms the business topic set of a base data file from the business topics whose degree of association with the base data file is greater than or equal to the topic association threshold, moves the base data file to the temporary storage area, and uses it as a temporary data file.
Specifically, when the accumulated storage time of the base data file in the base storage area reaches a second predetermined time interval, an association calculation is performed between the summary data of the base data file and the plurality of tag items of each business topic to determine the degree of association between the base data file and each business topic; the business topics whose degree of association with the base data file is greater than or equal to the topic association threshold form the business topic set of the base data file; and when the business topic set is not empty, the base data file is moved to the temporary storage area and the moved base data file is used as a temporary data file.
The second predetermined time interval is 10 hours, 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, or 720 hours.
Performing the association calculation between the summary data of a base data file and the plurality of tag items of each business topic to determine the degree of association between the base data file and each business topic comprises: determining, based on semantic matching, keyword matching, or text matching, the degree of association between the summary data of the base data file and each of the plurality of tag items of each business topic, and taking the average of these per-tag degrees of association as the degree of association between the base data file and that business topic.
The topic association threshold is 50%, 60%, 70%, 80%, or 90%. When the business topic set of the base data file is empty, the start storage time of the base data file is reset to the current time.
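The averaging and thresholding described above might look like the sketch below. It assumes the simplest of the listed options, keyword matching, scored as 1.0 when a tag occurs in the summary and 0.0 otherwise; that scoring rule, the dictionary layout of a file record, and all function names are hypothetical choices made for this example.

```python
from statistics import mean


def tag_association(summary, tag):
    """Keyword matching (assumed scoring): 1.0 if the tag occurs, else 0.0."""
    return 1.0 if tag.lower() in summary.lower() else 0.0


def topic_association(summary, tags):
    """Average of the per-tag degrees of association for one business topic."""
    return mean([tag_association(summary, tag) for tag in tags])


def business_topic_set(summary, topic_tags, topic_threshold=0.6):
    """Business topics whose association with the file reaches the threshold."""
    return {topic for topic, tags in topic_tags.items()
            if tags and topic_association(summary, tags) >= topic_threshold}


def review_base_file(base_file, topic_tags, now, topic_threshold=0.6):
    """Move the file to the temporary area, or restart its waiting period."""
    topics = business_topic_set(base_file["summary"], topic_tags, topic_threshold)
    if topics:
        base_file["area"] = "temporary"   # now treated as a temporary data file
        base_file["topics"] = topics
    else:
        base_file["start_time"] = now     # empty set: reset start storage time
    return topics
```

For example, a summary reading "Quarterly sales revenue by region" matches all three tags of a topic tagged `sales`, `revenue`, `region` (average 1.0) but only one of three tags of a topic tagged `shipping`, `warehouse`, `region` (average about 0.33), so only the first topic enters the business topic set at a 60% threshold.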
Processing device 306 determines a set of strong and/or weak associated data files associated with each business topic associated with the temporary data file in the strong and/or weak associated storage area and generates a set of topic matches for the temporary data file.
Specifically, the business topic set of each temporary data file moved into the temporary storage area is parsed to determine at least one business topic associated with the temporary data file; the strongly associated data file set and/or weakly associated data file set associated with each of the at least one business topic is determined in the strongly-associated storage area and/or the weakly-associated storage area; and a topic matching set of the temporary data file is generated. The topic matching set comprises at least one matching triplet, each in the format <business topic, file set type, matching level>, where the file set type indicates a strongly associated or a weakly associated data file set and the initial value of the matching level is 0.
The set of business topics for each temporary data file moved into the temporary storage area includes at least one business topic. Determining a set of strongly-associated data files and/or a set of weakly-associated data files associated with each of the at least one business topic in the strongly-associated storage area and/or the weakly-associated storage area and generating a topic-matched set of temporary data files comprises:
retrieving in the strong association storage area and/or the weak association storage area based on each of the at least one business topic to determine a strong association data file set and/or a weak association data file set associated with each of the at least one business topic and generating a topic matching set of temporary data files.
Each business topic has one or no strongly associated data file set; each business topic has one or no weakly associated data file set. The same business topic may have a strongly associated data file set and/or a weakly associated data file set.
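The construction of a topic matching set can be sketched as follows. The class name `MatchingTriple`, the labels `"strong"` and `"weak"` for the file set type, and the representation of the two storage areas as sets of topic names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatchingTriple:
    business_topic: str
    file_set_type: str       # assumed labels: "strong" or "weak"
    matching_level: int = 0  # initial value is 0


def build_matching_set(topics, strong_topics, weak_topics):
    """Create one triple per existing data file set of each business topic.

    strong_topics / weak_topics: topics that currently have a strongly or
    weakly associated data file set (each topic has at most one of each).
    """
    triples = []
    for topic in sorted(topics):
        if topic in strong_topics:
            triples.append(MatchingTriple(topic, "strong"))
        if topic in weak_topics:
            triples.append(MatchingTriple(topic, "weak"))
    return triples
```

A topic that has both kinds of data file set yields two triples, and a topic that has neither yields none.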
The marking device 307 copies the temporary data file to the strong and/or weak associated data file set of the corresponding service theme in the theme matching set and marks the temporary data file as a recommended file, and obtains feedback information of the service request device on the recommended file in the strong and/or weak associated data file set of the corresponding service theme.
The temporary data file is copied into the strongly associated data file set and/or weakly associated data file set of the corresponding business topic in each matching triplet of the topic matching set and marked as a recommended file, and a correspondence between the recommended file and the temporary data file is established; when it is determined that a data service from the service request device uses the strongly associated data file set and/or weakly associated data file set of the corresponding business topic in the strongly-associated storage area and/or the weakly-associated storage area, feedback information from the service request device on the recommended file in that data file set is obtained.
Copying the temporary data file into a strong-association data file set and/or a weak-association data file set of a corresponding business topic in each matching triplet in a topic matching set and marking as a recommended file comprises determining each matching triplet in the topic matching set of the temporary data file, determining each target data file set of the temporary data file based on the business topic and the file set type in each matching triplet, copying the temporary data file into each target data file set and marking as a recommended file, wherein each target data file set can be a strong-association data file set or a weak-association data file set.
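One possible shape for the copy-and-mark step is sketched below. The identifier scheme for recommended copies, the dictionary-based file sets, and the function name `recommend` are hypothetical; the description requires only that each copy is marked as recommended and that its correspondence to the temporary data file is recorded.

```python
def recommend(temp_id, triples, file_sets, correspondence):
    """Copy a temporary file into every target set and mark it as recommended.

    triples: iterable of (business_topic, file_set_type, matching_level).
    file_sets: maps (business_topic, file_set_type) to a list of file entries.
    correspondence: maps recommended-file id back to the temporary file id.
    """
    for topic, set_type, _level in triples:
        rec_id = f"{temp_id}@{set_type}:{topic}"   # hypothetical id scheme
        entry = {"id": rec_id, "recommended": True}
        file_sets.setdefault((topic, set_type), []).append(entry)
        correspondence[rec_id] = temp_id


file_sets, correspondence = {}, {}
recommend("tmp-7", [("sales", "strong", 0), ("sales", "weak", 0)],
          file_sets, correspondence)
```

Keeping the correspondence map separate from the file sets lets later feedback on a recommended copy be traced back to the single temporary data file it came from.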
When a data service of the service request device uses a strongly associated data file set and/or weakly associated data file set in the big data storage system, the service request device determines feedback information for each recommended file in the data file sets involved in the data service, wherein the feedback information is one of: irrelevant, uncertain, or relevant.
The update device 308 determines a temporary data file associated with the received feedback information and processes a match rating in a topic match set of the associated temporary data file based on the received feedback information, updating the topic match set of the associated temporary data file with the determined match rating.
Specifically, a temporary data file associated with the received feedback information is determined based on the correspondence between the recommended file and the temporary data file, and the matching level of the matching triplet in the topic matching set of the associated temporary data file is processed based on the received feedback information, and the topic matching set of the associated temporary data file is updated with the determined matching level (processed matching level).
The feedback information is irrelevant, uncertain, or relevant. Processing the matching level of the matching triplet in the topic matching set of the associated temporary data file based on the received feedback information comprises: when the feedback information is irrelevant, subtracting 1 from the matching level in the corresponding matching triplet in the topic matching set of the associated temporary data file; when the feedback information is uncertain, keeping the matching level in the corresponding matching triplet unchanged; and when the feedback information is relevant, adding 1 to the matching level in the corresponding matching triplet.
Updating the topic matching set of the associated temporary data file with the determined matching level comprises: updating at least one matching triplet in the topic matching set of the associated temporary data file with the determined matching level.
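The feedback rule reduces to a small lookup table. In this sketch the topic matching set is held as a dictionary keyed by (business topic, file set type); that representation, the string labels, and the function name are assumptions for illustration.

```python
# Change applied to the matching level for each kind of feedback.
FEEDBACK_DELTA = {"irrelevant": -1, "uncertain": 0, "relevant": +1}


def apply_feedback(matching_set, topic, set_type, feedback):
    """Update one triple's matching level in place and return the new level.

    matching_set: maps (business_topic, file_set_type) to matching_level.
    """
    matching_set[(topic, set_type)] += FEEDBACK_DELTA[feedback]
    return matching_set[(topic, set_type)]


matching_set = {("sales", "strong"): 0, ("sales", "weak"): 0}
apply_feedback(matching_set, "sales", "strong", "relevant")
apply_feedback(matching_set, "sales", "strong", "relevant")
apply_feedback(matching_set, "sales", "weak", "irrelevant")
apply_feedback(matching_set, "sales", "weak", "uncertain")
```

Because the initial level is 0 and irrelevant feedback subtracts 1, matching levels can become negative.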
The mobile device 309 determines the business topic and the file set type to which a specific temporary data file belongs, the specific temporary data file being one whose storage time in the temporary storage area has reached the time threshold, and moves the specific temporary data file to the strongly or weakly associated data file set, in the strongly-associated or weakly-associated storage area, that corresponds to that business topic and file set type.
Specifically, when the time for which the specific temporary data file has been stored in the temporary storage area reaches the time threshold, the business topic and the file set type to which the specific temporary data file belongs are determined according to the topic matching set of the specific temporary data file, and the specific temporary data file is moved to the strongly associated data file set or weakly associated data file set, in the strongly-associated storage area or the weakly-associated storage area, that corresponds to that business topic and file set type.
The time threshold is 20 hours, 50 hours, 100 hours, 150 hours, 300 hours, 720 hours, or 1000 hours. Determining the business topic and the file set type to which the specific temporary data file belongs according to the topic matching set of the specific temporary data file comprises: determining the matching level in each matching triplet in the topic matching set of the specific temporary data file, and taking the business topic and the file set type in the matching triplet with the highest matching level as the business topic and the file set type to which the specific temporary data file belongs; when at least two matching triplets in the topic matching set of the specific temporary data file share the highest matching level, one of them is selected at random, and the business topic and the file set type in the randomly selected matching triplet are taken as the business topic and the file set type to which the specific temporary data file belongs. Each strongly associated data file set has one business topic with which it is associated or to which it belongs. Each weakly associated data file set likewise has one business topic with which it is associated or to which it belongs.
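The final selection, including the random tie-break, can be sketched as follows. The tuple layout and the function name `select_target` are assumptions; the optional `rng` parameter is added only so that the random choice can be made reproducible in testing.

```python
import random


def select_target(matching_set, rng=random):
    """Pick the (business_topic, file_set_type) with the highest matching level.

    matching_set: non-empty list of (business_topic, file_set_type, level).
    Ties at the highest level are broken by a uniformly random choice.
    """
    best = max(level for _topic, _type, level in matching_set)
    candidates = [(topic, set_type)
                  for topic, set_type, level in matching_set if level == best]
    return rng.choice(candidates)


target = select_target([("sales", "strong", 2), ("sales", "weak", -1),
                        ("fraud", "strong", 0)])
```

The temporary data file would then be moved into the data file set identified by `target`, here the strongly associated set of the `sales` topic.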

Claims (10)

1. A method of associative storage of data files in a large data storage system, the method comprising:
determining a plurality of business topics related to a plurality of data businesses for carrying out data analysis on data files in a big data storage system;
semantic extraction is performed on each of the plurality of business topics to determine a plurality of tag items associated with each business topic;
logically dividing a storage area of a big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;
storing each new data file received by the big data storage system through the interface device in the basic storage area and recording the initial storage time of each new data file, generating summary data for each new data file and using each new data file stored in the basic storage area as a basic data file;
forming a business topic set of the basic data file by the business topics with the association degree with the basic data file being greater than or equal to the topic association degree threshold, and moving the basic data file to the temporary storage area and using the basic data file as a temporary data file;
determining a strong and/or weak association data file set associated with each business topic associated with the temporary data file in the strong and/or weak association storage area and generating a topic matching set of the temporary data file; wherein a data file set having a strong degree of association with any one business topic among a plurality of data file sets in the big data storage system is taken as a strong association data file set and is stored in the strong association storage area; a data file set having a weak degree of association with any one business topic among the plurality of data file sets in the big data storage system is taken as a weak association data file set and is stored in the weak association storage area; the topic matching set comprises at least one matching triplet, each matching triplet having the format <business topic, file set type, matching level>, wherein the file set type indicates a strong association data file set or a weak association data file set, and the initial value of the matching level is 0;
copying the temporary data file into the strong and/or weak association data file sets of the corresponding business topics in the topic matching set and marking it as a recommended file, and acquiring feedback information from the service request device on the recommended file in the strong and/or weak association data file sets of the corresponding business topics;
determining a temporary data file associated with the received feedback information, processing the matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updating the topic matching set of the associated temporary data file with the determined matching level;
and determining the business topic and the file set type to which a specific temporary data file belongs, the specific temporary data file being one whose storage time in the temporary storage area has reached the time threshold, and moving the specific temporary data file to the strong or weak association data file set, in the strong or weak association storage area, that corresponds to the business topic and file set type to which it belongs.
2. The method of claim 1, the large data storage system storing a plurality of data files and dividing the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic.
3. The method of claim 1, the semantically extracting each of the plurality of business topics to determine a plurality of tag items associated with each business topic comprising:
and carrying out semantic extraction on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and determining the plurality of keywords capable of describing each business topic as a plurality of tag items associated with each business topic.
4. The method of claim 1, further comprising, prior to logically dividing the storage area of the big data storage system into the base storage area, the strongly-associated storage area, the weakly-associated storage area, and the temporary storage area:
determining whether a storage area of the big data storage system is logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area, and if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, no processing is performed.
5. The method of any of claims 1-4, generating summary data for each new data file comprising: the file content of each new data file is summarized to generate summary data for each new data file.
6. A system for associative storage of data files in a large data storage system, the system comprising:
a parsing device for determining a plurality of business topics related to a plurality of data services that perform data analysis on the data files in the big data storage system;
an extraction device for performing semantic extraction on each business topic in the plurality of business topics to determine a plurality of tag items associated with each business topic;
a dividing device for logically dividing a storage area of the big data storage system into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area;
a recording device for storing each new data file received by the big data storage system through the interface device in the basic storage area and recording the start storage time of each new data file, generating summary data for each new data file and using each new data file stored in the basic storage area as a basic data file;
a computing device for forming a business topic set of the basic data file from the business topics whose degree of association with the basic data file is greater than or equal to the topic association threshold, and moving the basic data file to the temporary storage area and using it as a temporary data file;
a processing device for determining a strong and/or weak association data file set associated with each business topic associated with the temporary data file in the strong and/or weak association storage area and generating a topic matching set of the temporary data file; wherein a data file set having a strong degree of association with any one business topic among a plurality of data file sets in the big data storage system is taken as a strong association data file set and is stored in the strong association storage area; a data file set having a weak degree of association with any one business topic among the plurality of data file sets in the big data storage system is taken as a weak association data file set and is stored in the weak association storage area; the topic matching set comprises at least one matching triplet, each matching triplet having the format <business topic, file set type, matching level>, wherein the file set type indicates a strong association data file set or a weak association data file set, and the initial value of the matching level is 0;
a marking device for copying the temporary data file into the strong and/or weak association data file sets of the corresponding business topics in the topic matching set and marking it as a recommended file, and obtaining feedback information from the service request device on the recommended file in the strong and/or weak association data file sets of the corresponding business topics;
an updating device for determining a temporary data file associated with the received feedback information, processing the matching level in the topic matching set of the associated temporary data file based on the received feedback information, and updating the topic matching set of the associated temporary data file with the determined matching level;
and a mobile device for determining the business topic and the file set type to which a specific temporary data file belongs, the specific temporary data file being one whose storage time in the temporary storage area has reached the time threshold, and moving the specific temporary data file into the strong or weak association data file set, in the strong or weak association storage area, that corresponds to the business topic and file set type to which it belongs.
7. The system of claim 1, the large data storage system storing a plurality of data files and dividing the stored plurality of data files into a plurality of data file sets, each data file set having an associated or attributed business topic.
8. The system of claim 1, the semantic extraction of each of the plurality of business topics to determine a plurality of tag items associated with each business topic comprising:
carrying out semantic extraction on each business topic in the plurality of business topics to obtain a plurality of keywords capable of describing each business topic, and determining the plurality of keywords capable of describing each business topic as a plurality of tag items associated with each business topic.
9. The system of claim 1, further comprising, prior to logically dividing the storage area of the big data storage system into the base storage area, the strongly-associated storage area, the weakly-associated storage area, and the temporary storage area:
determining whether a storage area of the big data storage system is logically divided into a basic storage area, a strong-association storage area, a weak-association storage area and a temporary storage area, and if not, logically dividing the storage area of the big data storage system into the basic storage area, the strong-association storage area, the weak-association storage area and the temporary storage area; if so, no processing is performed.
10. The system of claim 1, generating summary data for each new data file comprising: the file content of each new data file is summarized to generate summary data for each new data file.
CN201910810845.XA 2019-08-30 2019-08-30 Method and system for carrying out associated storage on data files in big data storage system Active CN110515895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910810845.XA CN110515895B (en) 2019-08-30 2019-08-30 Method and system for carrying out associated storage on data files in big data storage system

Publications (2)

Publication Number Publication Date
CN110515895A CN110515895A (en) 2019-11-29
CN110515895B true CN110515895B (en) 2023-06-23

Family

ID=68629305

Country Status (1)

Country Link
CN (1) CN110515895B (en)

Also Published As

Publication number Publication date
CN110515895A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
US9792340B2 (en) Identifying data items
CN110515895B (en) Method and system for carrying out associated storage on data files in big data storage system
US8612444B2 (en) Data classifier
US8095547B2 (en) Method and apparatus for detecting spam user created content
US20100281005A1 (en) Asynchronous Database Index Maintenance
KR101609088B1 (en) Media identification system with fingerprint database balanced according to search loads
US10430448B2 (en) Computer-implemented method of and system for searching an inverted index having a plurality of posting lists
CN110309251B (en) Text data processing method, device and computer readable storage medium
US20140351273A1 (en) System and method for searching information
JP2012198832A (en) Duplicate file detection device
US20150206101A1 (en) System for determining infringement of copyright based on the text reference point and method thereof
CN111899822B (en) Medical institution database construction method, query method, device, equipment and medium
CN109947730B (en) Metadata recovery method, device, distributed file system and readable storage medium
CN109271545A (en) Characteristic key method and device, storage medium and computer equipment
CN107590233A (en) File management method and device
CN103530311A (en) Method and apparatus for prioritizing metadata
CN111259017B (en) Order retrieval method, computer device, and storage medium
CN107169065B (en) Method and device for removing specific content
SalahEldeen et al. Reading the correct history? Modeling temporal intention in resource sharing
US20070219989A1 (en) Document retrieval
US8650209B1 (en) System, method, and computer program for determining most of the non duplicate records in high performance environments in an economical and fault-tolerant manner
KR101147508B1 (en) Apparatus and Method for recommending of search formula
CN116595065B (en) Content duplicate identification method, device, system and storage medium
CN114153830B (en) Data verification method and device, computer storage medium and electronic equipment
CN113660277B (en) Crawler-resisting method based on multiplexing embedded point information and processing terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230531

Address after: 100000 Courtyard 38 Yongtaizhuang North Road, Qinghe, Haidian District, Beijing

Applicant after: BEIJING YANSHAN ELECTRONIC EQUIPMENT FACTORY

Address before: 1-1-6-1, Ningxi Road, Huanggu District, Shenyang City, Liaoning Province, 110036

Applicant before: Mi Naibin

GR01 Patent grant