CN116737884B - Method and system for unstructured data full life cycle management - Google Patents

Method and system for unstructured data full life cycle management

Info

Publication number
CN116737884B
Authority
CN
China
Prior art keywords
data
management
document
unstructured data
graphic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311031481.8A
Other languages
Chinese (zh)
Other versions
CN116737884A (en)
Inventor
马欣
于飞
徐旭
章欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beiming Mingrun Beijing Technology Co ltd
Original Assignee
Beiming Mingrun Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beiming Mingrun Beijing Technology Co ltd
Priority to CN202311031481.8A
Publication of CN116737884A
Application granted
Publication of CN116737884B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Manufacturing & Machinery (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a system for full life cycle management of unstructured data, relating to the technical field of data management. The method collects the unstructured data of a designated unit, extracts a graphic data set and a document data set, determines a graphic fitness set and a document fitness set with reference to unstructured data standards, and calculates an unstructured data fitness set. It then analyzes the local outliers of the plurality of unstructured data, calculates a plurality of management parameters using a data management function, and obtains a plurality of management schemes by decision-making to manage the unstructured data set. This solves the technical problem that the prior art cannot adapt to the orderly, coordinated management of fluctuating multi-source data, which results in a poor data management effect. By collecting the unstructured data, extracting the document data and the graphic data, processing each with a targeted processing mode, and configuring a matching management scheme based on the determined management parameters, the data can be managed effectively over its full life cycle.

Description

Method and system for unstructured data full life cycle management
Technical Field
The application relates to the technical field of data management, in particular to a method and a system for unstructured data full life cycle management.
Background
In factory operation, derivative data are updated in step with each iteration of a process. The resulting multi-source data, asynchronous data, and the like make data management difficult. At present, to ensure the data management effect, unstructured data are mainly processed and managed through database management or system management, based on a specific query language or tool. This approach has certain technical limitations: it cannot adapt to the orderly, coordinated management of fluctuating multi-source data, so the data management effect is poor.
Disclosure of Invention
The application provides a method and a system for unstructured data full life cycle management, which solve the technical problem that the prior art cannot adapt to the orderly, coordinated management of fluctuating multi-source data, resulting in a poor data management effect.
In view of the foregoing, the present application provides a method and system for unstructured data full lifecycle management.
In a first aspect, the present application provides a method for unstructured data full lifecycle management, the method comprising:
connecting the unstructured data management device to a database of a designated unit through an interface, and collecting all unstructured data to be managed in the designated unit to obtain an unstructured data set;
extracting graphic data and document data from the unstructured data set through the graphic extractor and the document extractor, and integrating them to obtain a graphic data set and a document data set;
obtaining the unstructured data standard of the designated unit, wherein the unstructured data standard comprises a graphic standard and a document standard, and the graphic standard comprises a mapping relation between a plurality of graphics and a plurality of fitness values;
in the processor, performing fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, and calculating an unstructured data fitness set;
analyzing, according to the unstructured data fitness set, the local outliers of a plurality of unstructured data in the unstructured data set to obtain a plurality of local outliers;
substituting a plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into a data management function for calculation to obtain a plurality of management parameters;
and obtaining a plurality of management schemes by decision-making according to the plurality of management parameters, and managing the unstructured data set with the plurality of management schemes.
In a second aspect, the present application provides a system for unstructured data full lifecycle management, the system comprising:
the data statistics module, used for connecting the unstructured data management device to a database of a designated unit through an interface, and collecting all unstructured data to be managed in the designated unit to obtain an unstructured data set;
the data extraction module, used for extracting the graphic data and the document data from the unstructured data set through the graphic extractor and the document extractor, and integrating them to obtain a graphic data set and a document data set;
the data standard acquisition module, used for obtaining the unstructured data standard of the designated unit, wherein the unstructured data standard comprises a graphic standard and a document standard, and the graphic standard comprises a mapping relation between a plurality of graphics and a plurality of fitness values;
the fitness analysis module, used for performing, in the processor, fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, and calculating an unstructured data fitness set;
the local outlier analysis module, used for analyzing, according to the unstructured data fitness set, the local outliers of a plurality of unstructured data in the unstructured data set to obtain a plurality of local outliers;
the management parameter calculation module, used for substituting a plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into a data management function for calculation to obtain a plurality of management parameters;
and the data management module, used for obtaining a plurality of management schemes by decision-making according to the plurality of management parameters, and managing the unstructured data set with the plurality of management schemes.
One or more technical schemes provided by the application have at least the following technical effects or advantages:
In the method for unstructured data full life cycle management provided by the embodiment of the application, the unstructured data management device is connected to a database of a designated unit through an interface, and all unstructured data to be managed in the designated unit are collected. Graphic data and document data are extracted through the graphic extractor and the document extractor and integrated to obtain a graphic data set and a document data set. The unstructured data standard of the designated unit, comprising a graphic standard and a document standard, is obtained. In the processor, fitness analysis is performed on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, from which an unstructured data fitness set is calculated. The local outliers of the plurality of unstructured data in the unstructured data set are then analyzed to obtain a plurality of local outliers, a plurality of management parameters are calculated with the data management function, and a plurality of management schemes are obtained by decision-making to manage the unstructured data set. This solves the technical problem that the prior art cannot adapt to the orderly, coordinated management of fluctuating multi-source data, which results in a poor data management effect. By collecting the unstructured data, extracting the document data and the graphic data, processing each with a targeted processing mode, and configuring a matching management scheme based on the determined management parameters, the data can be managed effectively over its full life cycle.
Drawings
FIG. 1 is a flow chart of a method for unstructured data full lifecycle management;
FIG. 2 is a schematic flow chart of obtaining the unstructured data fitness set in a method for unstructured data full lifecycle management;
FIG. 3 is a schematic flow chart of obtaining the plurality of management schemes in a method for unstructured data full lifecycle management;
FIG. 4 is a schematic diagram of a system architecture for unstructured data full lifecycle management.
Reference numerals: data statistics module 11, data extraction module 12, data standard acquisition module 13, fitness analysis module 14, local outlier analysis module 15, management parameter calculation module 16, data management module 17.
Detailed Description
The application provides a method and a system for full life cycle management of unstructured data. The method collects the unstructured data of a designated unit, extracts a graphic data set and a document data set, determines a graphic fitness set and a document fitness set with reference to unstructured data standards, and calculates an unstructured data fitness set. It then analyzes the local outliers of the plurality of unstructured data, calculates a plurality of management parameters using a data management function, and obtains a plurality of management schemes by decision-making to manage the unstructured data set. This solves the technical problem that the prior art cannot adapt to the orderly, coordinated management of fluctuating multi-source data, which results in a poor data management effect.
Example 1
As shown in FIG. 1, the present application provides a method for unstructured data full life cycle management. The method is applied to an unstructured data management device comprising a document extractor, a graphic extractor, and a processor, and comprises:
S10: connecting the unstructured data management device to a database of a designated unit through an interface, and collecting all unstructured data to be managed in the designated unit to obtain an unstructured data set;
In factory operation, derivative data are updated in step with each iteration of a process, and the resulting multi-source data, asynchronous data, and the like make data management difficult. The method collects the unstructured data, extracts the document data and the graphic data, processes each with a targeted processing mode, and configures a matching management scheme based on the determined management parameters, so that the data are managed effectively over their full life cycle.
The interface connects the data end with the management end, and the unstructured data management device is connected to the database of the designated unit through this interface. Specifically, the application is used for managing the data associated with the full life cycle (updating, backup, retrieval, retirement, and the like) of operation standard books in a factory. As a process is iterated, the operation standard book of a product is updated and replaced step by step, so heterogeneous multi-source data must be managed, for example data to be maintained, destroyed, stored pending modification, or stored pending distribution. An operation standard book generally contains text data and picture data. The database of the designated unit is the storage database for the operation standard books of the factory whose data are to be managed. Using the connected unstructured data management device, all unstructured data to be managed in the designated unit, that is, data without a complete data structure rule, are collected and integrated to obtain the unstructured data set, which is the data source to be managed.
S20: extracting graphic data and document data from the unstructured data set through the graphic extractor and the document extractor, and integrating them to obtain a graphic data set and a document data set;
Specifically, the graphic extractor and the document extractor are tools for extracting graphics and text content. With the graphic extractor, the graphic data in the unstructured data set are identified, extracted, and integrated according to the item they belong to, which yields the graphic data set. With the document extractor, the document data in the unstructured data set are identified, extracted, and integrated in the same way, which yields the document data set. A targeted processing mode is then configured for each data format for the subsequent analysis.
S30: obtaining the unstructured data standard of the designated unit, wherein the unstructured data standard comprises a graphic standard and a document standard, and the graphic standard comprises a mapping relation between a plurality of graphics and a plurality of fitness values;
Further, step S30 of the method further comprises:
S31: assigning fitness to a plurality of graphic data in the graphic data set to obtain a plurality of graphic fitness values;
S32: obtaining a plurality of project document sets of a plurality of job projects from the unstructured data in the designated unit, and assigning fitness to the plurality of project document sets to obtain a plurality of document fitness sets;
S33: performing keyword splitting and one-hot encoding on the plurality of project document sets to construct a document encoding bag of words;
S34: mapping the plurality of graphic data to the plurality of graphic fitness values to obtain the graphic standard, and mapping the document encoding bag of words to the plurality of document fitness sets to obtain the document standard.
Specifically, the plurality of graphic data in the graphic data set are traversed and each is assigned a fitness. Illustratively, the fitness is assigned using the time elapsed up to the current time node as a constraint: the older the graphic data, the smaller the corresponding fitness. Analyzing the graphic data one by one yields the plurality of graphic fitness values. Further, based on the unstructured data in the designated unit, a plurality of job projects are extracted and the documents are subdivided accordingly. An operation standard book generally contains the operation standard contents of several stations; separating the documents and processing them individually effectively improves the granularity and accuracy of the processing results and yields the plurality of project document sets corresponding to the plurality of job projects. Fitness is then assigned to each of the plurality of project document sets. The assignment may follow the same rule as for the graphic data, that is, the earlier a document was generated, the lower its assigned fitness, which yields the plurality of document fitness sets.
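As a non-limiting illustration of the time-based assignment described above, the following sketch gives an item a fitness that shrinks as the item gets older. The exponential form, the `half_life_days` parameter, and the function name `allocate_fitness` are assumptions made for illustration; the description only requires that older data receive a smaller fitness.

```python
from datetime import datetime


def allocate_fitness(created: datetime, now: datetime,
                     half_life_days: float = 180.0) -> float:
    """Return a fitness in (0, 1] that decreases with the age of the item.

    Hypothetical rule: the fitness halves every `half_life_days` days.
    """
    age_days = max((now - created).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)


# Example: a one-month-old drawing versus a two-year-old drawing.
now = datetime(2023, 8, 1)
recent = allocate_fitness(datetime(2023, 7, 1), now)
old = allocate_fitness(datetime(2021, 7, 1), now)
```

Any other monotonically decreasing rule (for example, linear decay with a floor) would satisfy the same constraint.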
Further, each project document in the plurality of project document sets is split into keywords according to sentence structure, part of speech, and the like, and each keyword is one-hot encoded. Specifically, for the keywords split from each project document, the number of keywords is taken as the number of code digits, each keyword is assigned a unique, fixed code position, the element at that position is marked 1, and the elements at all other positions are marked 0. The discrete keywords extracted from each project document are thereby converted into numeric vector codes, which makes subsequent processing simpler and more orderly. The resulting codes are grouped by the document they belong to and integrated to obtain the document encoding bag of words.
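The one-hot encoding step described above can be sketched as follows. The tokenizer is a deliberately crude assumption (lowercase alphanumeric tokens); the description splits keywords by sentence structure and part of speech, which is not modeled here. All function names are hypothetical.

```python
import re


def split_keywords(text: str) -> list[str]:
    """Crude keyword split (assumption): lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents: list[str]) -> dict[str, int]:
    """Give every distinct keyword a unique, fixed code position."""
    vocab: dict[str, int] = {}
    for doc in documents:
        for word in split_keywords(doc):
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word: str, vocab: dict[str, int]) -> list[int]:
    """Vector with 1 at the keyword's position and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


docs = ["tighten bolt M8", "inspect bolt torque"]
vocab = build_vocabulary(docs)
# Bag of words: the codes of each document, grouped by document index.
bag = {i: [one_hot(w, vocab) for w in split_keywords(d)]
       for i, d in enumerate(docs)}
```

Because the code length equals the vocabulary size, this representation is only practical for modest vocabularies; a sparse index-based form would carry the same information.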
The plurality of graphic data are mapped to and associated with the plurality of graphic fitness values to form the graphic standard, and the document encoding bag of words is mapped to the corresponding document fitness sets to form the document standard. The graphic standard and the document standard together serve as the unstructured data standard of the designated unit, that is, the standardized reference data used for the fitness mapping analysis of the unstructured data.
S40: in the processor, performing fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, and calculating an unstructured data fitness set;
Further, as shown in FIG. 2, performing fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard, step S40 of the application further comprises:
S41: training a graphic similarity identifier based on a Siamese (twin) network using the graphic data set;
S42: inputting a plurality of graphics in the graphic data set, together with the graphics in the graphic standard, into the graphic similarity identifier to obtain matched graphics, and mapping to obtain the graphic fitness set;
S43: performing keyword splitting and one-hot encoding on a plurality of documents in the document data set, and matching against the document encoding bag of words to obtain the document fitness set;
S44: calculating the unstructured data fitness set according to the graphic fitness set and the document fitness set.
Further, training the graphic similarity identifier based on the Siamese network using the graphic data set, step S41 of the application further comprises:
S411: constructing, based on the Siamese network, two graphic recognition networks with the same network architecture;
S412: constructing a loss function, wherein the loss function is represented by the following formula:
wherein LOSS is the loss; M is the number of graphic data pairs obtained by randomly combining the graphic data in the graphic data set two by two; P indicates whether the two input graphic data are of the same category, 0 meaning yes and 1 meaning no; X and Y are the two graphic data in the i-th input pair; and the two remaining terms are, respectively, the loss function when the two input graphic data are of the same category and the loss function when they are not;
S413: training the two graphic recognition networks according to the loss function, with shared network parameters, until a convergence condition is met, to obtain the graphic similarity identifier.
Based on the processor, that is, the functional unit that executes instructions, and using the graphic standard and the document standard as references, the Siamese network is used to match the graphic data against each graphic in the graphic standard by similarity and to determine the corresponding graphic fitness set. The document data are one-hot encoded and matched against the document standard by traversing the codes, and the fitness values mapped by the matching results are integrated as the document fitness set. The graphic fitness set and the document fitness set are then combined by weighted calculation to obtain the unstructured data fitness set.
Specifically, two graphic recognition networks with the same network structure are built as a Siamese network and configured in parallel, and a loss function is then constructed. In its expression, LOSS is the loss; M is the number of graphic data pairs obtained by randomly combining the graphic data in the graphic data set two by two; P indicates whether the two input graphic data are of the same category, 0 meaning yes and 1 meaning no; X and Y are the two graphic data in the i-th input pair; and the two remaining terms are, respectively, the loss function when the two input graphic data are of the same category and the loss function when they are not. These parameters can be determined from the preceding processing and mapping comparison in the embodiment of the application.
Based on the loss function and the graphic data set, the two graphic recognition networks are trained to perform relative loss analysis and similarity assessment of the graphic data being compared. The two networks share their network parameters throughout. Training stops when a convergence condition is met, for example when the training accuracy reaches a threshold, and the trained graphic similarity identifier is obtained.
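By way of illustration only, the sketch below evaluates a pair-based loss of the kind described above over M pairs. It uses the widely used contrastive loss as a stand-in: the squared distance for same-category pairs and a squared margin term for different-category pairs. These two per-pair terms, the `margin` parameter, and the averaging over M are assumptions and are not necessarily the loss function defined by the application. The embeddings stand for the outputs of the two parameter-sharing networks.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, margin=1.0):
    """Average pair loss over M pairs.

    pairs: list of (embedding_x, embedding_y, p), where p = 0 means the two
    graphics are of the same category and p = 1 means they are not.
    Assumed per-pair terms: same -> d^2, different -> max(margin - d, 0)^2.
    """
    total = 0.0
    for x, y, p in pairs:
        d = euclidean(x, y)
        same_term = d ** 2
        diff_term = max(margin - d, 0.0) ** 2
        total += (1 - p) * same_term + p * diff_term
    return total / len(pairs)


# Two well-behaved pairs: identical embeddings labeled "same",
# distant embeddings labeled "different".
good = [([0.0, 0.0], [0.0, 0.0], 0), ([0.0, 0.0], [3.0, 4.0], 1)]
# A badly embedded pair: distant embeddings labeled "same".
bad = [([0.0, 0.0], [3.0, 4.0], 0)]
```

With this form, minimizing the loss pulls same-category graphics together and pushes different-category graphics at least `margin` apart, which is what makes the trained networks usable as a similarity identifier.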
Further, the plurality of graphics in the graphic data set and the graphics in the graphic standard are input into the two graphic recognition networks of the graphic similarity identifier respectively. Similarity matching is performed through feature extraction, mapping, and loss analysis; the standard graphic that matches each graphic in the graphic data set is taken as its matched graphic; and the graphic fitness values mapped by the matched graphics form the graphic fitness set corresponding to the graphic data set.
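A minimal sketch of the matching step, assuming each graphic has already been turned into a feature vector by the trained networks: the matched graphic is taken to be the standard graphic at the smallest Euclidean distance, and its mapped fitness is returned. The nearest-neighbor criterion, the vectors, and the name `match_fitness` are illustrative assumptions.

```python
import math


def match_fitness(embedding, standard):
    """Return (index, fitness) of the nearest standard graphic.

    standard: list of (standard_embedding, fitness) entries, i.e. the
    graphic standard expressed as feature vectors (assumption).
    """
    best = min(range(len(standard)),
               key=lambda i: math.dist(embedding, standard[i][0]))
    return best, standard[best][1]


standard = [([0.0, 0.0], 0.9), ([1.0, 1.0], 0.4)]
graphic_fitness_set = [match_fitness(e, standard)[1]
                       for e in ([0.1, -0.1], [0.9, 1.1])]
```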
For the plurality of documents in the document data set, keyword splitting and one-hot encoding are performed on each document in the same way as described above, yielding the document codes corresponding to the plurality of documents. The document encoding bag of words is then traversed to find the matching code of each document code, and the document fitness values mapped by the matching codes form the document fitness set corresponding to the document data set.
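The traversal matching above can be sketched as follows. The description does not state the matching criterion, so this example assumes the best match is the reference document with the largest keyword overlap (Jaccard index); the whitespace tokenizer and all names are likewise hypothetical.

```python
def keyword_set(text):
    """Crude keyword split (assumption): lowercase whitespace tokens."""
    return set(text.lower().split())


def document_fitness(document, standard):
    """Return the fitness mapped by the best-matching reference document.

    standard: list of (reference_text, fitness) entries standing in for
    the document standard. Matching uses Jaccard overlap (assumption).
    """
    words = keyword_set(document)

    def jaccard(reference):
        ref_words = keyword_set(reference)
        union = words | ref_words
        return len(words & ref_words) / len(union) if union else 0.0

    _, best_fitness = max(standard, key=lambda item: jaccard(item[0]))
    return best_fitness


standard = [("tighten bolt to torque", 0.8), ("apply label to carton", 0.3)]
fitness = document_fitness("tighten the bolt", standard)
```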
Further, the graphic data set and the document data set are traversed and weights are configured; for example, the weights may be assigned according to the relative importance of the graphic data and the document data. For each operation standard book, the corresponding graphic fitness and document fitness are extracted and combined by weighted calculation, and the result is taken as the fitness of the corresponding unstructured data and added to the unstructured data fitness set.
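As a simple illustration of the weighted calculation, the sketch below combines one graphic fitness and one document fitness with complementary weights. Restricting the two weights to sum to 1 and the name `combine_fitness` are assumptions; the description leaves the weight assignment open.

```python
def combine_fitness(graphic_fitness, document_fitness, graphic_weight=0.5):
    """Weighted combination of the two fitness values of one standard book.

    Assumption: the document weight is 1 - graphic_weight.
    """
    if not 0.0 <= graphic_weight <= 1.0:
        raise ValueError("graphic_weight must lie in [0, 1]")
    return (graphic_weight * graphic_fitness
            + (1.0 - graphic_weight) * document_fitness)


# Fitness pairs (graphic, document) for three operation standard books.
pairs = [(0.8, 0.4), (0.9, 0.9), (0.2, 0.6)]
unstructured_fitness_set = [combine_fitness(g, d, 0.25) for g, d in pairs]
```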
S50: analyzing, according to the unstructured data fitness set, the local outliers of a plurality of unstructured data in the unstructured data set to obtain a plurality of local outliers;
Further, analyzing the local outliers of the plurality of unstructured data in the unstructured data set according to the unstructured data fitness set, step S50 of the application further comprises:
wherein the local outlier is calculated from the following quantities: K is the outlier coefficient of an unstructured data fitness, obtained by calculating the average of the distances between that fitness and its Q nearest unstructured data fitness values; and N is the number of unstructured data fitness values in the unstructured data fitness set.
Specifically, for the plurality of fitness values in the unstructured data fitness set, the pairwise differences are calculated and used as the distances between fitness values. A local outlier expression is then constructed, in which K is the outlier coefficient of an unstructured data fitness, obtained by calculating the average of the distances to its Q nearest unstructured data fitness values, and N is the number of fitness values in the unstructured data fitness set. These parameters can be obtained by statistics and calculation on known data.
The local outlier measures how far an unstructured data fitness lies from the other fitness values: the larger the difference, the larger the local outlier. Using the expression, the unstructured data fitness set is traversed and the local outlier of each fitness is calculated, yielding the plurality of local outliers. The local outlier is a measurement basis for determining the management parameters.
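The following sketch shows one way the quantities named above could be combined. The outlier coefficient K of each fitness is computed as described, as the mean distance to its Q nearest neighbors. The final step, dividing each K by the mean of all N coefficients, is an assumption chosen so that values above 1 indicate a fitness that is more isolated than average; it is not necessarily the expression used by the application.

```python
def local_outliers(fitness, q=2):
    """Local outlier score for every value in a list of fitness values.

    K_i: mean absolute difference between value i and its q nearest
    neighbors. Score_i = K_i / mean(K) (assumed normalization).
    """
    n = len(fitness)
    if n < 2:
        raise ValueError("need at least two fitness values")
    q = min(q, n - 1)
    coeffs = []
    for i, value in enumerate(fitness):
        distances = sorted(abs(value - other)
                           for j, other in enumerate(fitness) if j != i)
        coeffs.append(sum(distances[:q]) / q)
    mean_coeff = sum(coeffs) / n
    if mean_coeff == 0.0:
        # All values identical: nothing is isolated (arbitrary convention).
        return [0.0] * n
    return [k / mean_coeff for k in coeffs]


scores = local_outliers([0.80, 0.82, 0.81, 0.20], q=2)
```

In this example the last fitness lies far from the other three, so it receives by far the largest score.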
S60: substituting a plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into a data management function for calculation to obtain a plurality of management parameters;
Further, substituting the plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into the data management function for calculation to obtain the plurality of management parameters, step S60 of the application further comprises:
S61: constructing a data management function, as follows:
wherein PR is the management parameter, FIT is the fitness of the unstructured data, ISO is the local outlier, and the two remaining coefficients are weights;
S62: substituting the plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into the data management function for calculation to obtain the plurality of management parameters.
Specifically, the data management function expression is obtained, in which PR is the management parameter, FIT is the fitness of the unstructured data, ISO is the local outlier, and the two remaining coefficients are weights. FIT and ISO can be obtained from the preceding processing in the embodiment of the application, and the weights can be configured by a person skilled in the art according to their influence on management. The plurality of fitness values in the unstructured data fitness set are mapped to the corresponding local outliers, each mapped pair is input into the data management function to calculate its management parameter, and the results are integrated to obtain the plurality of management parameters. The plurality of management parameters serve as the reference for deciding on management schemes.
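For illustration, the sketch below assumes the data management function is a weighted sum of the two inputs, PR = w_fit * FIT + w_iso * ISO. The linear form, the default weights, and the function names are assumptions; the description only states that PR is computed from FIT, ISO, and two weights.

```python
def management_parameter(fit, iso, w_fit=0.6, w_iso=0.4):
    """Assumed form of the data management function: a weighted sum."""
    return w_fit * fit + w_iso * iso


def management_parameters(fitness, outliers, w_fit=0.6, w_iso=0.4):
    """Pair each fitness with its local outlier and compute PR for each."""
    if len(fitness) != len(outliers):
        raise ValueError("fitness and outliers must have equal length")
    return [management_parameter(f, o, w_fit, w_iso)
            for f, o in zip(fitness, outliers)]


params = management_parameters([0.5, 0.9], [2.0, 0.5])
```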
S70: obtaining a plurality of management schemes by decision-making according to the plurality of management parameters, and managing the unstructured data set with the plurality of management schemes.
Further, as shown in FIG. 3, obtaining the plurality of management schemes by decision-making, step S70 of the application further comprises:
S71: obtaining a plurality of sample management schemes through the data management processing channels of the designated unit;
S72: adding data noise to the plurality of management parameters to obtain a plurality of sample management parameters;
S73: taking the management parameter as the decision feature, and constructing a data management decision tree from the plurality of sample management parameters and the plurality of sample management schemes based on a decision tree algorithm;
S74: inputting the plurality of management parameters into the data management decision tree to obtain the plurality of management schemes.
Specifically, the plurality of sample management schemes are retrieved from the data management processing channel of the specified unit, for example from historical processing records. Data noise is added to the plurality of management parameters to obtain the plurality of sample management parameters, so that the sample management parameters stay close to the management parameters while remaining random. Further, with the management parameters as decision features, one of the plurality of sample management parameters is randomly selected as a decision node to construct a first decision layer, and the plurality of sample management parameters are classified based on that decision node. Another of the sample management parameters is then randomly selected as a decision node to construct a second decision layer, which subdivides the classification result. Construction continues in this manner up to an Nth decision layer, until the maximum construction level is reached. The first decision layer, the second decision layer, through to the Nth decision layer are then linked hierarchically, and the plurality of sample management schemes are matched to the levels and marked on them, yielding the constructed data management decision tree.
The plurality of management parameters are then input into the data management decision tree for level-by-level traversal matching. The sample management scheme marked on each matching result is taken as the management scheme adapted to the corresponding management parameter, and these schemes are extracted and integrated to obtain the plurality of management schemes. The unstructured data set is then managed in a targeted manner based on the plurality of management schemes, which safeguards the management effect.
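As a rough illustration of steps S71 to S74, the following Python sketch builds a small tree in which each layer picks a random sample parameter as its decision node. It is only a sketch under stated assumptions: the Gaussian noise model, the `hypothetical_scheme` labelling rule (standing in for a lookup in historical processing records), the scheme names, and the maximum depth of 4 are all illustrative and do not come from the application.

```python
import random
from collections import Counter

def add_noise(params, rng, copies=25, sigma=0.03):
    """S72: jitter each management parameter to create sample parameters (assumed Gaussian noise)."""
    return [p + rng.gauss(0.0, sigma) for p in params for _ in range(copies)]

def hypothetical_scheme(param):
    """Stand-in for S71: a lookup in the unit's historical processing records (invented rule)."""
    if param < 0.4:
        return "archive"
    return "review" if param < 0.7 else "retain"

def majority(samples):
    """Most common sample management scheme among (parameter, scheme) pairs."""
    return Counter(scheme for _, scheme in samples).most_common(1)[0][0]

def build_tree(samples, rng, depth=0, max_depth=4):
    """S73: each layer picks a random sample parameter as the decision node."""
    if depth == max_depth or len({scheme for _, scheme in samples}) <= 1:
        return {"scheme": majority(samples)}
    threshold = rng.choice(samples)[0]
    left = [s for s in samples if s[0] < threshold]
    right = [s for s in samples if s[0] >= threshold]
    if not left or not right:  # the random node did not split anything
        return {"scheme": majority(samples)}
    return {"threshold": threshold,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def decide(tree, param):
    """S74: traverse the layers until a node marked with a scheme is reached."""
    while "scheme" not in tree:
        tree = tree["left"] if param < tree["threshold"] else tree["right"]
    return tree["scheme"]

rng = random.Random(0)
management_params = [0.15, 0.42, 0.58, 0.91]
samples = [(q, hypothetical_scheme(q)) for q in add_noise(management_params, rng)]
tree = build_tree(samples, rng)
schemes = [decide(tree, p) for p in management_params]
```

Because the decision nodes are chosen at random, a shallow tree may leave some leaves impure; a larger maximum level or an impurity-based split would tighten the match, at the cost of departing from the random selection described above.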
The method for unstructured data full life cycle management has the following technical effects:
1. Unstructured data is counted, document data and graphic data are extracted, and a targeted processing mode is configured for each, combining the graphic standard and the document standard for similarity analysis, which ensures the accuracy of the data analysis result.
2. A graphic similarity identifier is constructed based on a Siamese (twin) network, and similarity analysis of the graphic data determines its fitness; keyword splitting and one-hot encoding are applied to the document data, so that the text is quantized in an ordered way and mapped to the standard to determine its fitness, which improves the precision of the fitness analysis.
3. Local outlier analysis is performed based on the fitness to determine the measurement standard for the management parameters. The data parameters are quantized by means of the processing function, which facilitates intuitive analysis and processing. Management scheme decisions take the fitness and the local outliers as their basis, which ensures that each scheme fits the data.
Example two
Based on the same inventive concept as the method for unstructured data full life cycle management in the foregoing embodiment, as shown in fig. 4, the present application provides a system for unstructured data full life cycle management, the system comprising:
the data statistics module 11 is configured to access the unstructured data management device to a database of a specified unit through an interface, and count all unstructured data to be managed in the specified unit to obtain an unstructured data set;
a data extraction module 12, where the data extraction module 12 is configured to extract, through the graphic extractor and the document extractor, graphic data and document data in the unstructured data set, and integrate to obtain a graphic data set and a document data set;
a data standard obtaining module 13, where the data standard obtaining module 13 is configured to obtain an unstructured data standard in the specified unit, where the unstructured data standard includes a graphics standard and a document standard, and the graphics standard includes a mapping relationship between a plurality of graphics and a plurality of fitness degrees;
the fitness analysis module 14 is configured to perform, in the processor, fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard, obtain a graphic fitness set and a document fitness set, and calculate an unstructured data fitness set therefrom;
the local outlier analysis module 15 is configured to analyze local outliers of a plurality of unstructured data in the unstructured data set according to the unstructured data fitness set, so as to obtain a plurality of local outliers;
the management parameter calculation module 16, where the management parameter calculation module 16 is configured to substitute a plurality of fitness values and a plurality of local outliers in the unstructured data fitness set into a data management function to perform calculation, so as to obtain a plurality of management parameters;
the data management module 17 is configured to make a decision to obtain a plurality of management schemes, and manage the unstructured data sets, where the plurality of management schemes are obtained by making a decision according to the plurality of management parameters.
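The data flow between modules 15, 16 and 17 can be sketched in Python as a simple composition of three steps. This is a hedged outline only: the function name `run_management` and the three stand-in callables in the usage lines are hypothetical placeholders, not the formulas or the decision tree of the application.

```python
from typing import Callable, List, Sequence

def run_management(fitness: Sequence[float],
                   outlier_fn: Callable[[Sequence[float]], List[float]],
                   param_fn: Callable[[float, float], float],
                   decide_fn: Callable[[float], str]) -> List[str]:
    """Chain the last three modules: fitness set -> outliers -> parameters -> schemes."""
    outliers = outlier_fn(fitness)                                  # local outlier analysis (module 15)
    params = [param_fn(f, o) for f, o in zip(fitness, outliers)]    # management parameters (module 16)
    return [decide_fn(p) for p in params]                           # management schemes (module 17)

# Usage with deliberately trivial stand-ins (all three are assumptions):
def toy_outliers(values):
    mean = sum(values) / len(values)
    return [abs(v - mean) for v in values]

result = run_management(
    [0.9, 0.8, 0.2],
    toy_outliers,
    lambda fit, iso: fit - iso,
    lambda p: "retain" if p >= 0.5 else "review",
)
```

Passing the three steps in as callables keeps the outline independent of any particular outlier formula, management function, or decision procedure.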
Further, the data standard obtaining module 13 further includes:
a fitness assignment module, configured to assign fitness to a plurality of graphic data in the graphic data set to obtain a plurality of graphic fitness values;
a document fitness acquisition module, configured to acquire a plurality of project document sets of a plurality of job projects from the unstructured data in the specified unit, and to assign fitness to them to obtain a plurality of document fitness sets;
a document coding word bag construction module, configured to perform keyword splitting and one-hot encoding on the plurality of project document sets to construct a document coding word bag;
and a fitness mapping module, configured to map the plurality of graphic data to the plurality of graphic fitness values to obtain the graphic standard, and to map the document coding word bag to the plurality of document fitness sets to obtain the document standard.
Further, the suitability analysis module 14 further includes:
a graphic similarity identifier training module, configured to train a graphic similarity identifier using the graphic data set, based on a Siamese (twin) network;
a graphic fitness acquisition module, configured to input a plurality of graphics in the graphic data set, together with the graphics in the graphic standard, into the graphic similarity identifier, obtain the matched graphics, and map them to obtain the graphic fitness set;
a document fitness acquisition module, configured to split a plurality of documents in the document data set into keywords, perform one-hot encoding on them, and obtain the document fitness set by statistics in combination with the document coding word bag;
and a data fitness calculation module, configured to calculate the unstructured data fitness set according to the graphic fitness set and the document fitness set.
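The keyword splitting, one-hot encoding and word-bag statistics described above can be sketched in Python as follows. The sketch rests on several assumptions: whitespace splitting stands in for the keyword splitter (Chinese text would need a word segmenter), and the statistic used for the document fitness, the share of a document's keywords found in the word bag, is a hypothetical choice, since the text does not fix one.

```python
def build_word_bag(project_docs):
    """Assign every keyword seen in the project documents its own one-hot index."""
    vocab = {}
    for doc in project_docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """One-hot vector for a keyword; all zeros if the keyword is not in the bag."""
    vec = [0] * len(vocab)
    if word in vocab:
        vec[vocab[word]] = 1
    return vec

def document_fitness(doc, vocab):
    """Assumed statistic: fraction of the document's keywords present in the word bag."""
    words = doc.lower().split()
    if not words:
        return 0.0
    hits = sum(sum(one_hot(w, vocab)) for w in words)
    return hits / len(words)

vocab = build_word_bag(["pump maintenance report", "valve inspection report"])
report_vec = one_hot("report", vocab)
fit = document_fitness("pump inspection memo", vocab)
```

In this toy run the word bag holds five keywords, and two of the three keywords in the new document are found in it, so its assumed fitness is two thirds.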
Further, the graph similarity identifier training module further includes:
the pattern recognition network construction module is used for constructing two pattern recognition networks with the same network architecture based on the twin network;
a loss function construction module, configured to construct a loss function according to the following formula:
wherein LOSS is the loss; M is the number of graphic data pairs obtained by randomly combining the graphic data in the graphic data set two by two; P indicates whether the two input graphic data belong to the same category, 0 meaning yes and 1 meaning no; X and Y are the two graphic data in the i-th input pair; and the two component terms of the formula are, respectively, the loss function when the two input graphic data belong to the same category and the loss function when they do not;
and the network training module is used for training the two pattern recognition networks according to the loss function and sharing network parameters until convergence conditions are met, so as to obtain the pattern similarity recognizer.
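A loss of the kind described for the twin network can be sketched in Python with the standard contrastive loss. This is a swapped-in, commonly used form and not necessarily the formula of the application: the squared distance for same-category pairs, the squared margin term for different-category pairs, the `margin` value, and the linear `embed` branch are all assumptions. Only the averaging over M pairs and the convention that P is 0 for the same category and 1 otherwise follow the description.

```python
import math

def embed(x, weights):
    """Shared branch: both inputs pass through the same weights (parameter sharing)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def contrastive_loss(pairs, weights, margin=1.0):
    """Mean over M pairs of (1 - P) * same_term + P * diff_term, with P = 0 for same category."""
    total = 0.0
    for x, y, p in pairs:
        d = math.dist(embed(x, weights), embed(y, weights))
        same_term = d ** 2                        # assumed loss for same-category pairs
        diff_term = max(0.0, margin - d) ** 2     # assumed loss for different-category pairs
        total += (1 - p) * same_term + p * diff_term
    return total / len(pairs)

identity = [[1.0, 0.0], [0.0, 1.0]]
pairs = [
    ([0.0, 0.0], [0.0, 0.0], 0),   # same category, identical embeddings
    ([0.0, 0.0], [3.0, 4.0], 1),   # different category, already far apart
    ([0.0, 0.0], [0.6, 0.8], 0),   # same category, distance 1
]
loss = contrastive_loss(pairs, identity)
```

Only the third pair contributes here: it is a same-category pair whose embeddings are one unit apart, so the mean loss over the three pairs is about one third.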
Further, the local outlier analysis module 15 further comprises:
wherein the value given by the formula is the local outlier; K is the outlier coefficient of an unstructured data fitness, obtained by calculating the mean of the distances between that unstructured data fitness and its Q nearest unstructured data fitness values; and N is the number of unstructured data fitness values in the unstructured data fitness set.
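The outlier coefficient K described above, the mean distance from one fitness value to its Q nearest neighbours, can be sketched in Python as follows. How the coefficients are combined into the final local outlier value is assumed here: the sketch divides each K by the mean of all N coefficients, which is one plausible reading and should not be taken as the formula of the application.

```python
def outlier_coefficients(fitness, q):
    """K for each fitness value: mean distance to its q nearest other fitness values."""
    ks = []
    for i, f in enumerate(fitness):
        dists = sorted(abs(f - g) for j, g in enumerate(fitness) if j != i)
        ks.append(sum(dists[:q]) / q)
    return ks

def local_outliers(fitness, q):
    """Assumed normalisation: each K divided by the mean K over all N fitness values."""
    ks = outlier_coefficients(fitness, q)
    mean_k = sum(ks) / len(ks)
    return [k / mean_k if mean_k else 0.0 for k in ks]

fitness_set = [0.80, 0.82, 0.81, 0.20]
scores = local_outliers(fitness_set, q=2)
```

With these four values the isolated fitness 0.20 receives a score well above 1, while the three clustered values score below 1. The sketch requires Q to be at most N minus 1.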
Further, the management parameter calculation module 16 further includes:
a data management function construction module, configured to construct a data management function according to the following formula:
;
wherein PR is a management parameter, FIT is the fitness of the unstructured data, ISO is a local outlier, and the two coefficients in the formula are the weights;
and the parameter acquisition module is used for substituting the multiple fitness values and the multiple local outliers in the unstructured data fitness set into a data management function for calculation to obtain multiple management parameters.
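A data management function that combines FIT and ISO through two weights can be sketched in Python as a weighted sum. The weighted-sum form, the names `w_fit` and `w_iso`, and their default values are assumptions made purely for illustration; the text only states that the function takes the fitness and the local outlier and uses two weights.

```python
def management_parameter(fit, iso, w_fit=0.7, w_iso=0.3):
    """Assumed form: PR as a weighted sum of fitness (FIT) and local outlier (ISO)."""
    return w_fit * fit + w_iso * iso

fits = [0.9, 0.4]
isos = [0.2, 1.5]
params = [management_parameter(f, o) for f, o in zip(fits, isos)]
```

Under these assumed weights, a well-fitting, unremarkable item (FIT 0.9, ISO 0.2) and a poorly fitting, strongly outlying item (FIT 0.4, ISO 1.5) receive parameters of about 0.69 and 0.73 respectively.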
Further, the data management module 17 further includes:
the sample management scheme acquisition module is used for acquiring a plurality of sample management schemes according to the data management processing channel of the specified unit;
the sample management parameter acquisition module is used for adding data noise to the plurality of management parameters to acquire a plurality of sample management parameters;
the decision tree construction module is used for constructing a data management decision tree by taking management parameters as decision characteristics and adopting the plurality of sample management parameters and a plurality of sample management schemes based on a decision tree algorithm;
the management scheme acquisition module is used for inputting the plurality of management parameters into the data management decision tree to obtain the plurality of management schemes.
From the foregoing detailed description of the method for unstructured data full life cycle management, those skilled in the art can clearly understand the method and system for unstructured data full life cycle management in this embodiment. Because the device disclosed in this embodiment corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for unstructured data full lifecycle management, wherein the method is applied to an unstructured data management apparatus, the apparatus comprising a document extractor, a graphics extractor, and a processor, the method comprising:
connecting the unstructured data management apparatus to a database of a specified unit through an interface, and counting all unstructured data to be managed in the specified unit to obtain an unstructured data set;
extracting graphic data and document data from the unstructured data set through the graphics extractor and the document extractor, and integrating them to obtain a graphic data set and a document data set;
obtaining an unstructured data standard of the specified unit, wherein the unstructured data standard comprises a graphic standard and a document standard, and the graphic standard comprises a mapping relation between a plurality of graphics and a plurality of fitness values;
in the processor, performing fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, and calculating an unstructured data fitness set therefrom;
analyzing, according to the unstructured data fitness set, local outliers of a plurality of unstructured data in the unstructured data set to obtain a plurality of local outliers;
substituting a plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into a data management function for calculation to obtain a plurality of management parameters;
and deciding to acquire a plurality of management schemes and managing the unstructured data set, wherein the plurality of management schemes are acquired through decision making according to the plurality of management parameters.
2. The method of claim 1, wherein obtaining the unstructured data standard of the specified unit comprises:
assigning fitness to a plurality of graphic data in the graphic data set to obtain a plurality of graphic fitness values;
acquiring a plurality of project document sets of a plurality of job projects from the unstructured data in the specified unit, and assigning fitness to them to obtain a plurality of document fitness sets;
performing keyword splitting and one-hot encoding on the plurality of project document sets to construct a document coding word bag;
mapping the plurality of graphic data to the plurality of graphic fitness values to obtain the graphic standard, and mapping the document coding word bag to the plurality of document fitness sets to obtain the document standard.
3. The method of claim 2, wherein performing fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard comprises:
training a graphic similarity identifier using the graphic data set, based on a Siamese (twin) network;
inputting a plurality of graphics in the graphic data set, together with the graphics in the graphic standard, into the graphic similarity identifier, obtaining the matched graphics, and mapping them to obtain the graphic fitness set;
splitting a plurality of documents in the document data set into keywords, performing one-hot encoding on them, and obtaining the document fitness set by statistics in combination with the document coding word bag;
and calculating the unstructured data fitness set according to the graphic fitness set and the document fitness set.
4. The method according to claim 3, wherein training a graphic similarity identifier using the graphic data set, based on a Siamese (twin) network, comprises:
based on the twin network, two pattern recognition networks with the same network architecture are constructed;
constructing a loss function according to the following formula:
wherein LOSS is the loss; M is the number of graphic data pairs obtained by randomly combining the graphic data in the graphic data set two by two; P indicates whether the two input graphic data belong to the same category, 0 meaning yes and 1 meaning no; X and Y are the two graphic data in the i-th input pair; and the two component terms of the formula are, respectively, the loss function when the two input graphic data belong to the same category and the loss function when they do not;
training the two pattern recognition networks according to the loss function, and sharing network parameters until convergence conditions are met, so as to obtain the pattern similarity recognizer.
5. The method of claim 1, wherein the local outliers of the plurality of unstructured data in the unstructured data set are analyzed according to the unstructured data fitness set by the following formula:
wherein the value given by the formula is the local outlier; K is the outlier coefficient of an unstructured data fitness, obtained by calculating the mean of the distances between that unstructured data fitness and its Q nearest unstructured data fitness values; and N is the number of unstructured data fitness values in the unstructured data fitness set.
6. The method of claim 1, wherein substituting the plurality of fitness values and the plurality of local outliers within the unstructured data fitness set into a data management function for calculation results in a plurality of management parameters, comprising:
constructing a data management function according to the following formula:
;
wherein PR is a management parameter, FIT is the fitness of the unstructured data, ISO is a local outlier, and the two coefficients in the formula are the weights;
substituting the multiple fitness values and the multiple local outliers in the unstructured data fitness set into a data management function for calculation to obtain multiple management parameters.
7. The method of claim 1, wherein deciding to acquire a plurality of management schemes comprises:
acquiring a plurality of sample management schemes according to the data management processing channel of the specified unit;
adding data noise to the plurality of management parameters to obtain a plurality of sample management parameters;
taking the management parameters as decision features, and constructing a data management decision tree by adopting the plurality of sample management parameters and the plurality of sample management schemes based on a decision tree algorithm;
and inputting the management parameters into the data management decision tree to obtain the management schemes.
8. A system for unstructured data full lifecycle management, wherein the system is applied to an unstructured data management apparatus, the apparatus comprising a document extractor, a graphics extractor, and a processor, the system comprising:
a data statistics module, configured to connect the unstructured data management apparatus to a database of a specified unit through an interface, and to count all unstructured data to be managed in the specified unit to obtain an unstructured data set;
a data extraction module, configured to extract graphic data and document data from the unstructured data set through the graphics extractor and the document extractor, and to integrate them to obtain a graphic data set and a document data set;
a data standard acquisition module, configured to obtain an unstructured data standard of the specified unit, wherein the unstructured data standard comprises a graphic standard and a document standard, and the graphic standard comprises a mapping relation between a plurality of graphics and a plurality of fitness values;
a fitness analysis module, configured to perform, in the processor, fitness analysis on the graphic data set and the document data set according to the graphic standard and the document standard to obtain a graphic fitness set and a document fitness set, and to calculate an unstructured data fitness set therefrom;
a local outlier analysis module, configured to analyze, according to the unstructured data fitness set, local outliers of a plurality of unstructured data in the unstructured data set to obtain a plurality of local outliers;
a management parameter calculation module, configured to substitute a plurality of fitness values in the unstructured data fitness set and the plurality of local outliers into a data management function for calculation to obtain a plurality of management parameters;
and a data management module, configured to decide to acquire a plurality of management schemes and to manage the unstructured data set, wherein the plurality of management schemes are acquired through decision making according to the plurality of management parameters.
CN202311031481.8A 2023-08-16 2023-08-16 Method and system for unstructured data full life cycle management Active CN116737884B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311031481.8A CN116737884B (en) 2023-08-16 2023-08-16 Method and system for unstructured data full life cycle management

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311031481.8A CN116737884B (en) 2023-08-16 2023-08-16 Method and system for unstructured data full life cycle management

Publications (2)

Publication Number Publication Date
CN116737884A CN116737884A (en) 2023-09-12
CN116737884B true CN116737884B (en) 2023-10-10

Family

ID=87919089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311031481.8A Active CN116737884B (en) 2023-08-16 2023-08-16 Method and system for unstructured data full life cycle management

Country Status (1)

Country Link
CN (1) CN116737884B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033168A (en) * 2018-06-20 2018-12-18 江苏网域科技有限公司 A kind of data processing system based on big data
CN113657605A (en) * 2020-05-12 2021-11-16 埃森哲环球解决方案有限公司 Document processor based on artificial intelligence AI

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161249A1 (en) * 2013-12-05 2015-06-11 Lenovo (Singapore) Ptd. Ltd. Finding personal meaning in unstructured user data
US10846341B2 (en) * 2017-10-13 2020-11-24 Kpmg Llp System and method for analysis of structured and unstructured data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xi Jianfei et al. "Data extraction method for unstructured table documents based on deep learning". Microcomputer Applications. 2022, Vol. 38 (No. 2), pp. 102-105. *

Also Published As

Publication number Publication date
CN116737884A (en) 2023-09-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant