CN109344298A - Method and device for converting unstructured data into structured data - Google Patents

Publication number
CN109344298A
Authority
CN
China
Prior art keywords
data
unstructured data
transformed
unstructured
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811289109.6A
Other languages
Chinese (zh)
Inventor
黄文琦
明哲
许爱东
滑春波
陈华军
杨航
关泽武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China South Power Grid International Co ltd
China Southern Power Grid Co Ltd
Original Assignee
China South Power Grid International Co ltd
China Southern Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China South Power Grid International Co ltd and China Southern Power Grid Co Ltd
Priority to CN201811289109.6A
Publication of CN109344298A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for converting unstructured data into structured data, which can convert file titles, storage addresses, file index information and the like of unstructured data to be converted into structured data, and can extract first target information related to the content of the unstructured data in the unstructured data to be converted according to an algorithm model corresponding to the type of the unstructured data to be converted; and then the first target information is converted into structured data according to a predefined rule, and further the content of the unstructured data can be converted into the structured data. The unstructured data to be converted can be subjected to structured conversion in multiple aspects, and the unstructured data can be searched or managed by utilizing the content of the unstructured data, so that the visualization and query efficiency of the unstructured data is improved, and the management difficulty is reduced. In addition, the invention also discloses a device for converting unstructured data into structured data, and the effect is as above.

Description

A method and device for converting unstructured data into structured data
Technical field
The present invention relates to the field of data type conversion, and in particular to a method and device for converting unstructured data into structured data.
Background
Data in current use falls into three structural types. Structured data can be represented by numbers or a uniform structure and stored in a database; it has a definite structure and can be represented as a two-dimensional table. Unstructured data has no fixed data structure and cannot be represented as a two-dimensional table; examples are documents, images and video. Semi-structured data (for example XML documents) lies between structured and unstructured data: it has structure, but that structure varies widely.
The three types have the following characteristics. Structured data is easy to manage, efficient to query and highly reliable, supports permission control, and has a low management cost. It is usually stored in a relational database, which makes lookups convenient and efficient. Its greatest drawback is that it is hard to extend: because it has a fixed format and template, adding a data attribute is very difficult. Semi-structured data is increasingly important, mainly because of its flexibility: it is "schema-less", the data is self-describing and carries its own schema information, and that schema can change arbitrarily over time in a centralized database. Unstructured data is highly extensible and flexible, but faces great difficulty in data management and querying, so converting unstructured data into structured data is particularly important.
At present, conversion mainly covers the information that can represent an unstructured file, such as its file title, storage address and identifier; the structured file title, storage address and identifier obtained are then used to look up or manage the unstructured file. This kind of conversion is rather limited, however: the content of the converted file still consists of unstructured data, so the visualization and management problems remain, and with this prior-art approach management and querying are still difficult.
It can be seen that unstructured data is currently converted into structured data in only one way, which leads to poor visualization and makes querying and management difficult. How to overcome this is a problem that those skilled in the art urgently need to solve.
Summary of the invention
The embodiments of the present application provide a method and device for converting unstructured data into structured data, to solve the prior-art problem that unstructured data is converted into structured data in only one way, which leads to poor visualization of unstructured data and makes querying and management difficult.
To solve the above technical problem, the present invention provides a method for converting unstructured data into structured data, including converting target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which lie outside the content of the unstructured data to be converted; the method further includes:
extracting first target information corresponding to the content of the unstructured data to be converted, according to an algorithm model corresponding to the type of the unstructured data to be converted;
converting the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
Preferably, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
Preferably, extracting the first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted, specifically includes:
determining the prior probability of each data item in the content of the text file;
calculating the similarity of each data item in the content of the text file according to the prior probability;
determining the type or semantics of each data item in the content of the text file according to the similarity, and clustering data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
Preferably, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model.
Preferably, extracting the first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted, is specifically:
extracting the first target information using the RBF (radial basis function) neural network in the deep neural network model.
Preferably, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the image to obtain multiple sub-images;
extracting features from each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
Preferably, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the content of the video file according to determined partitioning parameters to obtain multiple sub-videos;
converting each sub-video into sub-images by frame-by-frame analysis;
extracting features from each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
Preferably, converting the first target information into structured data according to the predefined rule specifically includes:
converting the first target information into semi-structured data according to the file template of the unstructured data to be converted;
performing MapReduce parallel processing on the semi-structured data;
converting the semi-structured data after the MapReduce parallel processing into structured data using XML technology.
To solve the above technical problem, the present invention also provides a device for converting unstructured data into structured data that corresponds to the above method. The device includes a first structure conversion module for converting target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which lie outside the content of the unstructured data to be converted, and further includes:
an extraction module for extracting first target information corresponding to the content of the unstructured data to be converted, according to an algorithm model corresponding to the type of the unstructured data to be converted;
a second structure conversion module for converting the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
To solve the above technical problem, the present invention also provides another device for converting unstructured data into structured data that corresponds to the above method, including:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of any of the above methods for converting unstructured data into structured data.
Compared with the prior art, the method for converting unstructured data into structured data provided by the present invention can not only convert the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but can also extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is also converted into structured data. The unstructured data to be converted can thus be structured in several respects, and it can be searched or managed by means of its content information, which improves the visualization and query efficiency of unstructured data and reduces the management difficulty. In addition, the present invention also provides a device for converting unstructured data into structured data, with the same effects.
Brief description of the drawings
Fig. 1 is a flowchart of a method for converting unstructured data into structured data provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the composition of a device for converting unstructured data into structured data provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the composition of another device for converting unstructured data into structured data provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The core of the present invention is to provide a method and device for converting unstructured data into structured data, which can solve the problem that unstructured data is converted into structured data in only one way, leading to poor visualization of unstructured data and making querying and management difficult.
To enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of a method for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
S101: convert the target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which lie outside the content of the unstructured data to be converted.
Specifically, the main information that can represent the unstructured data to be converted, such as its file title, storage address and index information, is converted, and the resulting structured file title, storage address and index information can be used to look up or manage the unstructured data. Unstructured data is in practice a file, such as a picture or a video.
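Step S101 can be illustrated with a minimal sketch, assuming the structured store is a relational table. The table name `file_index`, the column names, and the use of the file size as a stand-in for "index information" are hypothetical choices made for illustration; the patent does not specify them.

```python
import os
import sqlite3
import tempfile

def metadata_record(path):
    """Build one structured row from a file's title, storage address and index info."""
    info = os.stat(path)
    return {
        "title": os.path.basename(path),
        "address": os.path.abspath(path),
        "size_bytes": info.st_size,  # hypothetical stand-in for "index information"
    }

def store(conn, record):
    """Insert the row into a relational table (created on first use)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_index "
        "(title TEXT, address TEXT PRIMARY KEY, size_bytes INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO file_index VALUES (:title, :address, :size_bytes)",
        record,
    )

# Demo on a temporary file so the sketch does not depend on any real path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "inspection_report.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("transformer inspection notes")
    conn = sqlite3.connect(":memory:")
    store(conn, metadata_record(demo_path))
    rows = conn.execute("SELECT title, size_bytes FROM file_index").fetchall()
```

Once the rows exist, the file can be looked up by title or address with an ordinary SQL query, which is the kind of lookup and management that S101 enables.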
S102: extract first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted.
Specifically, for different types of unstructured data to be converted, different algorithm models are needed to extract the first target information corresponding to the content of that type of data. The first target information is in fact the key information in the content of the unstructured data to be converted.
S103: convert the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
After the key information in the content of the unstructured data to be converted has been extracted, the first target information can be converted into structured data using the predefined rule, which achieves the structuring conversion of the content. The unstructured data to be converted can then be queried or managed by its content, which improves the query efficiency of unstructured data, reduces its management difficulty, and facilitates visualization. In practical applications, step S101 and step S102 have no fixed order of execution: step S101 can be executed first, step S102 can be executed first, or, where conditions permit, the two can be executed simultaneously. That is, structuring the content of the unstructured data to be converted and structuring its file title, storage address, file index information and the like have no fixed order, and the present invention does not limit the specific execution order of step S101 and step S102. The unstructured data to be converted in the embodiments of the present application mainly includes text files, image files and video files. The structuring conversion processes for text files, image files and video files are described in detail below.
First, the structuring conversion process when the unstructured data to be converted is a text file.
To ensure the coverage and accuracy of the keywords extracted from the text file content, on the basis of the above embodiment, in a preferred embodiment, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
To further improve the coverage and accuracy of keyword extraction from the text file content, when the LDA topic model is used to structure the content of a text file, on the basis of the above embodiment, in a preferred embodiment, extracting the first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted, specifically includes:
determining the prior probability of each data item in the content of the text file;
calculating the similarity of each data item in the content of the text file according to the prior probability;
determining the type or semantics of each data item in the content of the text file according to the similarity, and clustering data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
The first step extracts the keyword information (the first target information) from the text file content. Keyword extraction is the key step of extracting reliable, meaningful words or phrases from the set of text files, and it affects all subsequent steps. Such words or phrases often have a fixed structure, salient topic words, highly consistent semantics and a pronounced domain character, and are commonly used to describe classification information such as domain-related topics and knowledge. Extracting the content information of the text file is therefore the most critical and basic step of the whole text-content classification; without accurate and comprehensive keyword extraction, the coverage and accuracy of the knowledge system built from the text information cannot be guaranteed.
Compared with other models, the LDA topic model extracts keyword information with higher accuracy and salience. In essence, the LDA topic model is a three-layer Bayesian probability model with a word, topic and document structure, and it is an unsupervised machine learning technique. It can be used to identify the main information in a large-scale document set or corpus. The prior probability of each data item in the text file is obtained first, and each text is treated as a word-frequency vector using the bag-of-words method, which makes it easy to turn the text into a mathematical model. The bag-of-words model does not consider the order of data items within the text file content, so modeling the word-frequency vectors yields a probability ranking of the data items.
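The bag-of-words step can be sketched as follows. This is not an LDA implementation: it only builds word-frequency vectors and ranks words by their relative frequency in the corpus, used here as a simple stand-in for the "prior probability" the text mentions. The whitespace tokenizer, function names and sample documents are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

def word_probabilities(documents):
    """Relative frequency of each word over the corpus, sorted from high to low."""
    totals = Counter()
    for doc in documents:
        totals.update(bag_of_words(doc))
    n = sum(totals.values())
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(((w, c / n) for w, c in totals.items()),
                  key=lambda item: (-item[1], item[0]))

docs = [
    "transformer fault report",
    "transformer maintenance report",
    "line fault",
]
ranking = word_probabilities(docs)
```

A full topic model would replace the plain frequency with per-topic word distributions, but the input it consumes is the same kind of word-frequency vector built here.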
The second step calculates the similarity of each data item in the text file content.
Words or phrases are the smallest units that make up a sentence or an article, and word similarity is the premise and basis of natural language processing and text knowledge mining. Word similarity provides essential technical support for tasks such as matching large batches of data and fast search-engine responses to users. A system that computes keyword similarity automatically can be built from a large corpus, and the similarity-matching algorithm with the best strategy can then be determined.
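One common way to compute word similarity from a corpus is cosine similarity over occurrence vectors; the patent does not name a specific measure, so the choice of cosine similarity, the function names and the sample documents below are assumptions for illustration.

```python
import math

def word_vectors(documents):
    """Represent each word by the vector of its counts in every document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    return {w: [doc.count(w) for doc in tokenized] for w in vocab}

def cosine(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "transformer fault report",
    "transformer maintenance report",
    "line fault",
]
vectors = word_vectors(docs)
sim_same = cosine(vectors["transformer"], vectors["report"])  # always co-occur
sim_diff = cosine(vectors["transformer"], vectors["line"])    # never co-occur
```

Words that appear in the same documents score close to 1, and words that never appear together score 0, which gives the clustering step a numeric basis for grouping.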
The third step determines the type or semantics of each data item in the text file content, and clusters data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
Specifically, clustering data of the same type or the same semantics reduces the huge amount of computation and quantifies the semantic relationships among the data items. A clustering algorithm (such as K-Means clustering or hierarchical clustering) yields the hierarchical relationships among the data items and further classifies them, so that the content of the text file is reduced to several classes of words that express its topics, from which the first target information is obtained. The embodiment of the present application can thus extract the keyword information in the text file content efficiently and accurately.
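The clustering step can be sketched with a plain K-Means loop. The feature vectors below are made-up per-document counts for four keywords, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; neither comes from the patent.

```python
def kmeans(points, k, iterations=20):
    """Plain K-Means on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

keywords = ["transformer", "line", "report", "maintenance"]
keyword_vectors = [
    [1.0, 1.0, 0.0],  # transformer
    [0.0, 0.0, 1.0],  # line
    [1.0, 1.0, 0.0],  # report
    [0.0, 1.0, 0.0],  # maintenance
]
labels, centroids = kmeans(keyword_vectors, k=2)

# Group the keywords by cluster label to obtain topic-like word classes.
clusters = {}
for word, label in zip(keywords, labels):
    clusters.setdefault(label, []).append(word)
```

In practice the initial centroids are usually chosen randomly or with a seeding heuristic, and the number of clusters has to be tuned to the corpus.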
In the fourth step, the first target information obtained from the text file is converted into structured data.
The results obtained with the best text keyword extraction method, the best keyword similarity-matching algorithm and the best cluster analysis algorithm are combined into a formal representation, which is converted and processed according to certain rules into semi-structured data. The conversion of unstructured data into structured data is described in detail below and is not repeated here.
Second, the structuring conversion process when the unstructured data to be converted is an image file or a video file.
To improve the accuracy with which the key information of the file content is extracted, on the basis of the above embodiment, in a preferred embodiment, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model. To improve the efficiency with which the key information of the file content is extracted, when the deep neural network model is used to structure the content of an image file or a video file, in a preferred embodiment, extracting the first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted, is specifically:
extracting the first target information using the RBF (radial basis function) neural network in the deep neural network model.
A deep neural network model is a newer kind of artificial neural network. It features local receptive fields, a hierarchical structure, and global training that combines feature extraction with classification, and it is widely used in image recognition. Several filter layers that extract features at different scales are built into the algorithm model, and these models are applied to image recognition problems. Given the hierarchical structure of the deep neural network model and its extraction of features from local receptive fields, appropriately increasing the number of receptive fields increases the quantity and quality of the features each layer can extract, which improves the recognition capability of the model and makes it more robust.
To improve the efficiency of key-information extraction, the embodiment of the present application restricts the domain of each frame image, narrowing the semantic gap between low-level features and high-level concepts. A support vector machine is therefore used for model learning, with a Gaussian radial basis function as the kernel, which gives a radial basis function classifier. Because the RBF kernel maps samples into a higher-dimensional space, it can handle non-linear relationships between image labels and features; it is a strongly local kernel function, is highly flexible, and is among the most widely used kernels. Parameter tuning has to be considered when the RBF kernel is used: good parameters let the classifier predict unknown data correctly and achieve high training accuracy, that is, a high rate of correctly predicted class labels.
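The Gaussian radial basis function kernel mentioned above has the standard form exp(-gamma * ||x - y||^2). The sketch below computes it and the resulting kernel (Gram) matrix for a few sample vectors; the parameter name `gamma`, its default value and the sample points are assumptions for illustration, and no classifier is trained here.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(samples, gamma=0.5):
    """Pairwise kernel values; this is the matrix a kernel SVM works with."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]

samples = [[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
K = gram_matrix(samples)
```

The kernel value falls off quickly with distance, which is the "locality" the text refers to. A larger `gamma` makes the kernel more local and a smaller one makes it smoother, which is why this parameter has to be tuned.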
The method for converting unstructured data into structured data provided by the present invention can not only convert the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but can also extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is also converted into structured data. The unstructured data to be converted can thus be structured in several respects, and it can be searched or managed by means of its content information, which improves the visualization and query efficiency of unstructured data and reduces the management difficulty.
To further speed up the extraction of key information from the content of an image file, on the basis of the above embodiment, in a preferred embodiment, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the image file to obtain multiple sub-images;
extracting features from each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
Specifically, the image file is divided into multiple sub-images according to partitioning parameters, features are extracted from each sub-image (that is, key information is extracted from each sub-image), and the key information extracted from the sub-images is clustered to obtain the second target information, which is used as the key information of the image file content (the first target information). During image feature extraction, the basic task of feature selection and extraction is to select the most effective features from a large set of features. An effective image feature is one that is extracted from the color image, reflects the image itself, and distinguishes it from other images. Feature extraction yields features that are a more effective basis for recognition and reduces the dimensionality of the measurement space, so that recognition can be carried out in a low-dimensional feature space, which greatly improves recognition quality.
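The split-then-extract flow can be sketched on a tiny grayscale image stored as a list of rows. The tile size plays the role of the partitioning parameters, and the mean intensity is a deliberately simple stand-in for the features a neural network would extract; both are assumptions for illustration.

```python
def split_into_tiles(image, tile_h, tile_w):
    """Split a 2-D list of pixel values into non-overlapping tiles, row-major."""
    tiles = []
    for top in range(0, len(image), tile_h):
        for left in range(0, len(image[0]), tile_w):
            tiles.append([row[left:left + tile_w]
                          for row in image[top:top + tile_h]])
    return tiles

def mean_intensity(tile):
    """A very simple per-tile feature: the average pixel value."""
    values = [v for row in tile for v in row]
    return sum(values) / len(values)

image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 1, 1],
    [5, 5, 1, 1],
]
tiles = split_into_tiles(image, 2, 2)
tile_features = [mean_intensity(t) for t in tiles]
```

The per-tile features would then be clustered to obtain the second target information.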
On the basis of the above embodiment, in a preferred embodiment, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the content of the video file according to determined partitioning parameters to obtain multiple sub-videos;
converting each sub-video into sub-images by frame-by-frame analysis;
extracting features from each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
Specifically, the video file is first divided into multiple sub-videos, the sub-videos are then converted into sub-images (that is, image files) by frame-by-frame analysis, and finally the third target information is extracted from the sub-images and used as the first target information. In other words, a video file can be processed in the same way as an image file.
Video files are not only the largest data in terms of storage but also the most typical heterogeneous big data. A video file corresponds to different data types at different processing stages: unstructured data (video, images), semi-structured data (features), and structured data (feature vectors, descriptive attributes). Video image data is processed by gradually converting the unstructured data into semi-structured data, then performing statistical and association analysis, and finally converting the result into structured data stored in a database.
Labeling video data helps make the extraction and description of the video file content accurate and stable, and makes the content-analysis algorithm more targeted. In principle, the more detailed the structured description of the video content the better, but this places very strict requirements on the video attribute labels, so the scenes of a video file are divided according to color, scene, time and so on. Video file data contains a large amount of unstructured data, and video content mining analyzes the video frame by frame after decoding. The video file is first split according to partitioning parameters such as the frame count and the number of segments. The partitioning parameters are the premise and basis for accurate and reliable video file processing. Attribute labels are then attached to the sub-video files, and the quality of those labels directly affects how comprehensive the structured description of the video content is.
Frame-by-frame analysis of a video file is in essence the extraction of information from each frame. The parts of a frame that differ markedly from the preceding frame are the main objects of detection; the potential difference regions are determined by background modeling and target segmentation algorithms. To speed up convergence of the training model, preprocessing is generally performed before image recognition, including removing noise, reducing the dimensionality of the input data and deleting irrelevant data.
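Splitting a video by frame count and finding the frames that differ from their predecessor can be sketched as follows. Frames are represented as small 2-D lists of pixel values; the clip length, the difference threshold and the plain pixel-difference count (used instead of real background modeling) are simplifying assumptions for illustration.

```python
def split_frames(frames, frames_per_clip):
    """Split a decoded frame sequence into sub-videos of a fixed frame count."""
    return [frames[i:i + frames_per_clip]
            for i in range(0, len(frames), frames_per_clip)]

def frame_difference(prev, curr, threshold=10):
    """Count pixels whose absolute change from the previous frame exceeds threshold."""
    return sum(
        1
        for row_p, row_c in zip(prev, curr)
        for p, c in zip(row_p, row_c)
        if abs(p - c) > threshold
    )

frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[0, 200], [0, 0]],
    [[0, 200], [200, 0]],
    [[0, 200], [200, 0]],
]
clips = split_frames(frames, 2)
# One change count per consecutive frame pair; non-zero entries mark candidate regions.
changes = [frame_difference(a, b) for a, b in zip(frames, frames[1:])]
```

Frames with a non-zero change count are the ones worth passing to feature extraction, while static frames can be skipped.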
To improve the conversion speed, on the basis of the above embodiment, in a preferred embodiment, converting the first target information into structured data according to the predefined rule specifically includes:
converting the first target information into semi-structured data according to the file template of the unstructured data to be converted;
performing MapReduce parallel processing on the semi-structured data;
converting the semi-structured data after the MapReduce parallel processing into structured data using XML technology.
Specifically, the key information generated by classification from the unstructured content is converted and processed according to certain rules into semi-structured data. Whether the file is a text file, an image file or a video file, once the key information of its content has been extracted, structuring conversion can be carried out in the manner of this embodiment. Semi-structured data is generally stored as XML files; that is, the extracted key information (the first target information) is used to XML-ize the unstructured data, so that the unstructured data can be managed through XML. For XML-izing text files, recent versions of Microsoft Office include conversion functions or tools that readily convert Office documents into XML documents. Users can also, according to their own needs, analyze the content and structure of Word documents in the power domain, write a corresponding program, and output suitable XML documents using an XSLT suited to their purposes. In addition, some special-purpose tools can be used to convert these documents into XML documents. For XML-izing picture, video and audio files, a corresponding XML document is created to record the key information extracted from the content of the picture, video, sound, animation or other file. When these files are needed, they can be searched and filtered according to the content of the XML document and retrieved according to the key information recorded there. In other words, the relevant unstructured data can be queried by its content information. Text documents can be converted into XML documents step by step or by writing a corresponding conversion program, depending on the characteristics of the document; other types of documents are linked by storing object properties and the like in an XML document.
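The idea of recording extracted key information in an XML document and later querying files by that content can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`file`, `title`, `storageAddress`, `keyInfo`) and the sample values are hypothetical; this application does not define a schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List


def build_record(title: str, storage_address: str, file_type: str,
                 key_info: Dict[str, str]) -> ET.Element:
    """Create one XML record holding file metadata plus extracted key information."""
    record = ET.Element("file", attrib={"type": file_type})
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "storageAddress").text = storage_address
    content = ET.SubElement(record, "content")
    for name, value in key_info.items():
        ET.SubElement(content, "keyInfo", attrib={"name": name}).text = str(value)
    return record


def find_by_key_info(records: Iterable[ET.Element], name: str, value: str) -> List[str]:
    """Return storage addresses of records whose key information matches."""
    return [
        record.findtext("storageAddress")
        for record in records
        if any(k.get("name") == name and k.text == value for k in record.iter("keyInfo"))
    ]


records = [
    build_record("a.jpg", "/data/a.jpg", "image", {"scene": "substation"}),
    build_record("b.mp4", "/data/b.mp4", "video", {"scene": "office"}),
]
matches = find_by_key_info(records, "scene", "substation")
xml_text = ET.tostring(records[0], encoding="unicode")
```

Here the media file itself is never opened at query time; only the XML record is searched, and the stored address is used to retrieve the file.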
XML files have the following main characteristics. First, simplicity: an XML document has a strict format definition and looks concise overall. Second, openness: the XML standard and its documents are fully open on the Web, and anyone can freely read the specification and the tags and text used. Third, efficiency and extensibility: XML supports reuse of document fragments, and users can create and use their own tags and share them with others, so it is highly extensible. Fourth, universality: XML has a unifying character and supports most of the world's written languages. After unstructured data has been converted into XML documents, managing the unstructured data becomes a matter of managing the XML documents. The industry has relatively mature approaches and methods for managing XML data, so the management of unstructured data also becomes easier. XML data is typical semi-structured data. By establishing a mapping between XML and a relational database, and converting and processing according to certain rules, it can be converted into structured data supported by traditional databases based on the relational model.
In practical applications, however, because unstructured data comes in many types, the semi-structured XML files converted from it also come in many types, and the number of XML files grows as the data volume increases. Because XML files are semi-structured data, these factors make them unsuitable for query processing with a structured relational database. Therefore, in this embodiment, before the XML files are converted into structured data, they are processed in parallel using MapReduce. MapReduce is a distributed computing framework used in the big data platform Hadoop. It can be deployed on clusters of inexpensive PCs, and data can be distributed to each node in the cluster to achieve parallel processing, so MapReduce is used for querying the XML data. In XML, the DTD defines the elements, attributes, tags and entities of all document types, together with their interrelationships. The DTD also lays down a set of rules for the XML document structure. When converting between documents and a database, the DTD can be fully exploited to build a database structure that better matches the original document and to store as much of the document's information in the database as possible.
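The map, shuffle and reduce phases applied to XML records can be pictured with a small in-process sketch. It runs in a single Python process and does not use Hadoop; in a real deployment the framework would distribute the map and reduce calls across cluster nodes. The `<file type="...">` record shape and the counting query are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(xml_text: str) -> Iterator[Tuple[str, int]]:
    """Emit (file type, 1) for one XML record."""
    root = ET.fromstring(xml_text)
    yield root.get("type", "unknown"), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


docs = ['<file type="image"/>', '<file type="video"/>', '<file type="image"/>']
counts = run_job(docs)
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be run on separate nodes without coordination.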
The process of generating a relational structure from a DTD is described below:
Step 1: obtain all data items, and the relationship table between data items, from the DTD document. Using a suitable algorithm, all elements in the DTD document and their basic information are saved into a data structure; a data table corresponding to this structure is then created and the information is stored in the relational database. This completes the first step of converting XML data into structured data in a relational database.
Step 2: according to the data structure established above and the data relationship table, create the main table and subtables of the database. After the basic elements have been found, the corresponding relational table (reflecting the basic information of the XML document) is created; this table is called the base table. The table name is the element name, and the fields of the base table are formed from the basic elements. All main tables and subtables are created according to the object information, and the location of each corresponding element in the XML document is saved.
Step 3: according to the different meanings of the special symbols, build subtables for child elements marked with special symbols, whose number of occurrences is indeterminate. These child elements are numbered, and an upper limit on the number is set to distinguish them. Doing so, however, may bring a large amount of data into the database and waste a large amount of disk space. Therefore, when a table has few indeterminate elements, they can instead be split into separate records.
Step 4: convert the data in the XML document into the relational database. Once the database has been created, the data in element form in the XML document is converted into data in record form in the relational database. The above steps realize the conversion from XML documents to a relational database, and thereby the management of unstructured data through XML.
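The four steps above can be sketched end to end with Python's standard `re`, `sqlite3` and `xml.etree.ElementTree` modules. This is a simplified illustration under stated assumptions: the sample DTD, the table and column names, and the rule "one subtable per repeatable child element" are hypothetical, the DTD parser handles only flat comma-separated content models, and the root element is assumed to have at least one non-repeating child.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical DTD: one root element with two single children and one repeatable child.
DTD = """
<!ELEMENT file (title, address, keyword*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
"""

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)>")


def parse_dtd(dtd_text):
    """Step 1: map each declared element to its (child name, occurrence marker) list."""
    model = {}
    for name, content in ELEMENT_RE.findall(dtd_text):
        children = []
        for part in content.split(","):
            part = part.strip()
            if part and part != "#PCDATA":
                marker = part[-1] if part[-1] in "*+?" else ""
                children.append((part.rstrip("*+?"), marker))
        model[name] = children
    return model


def create_schema(conn, model, root):
    """Steps 2 and 3: main table for the root, one subtable per repeatable child."""
    single = [child for child, marker in model[root] if marker in ("", "?")]
    repeated = [child for child, marker in model[root] if marker in ("*", "+")]
    columns = ["id INTEGER PRIMARY KEY"] + [f"{child} TEXT" for child in single]
    conn.execute(f"CREATE TABLE {root} ({', '.join(columns)})")
    for child in repeated:
        conn.execute(
            f"CREATE TABLE {root}_{child} ("
            f"{root}_id INTEGER REFERENCES {root}(id), position INTEGER, value TEXT)"
        )
    return single, repeated


def load_document(conn, xml_text, root, single, repeated):
    """Step 4: turn element-form data into rows, keeping each child's position."""
    elem = ET.fromstring(xml_text)
    placeholders = ", ".join("?" for _ in single)
    cur = conn.execute(
        f"INSERT INTO {root} ({', '.join(single)}) VALUES ({placeholders})",
        [elem.findtext(child) for child in single],
    )
    for child in repeated:
        for position, node in enumerate(elem.findall(child)):
            conn.execute(
                f"INSERT INTO {root}_{child} VALUES (?, ?, ?)",
                (cur.lastrowid, position, node.text),
            )


model = parse_dtd(DTD)
conn = sqlite3.connect(":memory:")
single, repeated = create_schema(conn, model, "file")
load_document(
    conn,
    "<file><title>Report</title><address>/data/r.docx</address>"
    "<keyword>grid</keyword><keyword>fault</keyword></file>",
    "file", single, repeated,
)
```

Storing repeatable children as separate rows in a subtable, rather than as numbered columns with a fixed upper limit, corresponds to the alternative mentioned in step 3 for avoiding wasted space.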
An embodiment of a method for converting unstructured data into structured data has been described in detail above. Based on the method described in the above embodiments, an embodiment of the present invention further provides a corresponding device for converting unstructured data into structured data. Because the device embodiments correspond to the method embodiments, refer to the description of the method embodiments for the device embodiments; details are not repeated here.
Fig. 2 is a schematic diagram of the composition of a device for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 2, the device includes a first structure conversion module 201, an extraction module 202 and a second structure conversion module 203.
The first structure conversion module 201 is configured to convert target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, other than the content of the unstructured data to be converted.
The extraction module 202 is configured to extract, according to an algorithm model corresponding to the type of the unstructured data to be converted, first target information corresponding to the content of the unstructured data to be converted.
The second structure conversion module 203 is configured to convert the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
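How the three modules fit together can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the class and function names, the representation of the "predefined rule" as a key-to-column mapping, and the toy extraction model (which merely takes the first word of a text file) stand in for the algorithm models and rules that this application leaves to the embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UnstructuredFile:
    title: str
    storage_address: str
    index_info: str
    file_type: str  # e.g. "text", "image", "video"
    content: bytes


def first_structure_conversion(f: UnstructuredFile) -> Dict[str, str]:
    """Module 201: structure the information that lies outside the file content."""
    return {
        "title": f.title,
        "storage_address": f.storage_address,
        "index_info": f.index_info,
    }


class ExtractionModule:
    """Module 202: pick the algorithm model by file type and extract key information."""

    def __init__(self, models: Dict[str, Callable[[bytes], Dict[str, str]]]):
        self.models = models

    def extract(self, f: UnstructuredFile) -> Dict[str, str]:
        return self.models[f.file_type](f.content)


def second_structure_conversion(target_info: Dict[str, str],
                                rule: Dict[str, str]) -> Dict[str, str]:
    """Module 203: apply a predefined rule, here a mapping from keys to column names."""
    return {rule[key]: value for key, value in target_info.items() if key in rule}


f = UnstructuredFile("report.txt", "/data/report.txt", "idx-001", "text",
                     b"transformer fault log")
extractor = ExtractionModule({"text": lambda c: {"topic": c.decode().split()[0]}})
meta_row = first_structure_conversion(f)
content_row = second_structure_conversion(extractor.extract(f), {"topic": "main_topic"})
```

Keeping the per-type models in a dictionary means a new file type can be supported by registering another callable, without changing the two conversion modules.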
The device for converting unstructured data into structured data provided by the present invention can not only convert the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but can also extract, according to an algorithm model corresponding to the type of the unstructured data to be converted, the first target information related to the content of that unstructured data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is converted into structured data. The unstructured data to be converted can thus be structured in several respects, and unstructured data can be searched or managed using its content information, which improves the visualization and query efficiency of unstructured data and reduces the difficulty of managing it.
An embodiment of a method for converting unstructured data into structured data has been described in detail above. Based on the method described in the above embodiments, an embodiment of the present invention further provides another corresponding device for converting unstructured data into structured data. Because the device embodiments correspond to the method embodiments, refer to the description of the method embodiments for the device embodiments; details are not repeated here.
Fig. 3 is a schematic diagram of the composition of another device for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 3, the device includes a memory 301 and a processor 302.
The memory 301 is configured to store a computer program.
The processor 302 is configured to execute the computer program to implement the steps of the method for converting unstructured data into structured data provided by any one of the above embodiments.
The other device for converting unstructured data into structured data provided by the present invention can likewise not only convert the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but can also extract, according to an algorithm model corresponding to the type of the unstructured data to be converted, the first target information related to the content of that unstructured data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is converted into structured data. The unstructured data to be converted can thus be structured in several respects, and unstructured data can be searched or managed using its content information, which improves the visualization and query efficiency of unstructured data and reduces the difficulty of managing it.
The method and device for converting unstructured data into structured data provided by the present invention have been described in detail above. Several examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. For a person of ordinary skill in the art, the specific implementation and the scope of application may vary according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention; modifications, equivalent replacements, improvements and the like made to the present invention by those skilled in the art without creative effort shall fall within the scope of this application.
It should also be noted that, in this specification, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the term "includes" and similar words mean that a unit, device or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a unit, device or system.

Claims (10)

1. A method for converting unstructured data into structured data, comprising converting target information of unstructured data to be converted into structured data, wherein the target information includes at least a file title, a storage address and file index information, other than the content of the unstructured data to be converted, characterized by further comprising:
extracting, according to an algorithm model corresponding to the type of the unstructured data to be converted, first target information corresponding to the content of the unstructured data to be converted;
converting the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
2. The method for converting unstructured data into structured data according to claim 1, characterized in that, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
3. The method for converting unstructured data into structured data according to claim 2, characterized in that extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted specifically includes:
determining the prior probability of each data item in the content of the text file;
calculating the similarity of each data item in the content of the text file according to the prior probability;
determining the type or semantics of each data item in the content of the text file according to the similarity, and clustering data items of the same type or the same semantics using a clustering algorithm to obtain the first target information.
4. The method for converting unstructured data into structured data according to claim 1, characterized in that, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model.
5. The method for converting unstructured data into structured data according to claim 4, characterized in that extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted is specifically:
extracting the first target information using a radial basis function (RBF) neural network in the deep neural network model.
6. The method for converting unstructured data into structured data according to claim 5, characterized in that, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the image file to obtain a plurality of sub-images;
performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
7. The method for converting unstructured data into structured data according to claim 5, characterized in that, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically includes:
segmenting the content of the video file according to determined partitioning parameters to obtain a plurality of sub-videos, and converting each sub-video into sub-images by frame-by-frame analysis;
performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
8. The method for converting unstructured data into structured data according to any one of claims 1 to 7, characterized in that converting the first target information into structured data according to the predefined rule specifically includes:
converting the first target information into semi-structured data according to the file template corresponding to the unstructured data to be converted;
performing MapReduce parallel processing on the semi-structured data;
converting the semi-structured data after the MapReduce parallel processing into structured data using XML technology.
9. A device for converting unstructured data into structured data, comprising a first structure conversion module configured to convert target information of unstructured data to be converted into structured data, wherein the target information includes at least a file title, a storage address and file index information, other than the content of the unstructured data to be converted, characterized by further comprising:
an extraction module, configured to extract, according to an algorithm model corresponding to the type of the unstructured data to be converted, first target information corresponding to the content of the unstructured data to be converted;
a second structure conversion module, configured to convert the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
10. A device for converting unstructured data into structured data, characterized by comprising:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the steps of the method for converting unstructured data into structured data according to any one of claims 1 to 8.
CN201811289109.6A 2018-10-31 2018-10-31 Method and device for converting unstructured data into structured data Pending CN109344298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811289109.6A CN109344298A (en) 2018-10-31 2018-10-31 Method and device for converting unstructured data into structured data

Publications (1)

Publication Number Publication Date
CN109344298A true CN109344298A (en) 2019-02-15

Family

ID=65312700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811289109.6A Pending CN109344298A (en) 2018-10-31 2018-10-31 Method and device for converting unstructured data into structured data

Country Status (1)

Country Link
CN (1) CN109344298A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463661A (en) * 2017-07-31 2017-12-12 小草数语(北京)科技有限公司 The introduction method and device of data
CN108268600A (en) * 2017-12-20 2018-07-10 北京邮电大学 Unstructured Data Management and device based on AI

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李启炎等: "《全国CAD应用培训网络工程设计中心统编教材 企业商业智能教材》", 30 October 2007, 同济大学出版社 *
范春晓: "《Web数据分析关键技术及解决方案》", 30 October 2017, 北京邮电大学出版社 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134858A (en) * 2019-03-26 2019-08-16 国网重庆市电力公司 Method for transformation, system, storage medium and the electronic equipment of unstructured data
CN110321392A (en) * 2019-06-25 2019-10-11 北京海量数据技术股份有限公司 Data base management system based on sensor monitor data file
CN110866217A (en) * 2019-10-24 2020-03-06 长城计算机软件与系统有限公司 Cross report recognition method and device, storage medium and electronic equipment
CN111859863A (en) * 2020-06-03 2020-10-30 远光软件股份有限公司 Document structure conversion method and device, storage medium and electronic equipment
CN112395292A (en) * 2020-11-25 2021-02-23 电信科学技术第十研究所有限公司 Data feature extraction and matching method and device
CN112395292B (en) * 2020-11-25 2024-03-29 电信科学技术第十研究所有限公司 Data feature extraction and matching method and device
CN112966015A (en) * 2021-02-01 2021-06-15 杭州博联智能科技股份有限公司 Big data analysis processing and storage method, device, equipment and medium
CN112966015B (en) * 2021-02-01 2023-08-15 杭州博联智能科技股份有限公司 Big data analysis processing and storing method, device, equipment and medium
CN112800755A (en) * 2021-02-05 2021-05-14 北京明略软件系统有限公司 Data management method and system
CN113377950A (en) * 2021-06-02 2021-09-10 浪潮软件股份有限公司 Method for realizing flat storage and real-time preview of unstructured document
CN114003731A (en) * 2021-10-29 2022-02-01 国网河北省电力有限公司电力科学研究院 Heterogeneous data processing method, device, server and storage medium
CN115146084A (en) * 2022-07-14 2022-10-04 贵州电网有限责任公司 Method and device for acquiring equipment fault and maintenance data from unstructured data
CN115146084B (en) * 2022-07-14 2023-11-24 贵州电网有限责任公司 Method and device for acquiring equipment fault and maintenance data from unstructured data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190215
