CN109344298A - Method and device for converting unstructured data into structured data - Google Patents
- Publication number: CN109344298A
- Application number: CN201811289109.6A
- Authority: CN (China)
- Legal status: Pending (as listed by Google Patents; not a legal conclusion)
Abstract
The invention discloses a method for converting unstructured data into structured data, which can convert file titles, storage addresses, file index information and the like of unstructured data to be converted into structured data, and can extract first target information related to the content of the unstructured data in the unstructured data to be converted according to an algorithm model corresponding to the type of the unstructured data to be converted; and then the first target information is converted into structured data according to a predefined rule, and further the content of the unstructured data can be converted into the structured data. The unstructured data to be converted can be subjected to structured conversion in multiple aspects, and the unstructured data can be searched or managed by utilizing the content of the unstructured data, so that the visualization and query efficiency of the unstructured data is improved, and the management difficulty is reduced. In addition, the invention also discloses a device for converting unstructured data into structured data, and the effect is as above.
Description
Technical field
The present invention relates to the field of data type conversion, and in particular to a method and device for converting unstructured data into structured data.
Background art
Data in current use falls into three structural types. Structured data is information that can be represented with numbers or a uniform structure and stored in a database; it has a fixed structure and can be represented as a two-dimensional table. Unstructured data is information whose structure is not fixed and which cannot be represented as a two-dimensional table, such as documents, images and video. Semi-structured data is a form of data between structured and unstructured data (such as XML documents): it has structure, but that structure varies greatly.
The three types have the following characteristics. Structured data is easy to manage, efficient to query and reliable; access control can be added, and management cost is low. Structured data is usually stored in a relational database, which lets users search more conveniently and efficiently. Its greatest drawback is that it is hard to extend: it has a fixed format and template, and adding a data attribute is very difficult. Semi-structured data is becoming increasingly important, mainly because of its flexibility: it is "schema-less", the data is self-describing and carries information about its own schema, and that schema can change at any time within a database. Unstructured data has good scalability and ample flexibility, but it is very difficult to manage and query, so converting unstructured data into structured data is particularly important.
At present, conversion mainly covers the information that identifies an unstructured file, such as its file title, storage address and identifier; the resulting structured data is then used to find or manage the file. This approach is rather simple, however. After conversion, the content of the file is still unstructured data, so the visualization and management problems remain, and with this prior-art approach management and querying are still difficult.
It can be seen that a problem urgently to be solved by those skilled in the art is how to overcome the single mode of converting unstructured data into structured data, which leads to poor visualization of unstructured data and makes it difficult to query and manage.
Summary of the invention
The embodiments of the present application provide a method and device for converting unstructured data into structured data, to solve the prior-art problem that the single mode of converting unstructured data into structured data leads to poor visualization of unstructured data and makes it difficult to query and manage.
To solve the above technical problem, the present invention provides a method for converting unstructured data into structured data, comprising converting target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which are separate from the content of the unstructured data to be converted, characterized by further comprising:
Extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information corresponding to the content of the unstructured data to be converted;
Converting the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
Preferably, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
Preferably, extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted specifically comprises:
Determining the prior probability of each data item in the content of the text file;
Calculating the similarity of the data items in the content of the text file according to the prior probability;
Determining the type or semantics of each data item in the content of the text file according to the similarity, and clustering data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
Preferably, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model.
Preferably, extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted is specifically:
Extracting the first target information using the RBF (radial basis function) neural network in the deep neural network model.
Preferably, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
Segmenting the image to obtain multiple sub-images;
Performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
Preferably, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
Segmenting the content of the video file according to determined partitioning parameters to obtain multiple sub-videos;
Converting each sub-video into sub-images by frame-by-frame analysis;
Performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
Preferably, converting the first target information into structured data according to the predefined rule specifically comprises:
Converting the first target information into semi-structured data according to the file template of the unstructured data to be converted;
Performing MapReduce parallel processing on the semi-structured data;
Converting the semi-structured data, after the MapReduce parallel processing, into structured data using XML technology.
To solve the above technical problem, the present invention further provides a device for converting unstructured data into structured data, corresponding to the above method, comprising a first structure conversion module configured to convert target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which are separate from the content of the unstructured data to be converted, and further comprising:
An extraction module, configured to extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information corresponding to the content of the unstructured data to be converted;
A second structure conversion module, configured to convert the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
To solve the above technical problem, the present invention further provides another device for converting unstructured data into structured data, corresponding to the above method, comprising:
A memory, configured to store a computer program;
A processor, configured to execute the computer program to implement the steps of any of the above methods for converting unstructured data into structured data.
Compared with the prior art, the method for converting unstructured data into structured data provided by the present invention not only converts the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but also extracts, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is also converted into structured data. The unstructured data to be converted can thus be structurally converted in several respects, and it can also be searched or managed by its content information, which improves the visualization and query efficiency of unstructured data and reduces management difficulty. In addition, the present invention provides a device for converting unstructured data into structured data, with the same effects.
Brief description of the drawings
Fig. 1 is a flowchart of a method for converting unstructured data into structured data provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the composition of a device for converting unstructured data into structured data provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the composition of another device for converting unstructured data into structured data provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the protection scope of the present invention.
The core of the present invention is to provide a method and device for converting unstructured data into structured data, which can solve the problem that the single mode of converting unstructured data into structured data leads to poor visualization of unstructured data and makes it difficult to query and manage.
To enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of a method for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
S101: Convert the target information of the unstructured data to be converted into structured data, wherein the target information includes at least the file title, storage address and file index information, which are separate from the content of the unstructured data to be converted.
Specifically, the main information that identifies the unstructured data to be converted, such as its file title, storage address and index information, is converted, and the resulting structured data can then be used to search for or manage the unstructured data. Unstructured data is in practice a file, such as a picture or a video.
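As an illustration of step S101, the following minimal sketch builds a structured record from a file's metadata without reading its content. The `FileRecord` class, the `describe_file` function and the choice of extension and size as stand-ins for "file index information" are assumptions made for this example; the patent does not specify a record layout.

```python
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical structured row describing a file, independent of its content."""
    title: str            # file title (base name of the path)
    storage_address: str  # absolute path where the file is stored
    extension: str        # assumed stand-in for "file index information"
    size_bytes: int       # assumed extra index field


def describe_file(path: str) -> FileRecord:
    """Build a structured record from file metadata only."""
    abs_path = os.path.abspath(path)
    return FileRecord(
        title=os.path.basename(abs_path),
        storage_address=abs_path,
        extension=os.path.splitext(abs_path)[1].lower(),
        size_bytes=os.path.getsize(abs_path),
    )
```

A record of this kind could be inserted into a relational table as one row per file, which is one way to read "structured data" here.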
S102: Extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted.
Specifically, different types of unstructured data to be converted require different algorithm models to extract the first target information corresponding to their content. The first target information is in practice the key information in the content of the unstructured data to be converted.
S103: Convert the first target information into structured data according to a predefined rule, so that the content of the unstructured data to be converted is converted into structured data.
Once the key information in the content of the unstructured data to be converted has been extracted, the predefined rule can be used to convert the first target information into structured data, which achieves structured conversion of the content. The unstructured data can then be queried or managed by its content, which improves query efficiency, reduces management difficulty and makes visualization easier. In practice, steps S101 and S102 have no required order: step S101 may be performed first, step S102 may be performed first, or, where conditions permit, the two may be performed at the same time. In other words, the structured conversion of the content and the structured conversion of the file title, storage address, file index information and the like are independent of each other, and the present invention does not limit their execution order. The unstructured data to be converted in the embodiments of the present application mainly comprises text files, picture files and video files. The structured conversion of each is described in detail below.
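The choice of algorithm model by data type in step S102 can be sketched as a simple dispatch on the file extension. The extension sets and the returned model labels below are hypothetical and serve only to illustrate the idea; the patent states only that text files use an LDA topic model and that image and video files use a deep neural network model.

```python
from pathlib import Path

# Hypothetical extension sets; a real system would likely inspect file content as well.
TEXT_EXTENSIONS = {".txt", ".md"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
VIDEO_EXTENSIONS = {".mp4", ".avi"}


def select_model(path: str) -> str:
    """Return a label for the algorithm model that matches the file type."""
    extension = Path(path).suffix.lower()
    if extension in TEXT_EXTENSIONS:
        return "lda_topic_model"
    if extension in IMAGE_EXTENSIONS or extension in VIDEO_EXTENSIONS:
        return "deep_neural_network"
    raise ValueError(f"unsupported file type: {extension!r}")
```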
First, the structured conversion process when the unstructured data to be converted is a text file.
To ensure the coverage and accuracy of the keywords extracted from the text file content, in a preferred embodiment based on the above embodiment, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
To further improve the coverage and accuracy of keyword extraction when the LDA topic model is used for structured conversion of text file content, in a preferred embodiment based on the above embodiment, extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted specifically comprises:
Determining the prior probability of each data item in the content of the text file;
Calculating the similarity of the data items in the content of the text file according to the prior probability;
Determining the type or semantics of each data item in the content of the text file according to the similarity, and clustering data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
Step 1: extract the keyword information (the first target information) from the content of the text file. Keyword extraction, that is, extracting reliable and meaningful words or phrases from the set of text files, is a key step that affects every later step. Such words or phrases tend to have a fixed structure, a prominent topic, highly consistent semantics and a clear domain affiliation, and they are commonly used to describe category information such as domain-related topics and knowledge. Extracting the content information of a text file is therefore the most critical and most basic step in classifying the text content as a whole: without accurate and comprehensive keyword extraction, the coverage and accuracy of the knowledge structure of the whole text cannot be guaranteed.
Compared with other models, the LDA topic model extracts keyword information with higher accuracy and salience. In essence, the LDA topic model is a three-layer Bayesian probability model with word, topic and document layers. It is an unsupervised machine learning technique that can identify the main information in a large document set or corpus. First, the prior probability of each data item in the text file is obtained, and the bag-of-words method is used to treat each text as a word-frequency vector, which makes it easy to turn the text into a mathematical model. The bag-of-words method ignores the order of the data items in the content, however, so modeling on the word-frequency vectors yields a probability ranking of the data items.
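The bag-of-words step described above can be sketched as follows. This covers only the word-frequency vectors that would feed a topic model, not LDA inference itself; the function name `word_frequency_vectors` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def word_frequency_vectors(documents):
    """Turn each document into a word-frequency vector over a shared vocabulary.

    Word order is discarded, which is the defining property of bag-of-words.
    """
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors
```

For the two documents "data model data" and "model file", the shared vocabulary is `["data", "file", "model"]` and the vectors are `[2, 0, 1]` and `[0, 1, 1]`.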
Step 2: calculate the similarity of the data items in the content of the text file.
Words and phrases are the smallest units that make up a sentence or an article, and word similarity is the premise and basis for natural language processing and text knowledge mining. Word similarity provides key technical support for tasks such as matching large batches of data and letting a search engine respond to users quickly. An automatic keyword similarity calculation system can be built from a large corpus, and the similarity matching algorithm with the best strategy can then be determined.
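The patent does not name a specific similarity measure. As one common choice, the following sketch computes cosine similarity between two numeric vectors, for example word-frequency or probability vectors; treating cosine similarity as the measure is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```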
Step 3: determine the type or semantics of each data item in the content of the text file, and cluster data items of the same type or the same semantics with a clustering algorithm to obtain the first target information.
Specifically, clustering data items of the same type or the same semantics reduces the otherwise very large amount of computation. The semantic relations between data items are quantified, and a clustering algorithm (such as K-Means clustering or hierarchical clustering) is used to obtain the hierarchical relations between data items and group them further, so that the content of the text file is reduced to several classes of words that express its topics, which gives the first target information. The embodiments of the present application can thus extract the keyword information in text file content efficiently and accurately.
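K-Means, one of the clustering algorithms named above, can be sketched in a few lines. This version takes the initial centroids as an argument so that the result is deterministic; the function name `k_means`, its signature and the fixed iteration cap are assumptions made for this example.

```python
def k_means(points, centroids, iterations=10):
    """Cluster points (tuples of numbers) around the given initial centroids.

    Returns the cluster index of each point and the final centroids.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids
```

In the text-file case, each point could be a keyword's feature vector, and each resulting cluster a group of words that share a topic.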
Step 4: convert the first target information obtained from the text file into structured data.
The results obtained with the optimal keyword extraction method, the optimal keyword similarity matching algorithm and the optimal cluster analysis algorithm are combined and expressed in a formal representation, then converted and processed according to certain rules into semi-structured data. The process of converting unstructured data into structured data is described in detail below and is not repeated here.
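One way to picture the semi-structured intermediate form is an XML document that groups keywords under topics. The element names (`document`, `topic`, `keyword`) and the function `clusters_to_xml` are hypothetical; the patent mentions XML technology but does not define a schema.

```python
import xml.etree.ElementTree as ET


def clusters_to_xml(file_title, clusters):
    """Serialize keyword clusters as a small XML document (assumed layout).

    clusters maps a topic name to the list of keywords grouped under it.
    """
    root = ET.Element("document", {"title": file_title})
    for topic, keywords in clusters.items():
        topic_element = ET.SubElement(root, "topic", {"name": topic})
        for word in keywords:
            ET.SubElement(topic_element, "keyword").text = word
    return ET.tostring(root, encoding="unicode")
```

Each `keyword` element could later be flattened into a row of the form (file title, topic, keyword) in a relational table.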
Second, the structured conversion process when the unstructured data to be converted is an image file or a video file.
To improve the accuracy of extracting key information from the file content, in a preferred embodiment based on the above embodiment, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model. To improve the efficiency of extracting key information when the deep neural network model is used for structured conversion of image or video content, in a preferred embodiment, extracting, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information corresponding to the content of the unstructured data to be converted is specifically:
Extracting the first target information using the RBF (radial basis function) neural network in the deep neural network model.
The deep neural network model is a newer kind of artificial neural network characterized by local receptive fields, a hierarchical structure, and global training that combines feature extraction with classification; it is widely used in image recognition. Several filter layers that extract features at different scales are constructed in the algorithm model, and these models are applied to image recognition problems. Because a deep neural network extracts features hierarchically through local receptive fields, appropriately increasing the number of receptive fields improves the quantity and quality of the features each layer can extract, which improves the recognition capability of the model and gives it better robustness.
To improve the efficiency of key-information extraction, the embodiments of the present application restrict the domain of each frame image, which narrows the semantic gap between low-level features and high-level concepts. A support vector machine is therefore used for model learning, with a Gaussian radial basis function as the kernel, which yields a radial basis function classifier. Because the RBF kernel maps samples into a higher-dimensional space, it can handle non-linear relationships between image labels and features. It is a strongly local kernel function, highly flexible, and the most widely used kernel. Parameter tuning must be considered when using the RBF kernel: good parameters let the classifier predict unknown data correctly and achieve high training accuracy, that is, a high rate at which the classifier predicts class labels correctly.
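The Gaussian RBF kernel mentioned above is commonly written as K(x, y) = exp(-gamma * ||x - y||^2). The following minimal sketch evaluates it for two feature vectors; the parameter name `gamma` and its default value are assumptions made for this example, and `gamma` is the kind of parameter whose tuning the text refers to.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian radial basis function kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart, which is why it is described as strongly local: a larger `gamma` makes the decay faster.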
The method for converting unstructured data into structured data provided by the present invention not only converts the file title, storage address, file index information and the like of the unstructured data to be converted into structured data, but also extracts, according to the algorithm model corresponding to the type of the unstructured data to be converted, first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is also converted into structured data. The unstructured data to be converted can thus be structurally converted in several respects, and it can also be searched or managed by its content information, which improves the visualization and query efficiency of unstructured data and reduces management difficulty.
To further improve the speed of extracting key information from image file content, in a preferred embodiment based on the above embodiment, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
Segmenting the image file to obtain multiple sub-images;
Performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
Specifically, the image file is divided into multiple sub-images according to partitioning parameters, and feature extraction is then performed on each sub-image, that is, key information is extracted from each sub-image. Finally, the key information extracted from the sub-images is clustered to obtain the second target information, which is used as the key information (the first target information) of the image file content. The basic task of feature selection and extraction in image feature extraction is to select the most effective features from many candidate features. An effective image feature is a mapping extracted from the color image that represents the image itself and distinguishes it from the features of other images. Feature extraction yields features that are a more effective basis for recognition and reduces the dimensionality of the metric space, so that image recognition is carried out in a low-dimensional feature space, which greatly improves recognition quality.
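The segmentation of an image into sub-images can be sketched as tiling a pixel grid. The image is represented here as a list of rows, and the tile height and width play the role of the partitioning parameters; the function name `split_image` and the fixed-size tiling scheme are assumptions made for this example, since the patent does not say how the image is divided.

```python
def split_image(pixels, tile_height, tile_width):
    """Split a 2-D pixel grid (list of rows) into tiles, row by row.

    Tiles on the right and bottom edges may be smaller than the requested size.
    """
    tiles = []
    for top in range(0, len(pixels), tile_height):
        for left in range(0, len(pixels[0]), tile_width):
            tile = [
                row[left:left + tile_width]
                for row in pixels[top:top + tile_height]
            ]
            tiles.append(tile)
    return tiles
```

Each tile would then be passed to the feature extractor, and the extracted features clustered as described above.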
In a preferred embodiment based on the above embodiment, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
Segmenting the content of the video file according to determined partitioning parameters to obtain multiple sub-videos;
Converting each sub-video into sub-images by frame-by-frame analysis;
Performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
Specifically, the video file is first divided into multiple sub-videos; the sub-videos are then converted into sub-images, that is, image files, by frame-by-frame analysis; finally, the third target information is extracted from the sub-images and used as the first target information. In other words, a video file can be processed in the same way as an image file.
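The splitting and frame-by-frame conversion can be sketched as follows. The sketch assumes the video has already been decoded into an in-memory list of frames (decoding itself would need a video library that is not shown), and it uses a single hypothetical segmentation parameter, `frames_per_segment`.

```python
def split_video(frames, frames_per_segment):
    """Split a decoded frame sequence into consecutive sub-videos."""
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    return [
        frames[i:i + frames_per_segment]
        for i in range(0, len(frames), frames_per_segment)
    ]


def sub_videos_to_sub_images(sub_videos):
    """Frame-by-frame pass: each frame becomes one sub-image tagged with its origin."""
    sub_images = []
    for segment_index, segment in enumerate(sub_videos):
        for frame_index, frame in enumerate(segment):
            sub_images.append(
                {"segment": segment_index, "frame": frame_index, "pixels": frame}
            )
    return sub_images


# Seven hypothetical 1x1 frames, split into sub-videos of at most three frames.
frames = [[[i]] for i in range(7)]
sub_videos = split_video(frames, frames_per_segment=3)
sub_images = sub_videos_to_sub_images(sub_videos)
print([len(s) for s in sub_videos], len(sub_images))
```

Each resulting sub-image can then be passed through the same feature extraction and clustering used for image files.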
Video files are not only the data with the largest storage footprint but also the most typical heterogeneous big data. A video file corresponds to different data types at different processing stages: unstructured data (video, images), semi-structured data (features), and structured data (feature vectors, descriptive attributes). Video and image data are processed by first converting the unstructured data into semi-structured data, then performing statistical and association analysis, and finally converting the result into structured data stored in a database.
Video data labels help improve the accuracy and stability of the content and descriptions extracted from a video file, which makes the video content analysis algorithm more targeted. In principle, the more detailed the structured description of the video content the better, but this places very demanding requirements on video attribute labeling; the scenes of a video file are therefore divided according to color, scene, time, and so on. Video file data contains a large amount of unstructured data, and video content mining works by decoding the video file and then analyzing it frame by frame. The video file is first split according to parameters such as frame count and quantity. The segmentation parameters are the premise and basis for the accuracy and reliability of video file processing. Each sub-video file is given attribute labels, and the quality of the attribute labels directly affects how comprehensive the structured description of the video content is.
Frame-by-frame analysis of a video file is, in essence, the extraction of information from every frame. The parts of a frame that differ noticeably from the frame at the previous moment are the main objects of detection; the potential difference regions are determined by background modeling and target segmentation algorithms. To speed up the convergence of the training model, preprocessing is generally performed before image recognition, including removing noise, reducing the dimensionality of the input data, and deleting irrelevant data.
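To illustrate how the region that differs from the previous frame might be located, the sketch below uses simple frame differencing with a fixed threshold. This is a deliberately simplified stand-in for the background modeling and target segmentation algorithms mentioned above, not the method of the application; the grayscale list-of-rows representation and the `threshold` value are assumptions.

```python
def changed_region(prev_frame, curr_frame, threshold=30):
    """Return the bounding box (top, left, bottom, right) of the pixels whose
    absolute grayscale difference exceeds the threshold, or None if no pixel does."""
    rows, cols = [], []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(p - q) > threshold:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Two hypothetical 4x4 grayscale frames; an object appears in column 2.
previous = [[0] * 4 for _ in range(4)]
current = [row[:] for row in previous]
current[1][2] = 200
current[2][2] = 200
print(changed_region(previous, current))
```

Only the returned region would then need to be passed on to feature extraction, which also reduces the amount of irrelevant data the model has to process.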
To improve conversion speed, on the basis of the above embodiments, as a preferred embodiment, converting the first target information into structured data according to the predefined rule specifically includes:
converting the first target information into semi-structured data according to the file template of the unstructured data to be converted;
performing MapReduce parallel processing on the semi-structured data;
converting the semi-structured data after MapReduce parallel processing into structured data using XML technology.
Specifically, the key information generated by classifying the unstructured content is converted and processed according to certain rules into semi-structured data. Whether the source is a text file, an image file, or a video file, once the key information of the file content has been extracted, structured conversion can be carried out in the manner of this embodiment. Semi-structured data is generally stored as XML files; that is, the extracted key information (the first target information) is used to XML-ize the unstructured data, so that the unstructured data can be managed using XML. For the XML-ization of text files, recent versions of Microsoft Office include built-in conversion functions or tools that readily convert Office documents into XML documents. Users can also, according to their own needs, analyze the content and structure of Word documents in the power domain, write corresponding programs, and output suitable XML documents using an XSLT suited to their needs. In addition, some dedicated tools can be used to convert these documents into XML documents.
For the XML-ization of picture, video, and audio files, a corresponding XML document is created to record the key information extracted from the content of the picture, video, sound, or animation file. When these files are needed, they can be searched and filtered according to the content of the XML documents and retrieved according to the key information recorded there. In other words, unstructured data can be queried by its content information. For example, text documents can be converted into XML documents, depending on the characteristics of the document, either by step-by-step conversion or by writing a corresponding conversion program; other types of documents are linked to by storing object attributes and the like in XML documents.
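As a minimal sketch of recording extracted key information in an XML document and then searching by content, the following code uses Python's standard `xml.etree.ElementTree` module. The element and attribute names (`file`, `keyInfo`, `path`, `type`, `name`) and the sample file paths are hypothetical; the application does not define an XML layout.

```python
import xml.etree.ElementTree as ET


def build_record(file_path, file_type, key_info):
    """Create one XML record holding the key information of one file."""
    root = ET.Element("file", {"path": file_path, "type": file_type})
    for name, value in key_info.items():
        item = ET.SubElement(root, "keyInfo", {"name": name})
        item.text = str(value)
    return root


def find_files(records, name, value):
    """Return the paths of all records whose keyInfo `name` equals `value`."""
    hits = []
    for record in records:
        for item in record.findall("keyInfo"):
            if item.get("name") == name and item.text == value:
                hits.append(record.get("path"))
                break
    return hits


records = [
    build_record("videos/site_a.mp4", "video", {"scene": "substation", "color": "gray"}),
    build_record("images/meter_01.png", "image", {"scene": "meter"}),
]
print(ET.tostring(records[0], encoding="unicode"))
print(find_files(records, "scene", "meter"))
```

The multimedia file itself is never opened during the search; only the XML record that links to it is consulted, which is the point of managing unstructured data through XML.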
XML files have the following main characteristics. First, simplicity: an XML document has a strictly defined format and looks concise overall. Second, openness: the XML standard itself and XML documents are completely open on the Web, and anyone can freely read the specification, the tags used, and the text. Third, efficiency and extensibility: XML supports the reuse of document fragments, and users can create and use their own tags and share them with others, so it is highly extensible. Fourth, high universality: XML provides uniformity and supports most of the world's written languages. After unstructured data has been converted into XML documents through XML structuring, managing the unstructured data becomes a matter of managing XML documents. The industry has relatively mature approaches and methods for managing XML data, so managing unstructured data becomes easier as well. XML data is typical semi-structured data; by establishing a mapping between XML and a relational database and converting and processing the data according to certain rules, it can be converted into structured data supported by traditional databases based on the relational model.
In practice, however, because unstructured data comes in many types, the semi-structured XML files it is converted into also come in many types, and the number of XML files grows as the data volume increases. Because XML files are semi-structured data, these factors make XML files ill-suited to query processing with a structured relational database. Therefore, in this embodiment of the application, the XML files are processed in parallel with MapReduce before they are converted into structured data. MapReduce is a distributed computing framework used in the Hadoop big data platform; it can be deployed on clusters of inexpensive PCs, and data can be distributed to the nodes of the cluster to be processed in parallel, so MapReduce is used for XML data queries. In XML, a DTD defines the element list, attributes, tags, and entities of a type of document, together with their interrelationships. The DTD also lays down a set of rules for the structure of the XML document. When converting between documents and a database, the DTD can be fully exploited to build a database structure that better matches the original document and to store as much of the document's information in the database as possible.
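The map, shuffle, and reduce phases applied to XML files can be sketched in a single process as follows. This is not Hadoop code: a thread pool stands in for the cluster nodes, and the job (building an index from each key-information value to the files that contain it) is an assumed example of an XML data query. The `file` / `keyInfo` layout of the input documents is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(xml_text):
    """Map: emit (key information value, file path) pairs for one XML document."""
    root = ET.fromstring(xml_text)
    return [(item.text, root.get("path")) for item in root.iter("keyInfo")]


def shuffle(mapped):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group into a sorted list of file paths."""
    return {key: sorted(values) for key, values in groups.items()}


def build_index(xml_documents, workers=4):
    # Each document is mapped independently, so the map phase can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, xml_documents))
    return reduce_phase(shuffle(mapped))


documents = [
    '<file path="a.mp4"><keyInfo name="scene">substation</keyInfo></file>',
    '<file path="b.png"><keyInfo name="scene">substation</keyInfo>'
    '<keyInfo name="color">gray</keyInfo></file>',
]
print(build_index(documents))
```

Because the map function looks at one document at a time and the reduce function at one key at a time, the same two functions could in principle be distributed across cluster nodes without change.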
The process of generating a relational structure from a DTD is illustrated below:
Step 1: obtain all data items and the data relationship table between the data items from the DTD document. With the relevant algorithm, all elements in the DTD document and their basic information can be saved in a data structure; a data table corresponding to this structure is then created and the information is stored in the relational database. This completes the first step of converting XML data into structured data in a relational database.
Step 2: build the main tables and sub-tables of the database from the data relationship table, based on the data structure established above. After the basic elements have been found, the corresponding relational table, which reflects the basic information of the XML document, is created; this table is called the base table. The table name is the element name, and the fields of the base table are taken from the basic elements. According to the object information, all main tables and sub-tables are created, and the position of each corresponding element in the XML document is saved.
Step 3: build the indeterminate sub-tables for child elements marked with special symbols, according to the different meanings of the special symbols. These elements are numbered, and an upper limit on the number is set in order to distinguish them. Doing so, however, may bring a large amount of data into the database and waste a large amount of disk space. Therefore, when a table has few indeterminate elements, they can be split into separate records for storage.
Step 4: convert the data in the XML document into the relational database. Once the database has been built, the data held in element form in the XML document is converted into data held in record form in the relational database. The above steps realize the conversion from XML documents to a relational database, and thereby the management of unstructured data through XML.
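The four steps above can be sketched with Python's standard `re`, `sqlite3`, and `xml.etree.ElementTree` modules. The sketch makes several assumptions that are not stated in the application: the "special symbols" are read as the DTD occurrence indicators `?`, `*`, and `+`; only flat `<!ELEMENT name (a, b*)>` declarations are handled; element names are assumed to be safe SQL identifiers; and the sample DTD and document are invented for illustration.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")


def parse_dtd(dtd_text):
    """Step 1: collect every element and its (child, occurrence indicator) pairs."""
    model = {}
    for name, content in ELEMENT_DECL.findall(dtd_text):
        children = []
        for part in (p.strip() for p in content.split(",")):
            if part and part != "#PCDATA":
                child = part.rstrip("?*+")
                children.append((child, part[len(child):]))
        model[name] = children
    return model


def create_tables(conn, model, root):
    """Steps 2 and 3: a base table for the root, a sub-table per repeatable child."""
    single = [c for c, occ in model[root] if occ in ("", "?")]
    columns = "".join(f", {c} TEXT" for c in single)
    conn.execute(f"CREATE TABLE {root} (id INTEGER PRIMARY KEY{columns})")
    for child, occ in model[root]:
        if occ in ("*", "+"):
            conn.execute(
                f"CREATE TABLE {child} (parent_id INTEGER, position INTEGER, value TEXT)"
            )


def load_document(conn, model, root, xml_text):
    """Step 4: turn element-form data into record-form data."""
    doc = ET.fromstring(xml_text)
    single = [c for c, occ in model[root] if occ in ("", "?")]
    if single:
        marks = ", ".join("?" for _ in single)
        cur = conn.execute(
            f"INSERT INTO {root} ({', '.join(single)}) VALUES ({marks})",
            [doc.findtext(c) for c in single],
        )
    else:
        cur = conn.execute(f"INSERT INTO {root} DEFAULT VALUES")
    row_id = cur.lastrowid
    for child, occ in model[root]:
        if occ in ("*", "+"):
            # The position column keeps each element's order within the document.
            for position, node in enumerate(doc.findall(child)):
                conn.execute(
                    f"INSERT INTO {child} VALUES (?, ?, ?)", (row_id, position, node.text)
                )
    return row_id


DTD = """
<!ELEMENT file (title, keyInfo*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT keyInfo (#PCDATA)>
"""
XML = "<file><title>site_a.mp4</title><keyInfo>substation</keyInfo><keyInfo>gray</keyInfo></file>"

conn = sqlite3.connect(":memory:")
model = parse_dtd(DTD)
create_tables(conn, model, "file")
load_document(conn, model, "file", XML)
print(conn.execute("SELECT position, value FROM keyInfo ORDER BY position").fetchall())
```

Storing repeatable children in their own sub-table, one row per occurrence, avoids reserving a fixed number of columns up to an upper limit, which is the disk-waste concern raised in step 3.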
The embodiments of a method for converting unstructured data into structured data have been described in detail above. Based on the method described in the above embodiments, an embodiment of the present invention also provides a device for converting unstructured data into structured data that corresponds to the method. Because the device embodiments correspond to the method embodiments, please refer to the description of the method embodiments for details of the device embodiments, which are not repeated here.
Fig. 2 is a schematic diagram of the composition of a device for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 2, the device includes a first structure conversion module 201, an extraction module 202, and a second structure conversion module 203.
The first structure conversion module 201 is configured to convert the target information of the unstructured data to be converted into structured data, where the target information includes at least the file title, the storage address, and the file index information, apart from the content of the unstructured data to be converted.
The extraction module 202 is configured to extract the first target information corresponding to the content of the unstructured data to be converted, according to the algorithm model corresponding to the type of the unstructured data to be converted;
The second structure conversion module 203 is configured to convert the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
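The division of labor between the three modules can be sketched as a small Python class. The class name, method names, and the pluggable `extractors` and `rule` arguments are hypothetical; in particular, the word-frequency extractor used in the demonstration is only a placeholder for the LDA topic model, and no image or video extractor is registered.

```python
from collections import Counter


class ConversionDevice:
    def __init__(self, extractors, rule):
        self.extractors = extractors  # file type -> callable(content) -> first target information
        self.rule = rule              # callable(first target information) -> structured record

    def convert_metadata(self, title, storage_address, file_index):
        """Module 201: file title, storage address and index become a structured record."""
        return {"title": title, "storage_address": storage_address, "file_index": file_index}

    def extract(self, file_type, content):
        """Module 202: choose the algorithm model according to the file type."""
        if file_type not in self.extractors:
            raise ValueError(f"no algorithm model registered for type {file_type!r}")
        return self.extractors[file_type](content)

    def convert_content(self, first_target_info):
        """Module 203: apply the predefined rule to the extracted information."""
        return self.rule(first_target_info)


def top_words(text, n=2):
    """Placeholder text extractor (not LDA): the n most frequent words."""
    return [word for word, _ in Counter(text.lower().split()).most_common(n)]


device = ConversionDevice(
    extractors={"text": top_words},
    rule=lambda info: {"keywords": ",".join(info)},
)
metadata = device.convert_metadata("report.txt", "/data/report.txt", 1)
info = device.extract("text", "relay fault relay alarm relay fault")
print(metadata, device.convert_content(info))
```

Keeping the extractors in a lookup table keyed by file type mirrors the idea that the algorithm model is selected by the type of the unstructured data to be converted.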
The device for converting unstructured data into structured data provided by the present invention can not only convert the file title, storage address, file index information, and the like of the unstructured data to be converted into structured data, but can also extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is converted into structured data. The unstructured data to be converted can thus be structurally converted from multiple aspects and can also be searched or managed by its content information, which improves the visualization and search efficiency of unstructured data and reduces management difficulty.
The embodiments of a method for converting unstructured data into structured data have been described in detail above. Based on the method described in the above embodiments, an embodiment of the present invention also provides another device for converting unstructured data into structured data that corresponds to the method. Because the device embodiments correspond to the method embodiments, please refer to the description of the method embodiments for details of the device embodiments, which are not repeated here.
Fig. 3 is a schematic diagram of the composition of another device for converting unstructured data into structured data provided by an embodiment of the present invention. As shown in Fig. 3, the device includes a memory 301 and a processor 302.
The memory 301 is configured to store a computer program;
The processor 302 is configured to execute the computer program to implement the steps of converting unstructured data into structured data provided by any one of the above embodiments.
The other device for converting unstructured data into structured data provided by the present invention can likewise not only convert the file title, storage address, file index information, and the like of the unstructured data to be converted into structured data, but can also extract, according to the algorithm model corresponding to the type of the unstructured data to be converted, the first target information related to the content of that data. The first target information is then converted into structured data according to a predefined rule, so that the content of the unstructured data is converted into structured data. The unstructured data to be converted can thus be structurally converted from multiple aspects and can also be searched or managed by its content information, which improves the visualization and search efficiency of unstructured data and reduces management difficulty.
The method and device for converting unstructured data into structured data provided by the present invention have been described in detail above. Several examples are used herein to explain the principle and implementation of the present invention; the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. For those of ordinary skill in the art, the specific implementation and the scope of application may vary in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention; modifications, equivalent replacements, improvements, and the like made to the present invention by those skilled in the art without creative effort shall fall within the scope of this application.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the term "includes" and similar words are non-exclusive, so that a unit, device, or system that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements that are inherent to the unit, device, or system.
Claims (10)
1. A method for converting unstructured data into structured data, comprising converting target information of unstructured data to be converted into structured data, wherein the target information includes at least a file title, a storage address, and file index information, apart from the content of the unstructured data to be converted, characterized in that the method further comprises:
extracting first target information corresponding to the content of the unstructured data to be converted according to an algorithm model corresponding to the type of the unstructured data to be converted;
converting the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
2. The method for converting unstructured data into structured data according to claim 1, characterized in that, when the type of the unstructured data to be converted is a text file, the algorithm model is specifically an LDA topic model.
3. The method for converting unstructured data into structured data according to claim 2, characterized in that extracting the first target information corresponding to the content of the unstructured data to be converted according to the algorithm model corresponding to the type of the unstructured data to be converted specifically comprises:
determining a prior probability of each piece of data in the content of the text file;
calculating the similarity of each piece of data in the content of the text file according to the prior probability;
determining the type or semantics of each piece of data in the content of the text file according to the similarity, and clustering data of the same type or the same semantics using a clustering algorithm to obtain the first target information.
4. The method for converting unstructured data into structured data according to claim 1, characterized in that, when the type of the unstructured data to be converted is an image file or a video file, the algorithm model is specifically a deep neural network model.
5. The method for converting unstructured data into structured data according to claim 4, characterized in that extracting the first target information corresponding to the content of the unstructured data to be converted according to the algorithm model corresponding to the type of the unstructured data to be converted is specifically:
extracting the first target information using the RBF (radial basis function) neural network in the deep neural network model.
6. The method for converting unstructured data into structured data according to claim 5, characterized in that, when the type of the unstructured data to be converted is an image file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
segmenting the image file to obtain multiple sub-images;
performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain second target information, and using the second target information as the first target information.
7. The method for converting unstructured data into structured data according to claim 5, characterized in that, when the type of the unstructured data to be converted is a video file, extracting the first target information using the RBF neural network in the deep neural network model specifically comprises:
splitting the content of the video file into multiple sub-videos according to the determined segmentation parameters;
converting each sub-video into sub-images by frame-by-frame analysis;
performing feature extraction on each sub-image using the RBF neural network, clustering the extracted features to obtain third target information, and using the third target information as the first target information.
8. The method for converting unstructured data into structured data according to any one of claims 1 to 7, characterized in that converting the first target information into structured data according to the predefined rule specifically comprises:
converting the first target information into semi-structured data according to the file template of the unstructured data to be converted;
performing MapReduce parallel processing on the semi-structured data;
converting the semi-structured data after the MapReduce parallel processing into structured data using XML technology.
9. A device for converting unstructured data into structured data, comprising a first structure conversion module configured to convert target information of unstructured data to be converted into structured data, wherein the target information includes at least a file title, a storage address, and file index information, apart from the content of the unstructured data to be converted, characterized in that the device further comprises:
an extraction module, configured to extract first target information corresponding to the content of the unstructured data to be converted according to an algorithm model corresponding to the type of the unstructured data to be converted;
a second structure conversion module, configured to convert the first target information into structured data according to a predefined rule, so as to convert the content of the unstructured data to be converted into structured data.
10. A device for converting unstructured data into structured data, characterized by comprising:
a memory, configured to store a computer program;
a processor, configured to execute the computer program to implement the steps of the method for converting unstructured data into structured data according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811289109.6A CN109344298A (en) | 2018-10-31 | 2018-10-31 | Method and device for converting unstructured data into structured data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109344298A true CN109344298A (en) | 2019-02-15 |
Family
ID=65312700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811289109.6A Pending CN109344298A (en) | 2018-10-31 | 2018-10-31 | Method and device for converting unstructured data into structured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344298A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107463661A (en) * | 2017-07-31 | 2017-12-12 | 小草数语(北京)科技有限公司 | The introduction method and device of data |
CN108268600A (en) * | 2017-12-20 | 2018-07-10 | 北京邮电大学 | Unstructured Data Management and device based on AI |
Non-Patent Citations (2)
Title |
---|
李启炎等: "《全国CAD应用培训网络工程设计中心统编教材 企业商业智能教材》", 30 October 2007, 同济大学出版社 * |
范春晓: "《Web数据分析关键技术及解决方案》", 30 October 2017, 北京邮电大学出版社 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134858A (en) * | 2019-03-26 | 2019-08-16 | 国网重庆市电力公司 | Method for transformation, system, storage medium and the electronic equipment of unstructured data |
CN110321392A (en) * | 2019-06-25 | 2019-10-11 | 北京海量数据技术股份有限公司 | Data base management system based on sensor monitor data file |
CN110866217A (en) * | 2019-10-24 | 2020-03-06 | 长城计算机软件与系统有限公司 | Cross report recognition method and device, storage medium and electronic equipment |
CN111859863A (en) * | 2020-06-03 | 2020-10-30 | 远光软件股份有限公司 | Document structure conversion method and device, storage medium and electronic equipment |
CN112395292A (en) * | 2020-11-25 | 2021-02-23 | 电信科学技术第十研究所有限公司 | Data feature extraction and matching method and device |
CN112395292B (en) * | 2020-11-25 | 2024-03-29 | 电信科学技术第十研究所有限公司 | Data feature extraction and matching method and device |
CN112966015A (en) * | 2021-02-01 | 2021-06-15 | 杭州博联智能科技股份有限公司 | Big data analysis processing and storage method, device, equipment and medium |
CN112966015B (en) * | 2021-02-01 | 2023-08-15 | 杭州博联智能科技股份有限公司 | Big data analysis processing and storing method, device, equipment and medium |
CN112800755A (en) * | 2021-02-05 | 2021-05-14 | 北京明略软件系统有限公司 | Data management method and system |
CN113377950A (en) * | 2021-06-02 | 2021-09-10 | 浪潮软件股份有限公司 | Method for realizing flat storage and real-time preview of unstructured document |
CN114003731A (en) * | 2021-10-29 | 2022-02-01 | 国网河北省电力有限公司电力科学研究院 | Heterogeneous data processing method, device, server and storage medium |
CN115146084A (en) * | 2022-07-14 | 2022-10-04 | 贵州电网有限责任公司 | Method and device for acquiring equipment fault and maintenance data from unstructured data |
CN115146084B (en) * | 2022-07-14 | 2023-11-24 | 贵州电网有限责任公司 | Method and device for acquiring equipment fault and maintenance data from unstructured data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344298A (en) | Method and device for converting unstructured data into structured data | |
Strezoski et al. | Omniart: a large-scale artistic benchmark | |
CN104933164B (en) | In internet mass data name entity between relationship extracting method and its system | |
US8868609B2 (en) | Tagging method and apparatus based on structured data set | |
Van Ham et al. | Mapping text with phrase nets | |
CN112131449A (en) | Implementation method of cultural resource cascade query interface based on elastic search | |
CN110489565B (en) | Method and system for designing object root type in domain knowledge graph body | |
CN112434168B (en) | Knowledge graph construction method and fragmented knowledge generation method based on library | |
CN110674297B (en) | Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment | |
CN116975615A (en) | Task prediction method and device based on video multi-mode information | |
Madan et al. | Synthetically trained icon proposals for parsing and summarizing infographics | |
CN112000929A (en) | Cross-platform data analysis method, system, equipment and readable storage medium | |
CN109271624A (en) | A kind of target word determines method, apparatus and storage medium | |
Yao | Key frame extraction method of music and dance video based on multicore learning feature fusion | |
CN110309355A (en) | Generation method, device, equipment and the storage medium of content tab | |
Girdhar et al. | STRAS: A Semantic Textual-Cues Leveraged Rule-Based Approach for Article Separation in Historical Newspapers | |
Feng et al. | Multiple style exploration for story unit segmentation of broadcast news video | |
CN113076468B (en) | Nested event extraction method based on field pre-training | |
CN111046934B (en) | SWIFT message soft clause recognition method and device | |
CN115168609A (en) | Text matching method and device, computer equipment and storage medium | |
Pu et al. | A vision-based approach for deep web form extraction | |
Chaudhary et al. | A survey on image enhancement techniques using aesthetic community | |
Seenivasan | ETL in a World of Unstructured Data: Advanced Techniques for Data Integration | |
Cuconato | Epistemic logic for metadata modelling from scientific papers on COVID-19 | |
ElGindy et al. | Capturing place semantics on the geosocial web |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190215 |