CN108228664B - Unstructured data processing method and device - Google Patents

Unstructured data processing method and device

Info

Publication number
CN108228664B
Authority
CN
China
Prior art keywords
data
unstructured data
processing
unstructured
calling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611197679.3A
Other languages
Chinese (zh)
Other versions
CN108228664A (en)
Inventor
陈毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Shanghai Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Shanghai Co Ltd
Priority to CN201611197679.3A
Publication of CN108228664A
Application granted
Publication of CN108228664B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unstructured data processing method and device, wherein the method comprises the following steps: processing the acquired unstructured data to acquire data in a target format; identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result; and generating a corresponding index for the structured data by calling an index generating tool of a preset server. The unstructured data processing method and the unstructured data processing device can realize unified access to various complex unstructured data sources, enhance the processing method of data entities in the unstructured data sources, ensure the effectiveness and efficiency of data reading, realize hot-pluggable data processing, improve the flexibility of system configuration, and provide efficiency and effectiveness guarantee for processing large-scale unstructured data sources by providing a fault-tolerant mechanism.

Description

Unstructured data processing method and device
Technical Field
The present invention relates to the field of data service technologies, and in particular, to a method and an apparatus for processing unstructured data.
Background
Currently, the main data processing methods are application programming interfaces (APIs), the data warehouse technology ETL (extract, transform, load), database (DB) data interfaces, and message queues (MQ). Specifically, the API approach develops an interface program between applications, extracts original data according to business logic over a communication protocol (such as SOAP or HTTP), and writes the extracted data into a target database; ETL establishes a data channel between a data source and a data target using existing, mature tools, and imports data from a source library into a target library through a data engine; the DB interface sits between relational databases, configures the database connection, reads original data according to database table entries, and inserts the results into a target database; a message queue encapsulates data in messages and sends the message data to the target through a queue.
However, each of these methods has disadvantages when extracting data. Because the API approach is hard-coded, the code is inflexible and tightly coupled, changes are costly, and handling resources such as rich text objects and XML/HTTP is complex. Existing ETL tools have a high barrier to development and use, high development complexity, and limited development flexibility, and they support resources such as rich text objects and XML/HTTP poorly. The DB interface is implemented at the bottom layer of the database, so it extends poorly to complex problems, is mainly limited to use within the same database, handles heterogeneous databases with difficulty, and supports resources such as rich text objects and XML/HTTP poorly. Message queues process data asynchronously, so they cannot satisfy high real-time requirements, and their support for resources such as rich text objects and XML/HTTP is also poor.
Therefore, existing data processing methods have poor compatibility and flexibility when extracting data from different types of data sources.
Disclosure of Invention
To address the poor compatibility and flexibility of existing data processing methods when extracting data from different types of data sources, the invention provides the following technical solution:
One aspect of the present invention provides an unstructured data processing method, including:
processing the acquired unstructured data to acquire data in a target format;
identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
and generating a corresponding index for the structured data by calling an index generating tool of a preset server.
Optionally, the processing the acquired unstructured data to acquire data in a target format includes:
processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.
Optionally, the method further comprises:
and uniformly reading the data in the target formats, and storing the read data locally.
Optionally, the preset server comprises an enterprise-level search application server SOLR.
Optionally, the method further comprises:
and processing each unstructured data in the acquired multiple types of unstructured data by adopting a synchronous or asynchronous remote procedure call protocol (RPC) calling method.
Optionally, the processing the acquired unstructured data further includes:
and if the file of the unstructured data is judged to be damaged or unreadable, repeating the operation of processing the acquired unstructured data after a preset time.
Optionally, the method further comprises:
after the unstructured data are obtained, fault tolerance processing is carried out on the obtained unstructured data according to a constructed fault tolerance library;
the fault-tolerant library comprises all known processing rules and methods of unstructured data.
In another aspect, the present invention further provides an unstructured data processing apparatus, comprising:
the data processing unit is used for processing the acquired unstructured data to acquire data in a target format;
the content identification unit is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
and the index generating unit is used for generating a corresponding index for the structured data by calling an index generating tool of a preset server.
Optionally, the data processing unit is specifically configured to process each of the obtained multiple types of unstructured data to obtain data in the target format corresponding to each of the unstructured data.
Optionally, the apparatus further comprises:
and the data storage unit is used for uniformly reading the data in the target formats and storing the read data locally.
According to the unstructured data processing method and device, the acquired unstructured data are processed to acquire data in a target format for content identification, the content of the data in the target format is identified, the data in the target format is converted into structured data according to an identification result, and then a corresponding index is generated for the structured data by calling an index generation tool of a preset server, so that unified access to various complex unstructured data sources can be achieved, and the efficiency and effectiveness of processing large-scale unstructured data can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow diagram illustrating an unstructured data processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of an unstructured data processing method according to another embodiment of the invention;
FIG. 3 is a flow chart illustrating an unstructured data acquisition method according to an embodiment of the invention;
FIG. 4 is a flowchart illustrating a fault tolerance and retry method for an unstructured data acquisition process according to an embodiment of the present invention;
FIG. 5 is a block diagram of an unstructured data processing apparatus according to an embodiment of the invention;
FIG. 6 is a schematic structural diagram of an electronic device implementing the unstructured data processing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an unstructured data processing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
S1: processing the acquired unstructured data to acquire data in a target format;
Specifically, the processor processes the acquired unstructured data, i.e., converts the format of the unstructured data, to acquire data in a target format that can be used for content recognition.
Further, as an optional implementation manner of this embodiment, step S1 may include:
processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.
Further, as an optional implementation manner of this embodiment, after the acquiring the data in the target format corresponding to each piece of unstructured data, the method may further include:
and uniformly reading the data in the target formats, and storing the read data locally.
For example, the processor may process multiple types of unstructured data sources (e.g., Word, Domino, PDF, Excel) at the same time and convert them into a uniform target format (e.g., XML) on which content identification can be performed, thereby masking the differences between the source types during acquisition, i.e., achieving low coupling in the acquisition of multiple types of unstructured data sources.
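As a minimal sketch of this conversion step, the Python below dispatches each source type to an extractor and wraps the result in one XML carrier. The extractor functions, the `document`/`content` element names, and the file names are assumptions made for illustration; the patent does not specify them, and real Word or PDF parsing would need format-specific libraries.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Hypothetical extractors, one per source type. These stand-ins only decode
# bytes; they are not real Word or PDF parsers.
def read_word(raw: bytes) -> str:
    return raw.decode("utf-8")

def read_pdf(raw: bytes) -> str:
    return raw.decode("utf-8")

EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "word": read_word,
    "pdf": read_pdf,
}

def to_target_format(source_type: str, name: str, raw: bytes) -> str:
    """Convert one unstructured source into the uniform XML carrier."""
    text = EXTRACTORS[source_type](raw)
    doc = ET.Element("document", {"type": source_type, "name": name})
    ET.SubElement(doc, "content").text = text
    return ET.tostring(doc, encoding="unicode")

xml_docs = [
    to_target_format("word", "report.docx", b"quarterly revenue summary"),
    to_target_format("pdf", "manual.pdf", b"router installation guide"),
]
```

Because every source ends up in the same carrier, later steps only need to understand one format, which is the low coupling the paragraph describes.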
S2: identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
specifically, the processor identifies the content of the acquired data in the target format, so as to convert the data in the target format into corresponding structured data according to the identification result.
For example, the processor interprets the acquired data in the target format to determine what information each text is about, classifies and quantifies it, and analyzes it further to discover the characteristics of the data itself; it then converts the data into structured data in preparation for the subsequent index generation.
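The patent does not say how the content is recognized. Purely as an illustration, the sketch below assumes a keyword-based classifier and a word count as the "quantification", and assumes the XML carrier has a `content` element with `type` and `name` attributes. The keyword table and the `StructuredRecord` fields are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical keyword rules; a real system might use a trained classifier.
CATEGORY_KEYWORDS = {
    "finance": ("revenue", "invoice", "budget"),
    "network": ("router", "bandwidth", "lte"),
}

@dataclass
class StructuredRecord:
    name: str
    source_type: str
    category: str
    word_count: int
    text: str

def identify(xml_doc: str) -> StructuredRecord:
    """Classify and quantify one document, returning a structured record."""
    root = ET.fromstring(xml_doc)
    text = root.findtext("content", default="")
    words = text.lower().split()
    category = "unknown"
    for label, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            category = label
            break
    return StructuredRecord(
        name=root.get("name", ""),
        source_type=root.get("type", ""),
        category=category,
        word_count=len(words),
        text=text,
    )

record = identify(
    '<document type="pdf" name="manual.pdf">'
    "<content>router installation guide</content></document>"
)
```

The record has fixed, named fields, so it can be handed directly to an indexing step.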
S3: and generating a corresponding index for the structured data by calling an index generating tool of a preset server.
As an optional implementation manner of this embodiment, the preset server may include, but is not limited to, an enterprise-level search application server SOLR.
Specifically, the processor generates a corresponding index in the SOLR by calling an index generation tool of the SOLR according to the structured data.
As an optional implementation manner of this embodiment, the SOLR in this embodiment may expose an API similar to a Web service, so that a user submits an XML file in a given format to the search engine server through an HTTP request to generate an index; the user can then issue a search request through an HTTP GET operation and obtain the returned result in XML format.
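The HTTP exchange described above can be sketched as follows, using Solr's XML update message (`<add><doc><field name="...">`) posted to the `/update` handler and a query sent to the `/select` handler. The host, port, core name `documents`, and field names are assumptions; the requests are only constructed here, not sent.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SOLR_BASE = "http://localhost:8983/solr/documents"  # assumed host and core name

def build_add_request(record: dict) -> urllib.request.Request:
    """Build the HTTP POST that submits one document for indexing."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for field_name, value in record.items():
        ET.SubElement(doc, "field", {"name": field_name}).text = str(value)
    body = ET.tostring(add, encoding="utf-8")
    return urllib.request.Request(
        SOLR_BASE + "/update?commit=true",
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )

def build_search_url(query: str) -> str:
    """Build the HTTP GET URL for a search that returns XML."""
    params = urllib.parse.urlencode({"q": query, "wt": "xml"})
    return SOLR_BASE + "/select?" + params

request = build_add_request({"id": "manual.pdf", "category": "network"})
search_url = build_search_url("category:network")
# urllib.request.urlopen(request) would send the update to a running Solr instance.
```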
In the unstructured data processing method of this embodiment, the obtained unstructured data is processed to obtain data in a target format for content identification, the content of the data in the target format is identified, the data in the target format is converted into structured data according to an identification result, and then a corresponding index is generated for the structured data by calling an index generation tool of a preset server, so that unified access to various complex unstructured data sources can be achieved, and efficiency and effectiveness of processing large-scale unstructured data can be improved.
Further, as an optional implementation manner of the foregoing embodiment, the method further includes:
and processing each unstructured data in the acquired multiple types of unstructured data by adopting a synchronous or asynchronous remote procedure call protocol (RPC) calling method.
Specifically, in this embodiment, the processor may process each of the acquired multiple types of unstructured data by using a synchronous or asynchronous RPC calling method. For example, the processor automatically generates synchronous or asynchronous RPC calling method code of the unstructured data collection task according to the actual type of the unstructured data, and may automatically issue the code as a service to externally provide a service function for collecting the multiple types of unstructured data according to the SOA (service oriented architecture) principle.
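To illustrate the difference between the two calling styles, the sketch below uses a local function as a stand-in for the remote collection procedure and a thread pool for the asynchronous case. No real RPC transport is involved, and the function names are hypothetical; the patent's generated RPC code is not shown in the source.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple

def collect(source_type: str, name: str) -> str:
    """Stand-in for a remote collection procedure (no network involved)."""
    return f"collected {name} as {source_type}"

def call_sync(tasks: List[Tuple[str, str]]) -> List[str]:
    # Synchronous style: each call blocks until its result is returned.
    return [collect(source_type, name) for source_type, name in tasks]

def call_async(tasks: List[Tuple[str, str]], pool: ThreadPoolExecutor) -> List[Future]:
    # Asynchronous style: submit returns at once with a Future per call.
    return [pool.submit(collect, source_type, name) for source_type, name in tasks]

tasks = [("word", "a.docx"), ("pdf", "b.pdf")]
sync_results = call_sync(tasks)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = call_async(tasks, pool)
    async_results = [future.result() for future in futures]  # collect later
```

In the asynchronous style the caller is free to do other work between submitting the calls and reading the results.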
Further, as an optional implementation manner of the foregoing embodiment, the processing the acquired unstructured data in step S1 may further include:
and if the file of the unstructured data is judged to be damaged or unreadable, repeating the operation of processing the acquired unstructured data after a preset time.
Specifically, after the unstructured data is acquired, the processor determines the integrity, readability and the like of the acquired unstructured data, and if it is determined that the file itself of the unstructured data is damaged or unreadable, the processor repeatedly executes the operation of processing the acquired unstructured data after a preset time (for example, 5 minutes).
It should be noted that a judgment that the file of the unstructured data is damaged or unreadable does not necessarily mean the file is truly damaged or unreadable; it may be a misjudgment caused by the processor itself. Repeating the processing operation after a preset time therefore improves the quality of unstructured data processing and helps ensure the effectiveness and efficiency of data processing.
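A minimal sketch of this retry behavior follows. The 300-second default mirrors the 5-minute example above; the attempt limit, the exception type, and the injectable `sleep` parameter are assumptions added so that the example terminates and can be run without waiting.

```python
import time
from typing import Callable, List

class UnreadableFileError(Exception):
    """Raised when a file is judged damaged or unreadable."""

def process_with_retry(
    process: Callable[[], str],
    delay_seconds: float = 300.0,   # preset time, 5 minutes as in the example
    max_attempts: int = 3,          # assumed bound; the patent states none
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(1, max_attempts + 1):
        try:
            return process()
        except UnreadableFileError:
            if attempt == max_attempts:
                raise
            sleep(delay_seconds)  # wait the preset time, then try again
    raise RuntimeError("max_attempts must be at least 1")

# Demonstration: a reader that is misjudged as unreadable once, then succeeds.
attempts: List[int] = []

def flaky_read() -> str:
    attempts.append(1)
    if len(attempts) == 1:
        raise UnreadableFileError("transient read failure")
    return "file contents"

waits: List[float] = []
result = process_with_retry(flaky_read, sleep=waits.append)
```

Passing `waits.append` as `sleep` records the delay instead of blocking, which keeps the demonstration instant.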
Further, as an optional implementation manner of the foregoing embodiment, the method may further include:
after the unstructured data are obtained, fault tolerance processing is carried out on the obtained unstructured data according to a constructed fault tolerance library; the fault-tolerant library comprises all known processing rules and methods of unstructured data.
Specifically, after the processor acquires the unstructured data, the processor may perform fault-tolerant processing on the acquired unstructured data according to a pre-constructed fault-tolerant library.
It will be appreciated that in reading the unstructured data, a number of complex problems may be encountered, such as coding problems, formatting problems, special symbol handling problems, and so on. Therefore, in the embodiment, by setting a fault tolerance mechanism, for example, configuring the processing rules of the problems of encoding, format, special symbols, and the like of the unstructured data of all formats and the corresponding processing methods and the like through a preset fault tolerance library, it is ensured that when processing large-scale unstructured data, the whole task is not suspended due to a small problem of a certain file.
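One way to picture the goal of this mechanism is a batch loop in which each file passes through a small set of repair rules and a failure is recorded rather than propagated. The specific rules below (a UTF-8 then GBK decoding fallback, and removal of non-printable characters) are assumptions for illustration, not the contents of the patent's fault-tolerant library.

```python
from typing import Dict, List, Tuple

def fix_encoding(raw: bytes) -> str:
    # Try UTF-8 first, then fall back to GBK; this encoding list is assumed.
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("unknown encoding")

def strip_control_chars(text: str) -> str:
    # Drop non-printable characters but keep newlines and tabs.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

def process_batch(files: Dict[str, bytes]) -> Tuple[Dict[str, str], List[str]]:
    cleaned: Dict[str, str] = {}
    failed: List[str] = []
    for name, raw in files.items():
        try:
            cleaned[name] = strip_control_chars(fix_encoding(raw))
        except ValueError:
            failed.append(name)  # record the problem file and keep going
    return cleaned, failed

cleaned, failed = process_batch({
    "a.txt": "hello\x00world".encode("utf-8"),
    "b.txt": "中文".encode("gbk"),
    "c.bin": b"\xff\xff\xff",
})
```

The file that cannot be repaired ends up in `failed`, while the other two are still processed.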
Fig. 2 is a schematic flow chart of an unstructured data processing method according to another embodiment of the present invention, as shown in fig. 2, the method includes:
A1: a data agent layer in the processor processes various unstructured data sources, such as Word, Domino, PDF, and Excel, and wraps the interfaces through which the data sources are accessed using corresponding adapters, thereby connecting them to a unified data acquisition platform in the processor;
it can be understood that the differences of the unified data acquisition platform in processing various unstructured data sources can be shielded through the adapter, so that low coupling among a plurality of modules in the data agent layer is realized.
For example, the interface for reading Word files is wrapped by the adapter, which then returns a data carrier in a uniform format (such as an XML file).
A2: a data acquisition layer in the processor uniformly reads the XML files returned by the data agent layer, and the parsed XML file data is persisted on the platform (that is, stored in local storage space);
it can be understood that the specific processes of data reading and parsing performed by the data acquisition layer may be flexibly configured according to actual needs, which is not limited by the present invention.
A3: a data analysis layer in the processor intelligently understands the data collected by the data collection layer to determine what information each text in the data collection layer relates to, and then classifies, quantifies and further analyzes the text;
it can be understood that intelligent understanding, classification, quantification and further analysis of the data by the data analysis layer help to discover the characteristics of the data itself, and then convert the data into structured data to prepare for the next index generation work.
A4: the index generation layer in the processor calls the index generation tool of the SOLR according to the result provided by the data analysis layer, to generate a corresponding index in the SOLR.
On this basis, fig. 3 is a schematic flowchart of an unstructured data acquisition method according to an embodiment of the present invention, and as shown in fig. 3, the method includes:
C1: an interface adaptation controller in the adapter calls the corresponding interface to read data according to the actual type of the unstructured data;
Further, as an optional implementation manner of this embodiment, the interface adaptation controller is managed in a hot-pluggable manner, so that interfaces can be freely extended according to actual needs.
C2: after the interface is called, the interface adaptation controller generates a call record indicating that the calling task is finished, and registers the call record in the scheduling module;
C3: after registration, the scheduling module triggers the code of the calling task to connect to the corresponding unstructured file in the connection management module;
C4: after the connection is established, the adapter automatically generates synchronous or asynchronous RPC calling method code for the unstructured data acquisition task according to the actual condition of the interface;
C5: after the RPC calling method code is generated, the adapter automatically publishes the code of the task as a service in the management monitoring module, and provides service functions externally according to the SOA principle;
Further, as an optional implementation manner of the foregoing method embodiment, the method may further include:
C6: a retry mechanism is set in the adapter.
For example, when the interface adaptation controller calls the actual interface, if the unstructured data file itself is damaged or unreadable, the retry mechanism module calls the interface again to read the unstructured data file after 5 minutes.
Further, as an optional implementation manner of the foregoing method embodiment, the method may further include:
C7: a fault tolerance mechanism is provided in the adapter.
Unstructured data may encounter various complex problems when it is read, such as coding problems, format problems, and special symbol processing problems. By presetting a fault-tolerant library in the fault-tolerant mechanism module and configuring, for unstructured data of all formats, the processing rules and corresponding processing methods for problems of coding, format, special symbols, and the like, the whole task is not suspended because of a small problem in a single file when large-scale unstructured data is processed.
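Steps C1 and C2 and the hot-pluggable management of interfaces can be sketched as a registry whose entries are added and removed at runtime. The class names, the `plug`/`unplug` method names, and the shape of the call record are hypothetical; the scheduling module here only stores records and does not trigger the later steps.

```python
from typing import Callable, Dict, List

class Scheduler:
    """Minimal stand-in for the scheduling module: it only stores call records."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def register(self, record: dict) -> None:
        self.records.append(record)

class InterfaceAdaptationController:
    def __init__(self, scheduler: Scheduler) -> None:
        self._interfaces: Dict[str, Callable[[str], str]] = {}
        self._scheduler = scheduler

    def plug(self, data_type: str, reader: Callable[[str], str]) -> None:
        # Interfaces can be added while the controller is running.
        self._interfaces[data_type] = reader

    def unplug(self, data_type: str) -> None:
        self._interfaces.pop(data_type, None)

    def read(self, data_type: str, path: str) -> str:
        data = self._interfaces[data_type](path)  # C1: call the matching interface
        # C2: generate a call record and register it with the scheduler.
        self._scheduler.register({"type": data_type, "path": path, "done": True})
        return data

scheduler = Scheduler()
controller = InterfaceAdaptationController(scheduler)
controller.plug("word", lambda path: f"text of {path}")
content = controller.read("word", "a.docx")
controller.unplug("word")
```

After `unplug`, a read for that type raises `KeyError`, so callers can detect that an interface is no longer available.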
Further, fig. 4 shows a flow of a fault tolerance and retry method in the unstructured data acquisition process according to an embodiment of the present invention, and as shown in fig. 4, the method includes:
The unstructured data first enters the fault-tolerant layer and is verified against each specimen library in turn (for example, a coding specimen library, a format specimen library, a symbol specimen library, and other preset specimen libraries);
For example, when passing through the coding specimen library, if the unstructured data does not encounter a coding problem, it enters the next specimen library (the format specimen library);
If the unstructured data does not pass the coding specimen library, it enters the abnormality judgment module.
On this basis, if the abnormality is a coding problem that has not been found before (namely, a new coding problem) and the problem is reasonably representative, the logic and definition for handling the coding abnormality can be generated in the coding specimen library, and then the coding retry module of the retry layer is called;
If the coding abnormality is a special problem and is not representative, the coding retry module of the retry layer is called directly after the file is modified according to a preset rule.
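The flow above can be sketched as a chain of checks followed by a branch in the abnormality judgment. The two checks, the problem descriptions, and the way "representative" is decided (here simply passed in as a flag) are assumptions; the source does not define them, and "coding" is interpreted here as character encoding.

```python
from typing import Callable, List, Optional, Tuple

# A check returns a problem description, or None if the data passes.
Check = Callable[[bytes], Optional[str]]

def coding_check(raw: bytes) -> Optional[str]:
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError:
        return "not valid UTF-8"

def symbol_check(raw: bytes) -> Optional[str]:
    return "contains NUL byte" if b"\x00" in raw else None

SPECIMEN_LIBRARIES: List[Tuple[str, Check]] = [
    ("coding", coding_check),
    ("symbol", symbol_check),
]

def first_failure(raw: bytes) -> Optional[Tuple[str, str]]:
    """Run the data through each specimen library in order; stop at the first failure."""
    for library_name, check in SPECIMEN_LIBRARIES:
        problem = check(raw)
        if problem is not None:
            return library_name, problem
    return None

def handle(raw: bytes, representative: bool, known_problems: List[str]) -> str:
    failure = first_failure(raw)
    if failure is None:
        return "passed"
    library_name, problem = failure
    if representative and problem not in known_problems:
        known_problems.append(problem)  # record the new problem in the library
        return f"rule added to {library_name} library, then retry"
    return f"file modified by preset rule, then retry via {library_name} retry module"

known: List[str] = []
outcome_ok = handle(b"plain text", representative=False, known_problems=known)
outcome_new = handle(b"\xff\xfe", representative=True, known_problems=known)
outcome_special = handle(b"a\x00b", representative=False, known_problems=known)
```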
With the above scheme, the method of this embodiment addresses the limitation in the prior art on extracting various kinds of unstructured data. Various complex unstructured data sources are accessed uniformly through the adapter, and the method and the corresponding system have a clear, well-layered architecture. The processing of unstructured data takes an optimized way of reading data sources into account and emphasizes the handling of data entities within the unstructured data sources, which ensures the effectiveness and efficiency of data reading. Meanwhile, the adapter realizes hot-pluggable data processing, which improves the flexibility of system configuration, and the fault-tolerant mechanism provides a guarantee of efficiency and effectiveness when processing large-scale unstructured data sources.
Fig. 5 is a schematic structural diagram of an unstructured data processing apparatus according to an embodiment of the present invention, as shown in fig. 5, the apparatus includes a data processing unit 21, a content identifying unit 22, and an index generating unit 23, where:
the data processing unit 21 is configured to process the acquired unstructured data to acquire data in a target format;
the content identification unit 22 is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to the identification result;
the index generating unit 23 is configured to generate a corresponding index for the structured data by invoking an index generating tool of a preset server.
Specifically, the process of the device of this embodiment for performing unstructured data processing includes: the data processing unit 21 processes the acquired unstructured data to acquire data in a target format; the content recognition unit 22 recognizes the content of the data in the target format to convert the data in the target format into structured data according to the recognition result; the index generating unit 23 generates a corresponding index for the structured data by calling an index generating tool of a preset server.
Further, as an optional implementation manner of the above apparatus embodiment, the data processing unit 21 may be specifically configured to process each of the obtained multiple types of unstructured data to obtain the data in the target format corresponding to each of the unstructured data, respectively.
Further, as an optional implementation of the above device embodiment, the device may further include:
and the data storage unit is used for uniformly reading the data in the target formats and storing the read data locally.
The unstructured data processing apparatus described in this embodiment may be used to execute the above-described unstructured data processing method embodiment, and the principle and technical effect are similar, which are not described herein again.
It should be noted that, for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
Fig. 6 is a schematic structural diagram of an electronic device for implementing an unstructured data processing method according to an embodiment of the present invention, and as shown in fig. 6, the electronic device includes: a processor (processor)31, a bus 32 and a memory (memory)33, wherein the processor (processor)31 and the memory 33 communicate with each other through the bus 32. The processor 31 may call program instructions in the memory 33 to perform the following method:
processing the acquired unstructured data to acquire data in a target format;
identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
and generating a corresponding index for the structured data by calling an index generating tool of a preset server.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising:
processing the acquired unstructured data to acquire data in a target format;
identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
and generating a corresponding index for the structured data by calling an index generating tool of a preset server.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including:
processing the acquired unstructured data to acquire data in a target format;
identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
and generating a corresponding index for the structured data by calling an index generating tool of a preset server.
The invention identifies service priority through DSCP; compared with QCI levels, more services can be distinguished, and matching and interworking with upper-layer network elements is more convenient. The scheme is proposed on the basis of China Mobile's existing network scheme and service development. Its feasibility, deployability, and requirements on other network elements have clear advantages over existing schemes; it can efficiently realize differentiated scheduling by service level, fully considers the existing scheme and the current state of LTE network development, requires only minor changes to the existing network, and achieves a rapid and smooth evolution.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
The above-described embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. An unstructured data processing method, characterized by comprising:
processing the acquired unstructured data to acquire data in a target format;
identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
generating a corresponding index for the structured data by calling an index generation tool of a preset server;
the method further comprises an unstructured data acquisition step, in which:
an interface adaptation controller in an adapter calls a corresponding interface to read data according to the type of the actual unstructured data;
after the interface is called, the interface adaptation controller generates a call record indicating that the call task is complete, and registers the call record in a scheduling module;
after the registration is complete, the scheduling module triggers the code of the call task to establish a connection with the corresponding unstructured data in a connection management module;
after the connection is established, the adapter automatically generates synchronous or asynchronous RPC call method code for the unstructured data acquisition task according to the actual condition of the interface;
after the RPC call method code is generated, the adapter automatically publishes the code of the task as a service in a management monitoring module, and provides service functions externally according to SOA principles;
the method further comprises:
after the unstructured data is acquired, performing fault-tolerant processing on the acquired unstructured data according to a constructed fault-tolerant library;
the fault-tolerant library comprises all known processing rules and methods for all types of unstructured data, and the fault-tolerant processing specifically comprises the following steps:
the unstructured data first enters a fault-tolerant layer and is verified against each specimen library in the fault-tolerant library in sequence;
if the unstructured data fails verification against a specimen library, it enters an anomaly judgment module;
if the anomaly is a new coding problem and the new coding problem is representative, the logic and definition for handling the new coding problem are generated in the specimen library, and a coding retry module of a retry layer is then called; and if the new coding problem is a special case that is not representative, the coding retry module of the retry layer is called directly after the unstructured data is modified according to a preset rule.
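The fault-tolerant flow recited at the end of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the class and function names are hypothetical, each specimen library is reduced to a single pass/fail check, the "coding problem" is interpreted as a character-encoding problem, and a bounded retry count is added so the loop always terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

Check = Callable[[bytes], bool]   # returns True when the data passes
Fix = Callable[[bytes], bytes]    # returns repaired data


def is_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


@dataclass
class FaultTolerantLibrary:
    """Hypothetical model: named specimen checks plus recorded handling rules."""
    checks: Dict[str, Check]
    learned_fixes: Dict[str, Fix] = field(default_factory=dict)

    def first_failure(self, data: bytes) -> Optional[str]:
        # Verify the data against each specimen check in sequence.
        for name, check in self.checks.items():
            if not check(data):
                return name
        return None


def fault_tolerant_layer(
    data: bytes,
    library: FaultTolerantLibrary,
    judge: Callable[[str, bytes], Tuple[Fix, bool]],
    max_retries: int = 3,
) -> bytes:
    """Run the data through the library, retrying after each anomaly."""
    for _ in range(max_retries + 1):
        failed = library.first_failure(data)
        if failed is None:
            return data
        if failed in library.learned_fixes:
            # Known problem: apply the handling rule recorded earlier.
            data = library.learned_fixes[failed](data)
            continue
        # Anomaly judgment: obtain a fix and decide whether it is representative.
        fix, representative = judge(failed, data)
        if representative:
            # Record the handling logic; the next pass (the retry) applies it.
            library.learned_fixes[failed] = fix
        else:
            # One-off case: modify the data by the preset rule, then retry.
            data = fix(data)
    raise ValueError("data still fails verification after retries")


def example_judge(failed: str, data: bytes) -> Tuple[Fix, bool]:
    """Assumed policy: encoding failures are representative, others are not."""
    if failed == "utf8":
        return (lambda d: d.decode("gbk").encode("utf-8")), True
    return (lambda d: d.replace(b"\x00", b"")), False
```

With checks `{"utf8": is_utf8, "no_nul": lambda d: b"\x00" not in d}`, GBK-encoded input causes a transcoding rule to be recorded and reused, while input containing NUL bytes is repaired once without changing the library.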
2. The method of claim 1, wherein processing the obtained unstructured data to obtain data in a target format comprises:
processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.
3. The method of claim 2, further comprising:
and uniformly reading the data in the target formats, and storing the read data locally.
4. The method of claim 1, wherein the preset server comprises the enterprise-level search application server Solr.
5. The method of any of claims 1-4, wherein the processing the acquired unstructured data further comprises:
if the file containing the unstructured data is determined to be damaged or unreadable, repeating the operation of processing the acquired unstructured data after a preset time.
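The delayed retry of claim 5 can be illustrated with a short sketch. The function name, the choice of `OSError` and `ValueError` as the signal for a damaged or unreadable file, and the cap on the number of attempts are assumptions for this example; the claim itself only states that processing is repeated after a preset time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def process_with_retry(
    read: Callable[[], bytes],
    process: Callable[[bytes], T],
    delay_seconds: float = 5.0,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Read and process data, waiting a preset time between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return process(read())
        except (OSError, ValueError):
            # Treated here as "file damaged or unreadable" (assumption).
            if attempt == max_attempts:
                raise
            sleep(delay_seconds)
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the waiting behaviour can be exercised in tests without real delays.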
6. An unstructured data processing apparatus, comprising:
the data processing unit is used for processing the acquired unstructured data to acquire data in a target format;
the content identification unit is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;
the index generating unit is used for generating a corresponding index for the structured data by calling an index generating tool of a preset server;
the device further comprises an unstructured data acquisition unit, in which an interface adaptation controller in an adapter calls a corresponding interface to read data according to the type of the actual unstructured data; after the interface is called, the interface adaptation controller generates a call record indicating that the call task is complete, and registers the call record in a scheduling module; after the registration is complete, the scheduling module triggers the code of the call task to establish a connection with the corresponding unstructured data in a connection management module; after the connection is established, the adapter automatically generates synchronous or asynchronous RPC call method code for the unstructured data acquisition task according to the actual condition of the interface; after the RPC call method code is generated, the adapter automatically publishes the code of the task as a service in a management monitoring module, and provides service functions externally according to SOA principles;
the data processing unit is further configured to, after the unstructured data is acquired, perform fault-tolerant processing on the acquired unstructured data according to a constructed fault-tolerant library;
the fault-tolerant library comprises all known processing rules and methods for all types of unstructured data, and the fault-tolerant processing specifically comprises the following steps:
the unstructured data first enters a fault-tolerant layer and is verified against each specimen library in the fault-tolerant library in sequence;
if the unstructured data fails verification against a specimen library, it enters an anomaly judgment module;
if the anomaly is a new coding problem and the new coding problem is representative, the logic and definition for handling the new coding problem are generated in the specimen library, and a coding retry module of a retry layer is then called; and if the new coding problem is a special case that is not representative, the coding retry module of the retry layer is called directly after the unstructured data is modified according to a preset rule.
7. The apparatus according to claim 6, wherein the data processing unit is specifically configured to process each of the obtained multiple types of unstructured data to obtain the data in the target format corresponding to each of the unstructured data.
8. The apparatus of claim 7, further comprising:
and the data storage unit is used for uniformly reading the data in the target formats and storing the read data locally.
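The first two acquisition steps recited in claims 1 and 6 (calling an interface selected by data type, then registering a call record with the scheduling module) can be sketched as below. All class and method names are hypothetical, and the later steps (connection management, RPC code generation, and publication as a service) are intentionally omitted because this section does not describe them in enough detail to model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CallRecord:
    """Record stating that a call task for one data source has finished."""
    data_type: str
    source: str
    finished: bool = True


@dataclass
class SchedulingModule:
    """Hypothetical scheduler that simply stores registered call records."""
    records: List[CallRecord] = field(default_factory=list)

    def register(self, record: CallRecord) -> None:
        self.records.append(record)


class InterfaceAdaptationController:
    """Chooses a reader interface according to the type of unstructured data."""

    def __init__(self, scheduler: SchedulingModule) -> None:
        self._scheduler = scheduler
        self._readers: Dict[str, Callable[[str], bytes]] = {}

    def register_interface(self, data_type: str, reader: Callable[[str], bytes]) -> None:
        self._readers[data_type] = reader

    def acquire(self, data_type: str, source: str) -> bytes:
        reader = self._readers.get(data_type)
        if reader is None:
            raise ValueError(f"no interface registered for type {data_type!r}")
        data = reader(source)  # call the interface matching the data type
        # After the call, generate a call record and register it.
        self._scheduler.register(CallRecord(data_type, source))
        return data
```

Keeping the readers in a dictionary keyed by data type means a new kind of unstructured data can be supported by registering one more interface, without changing the controller.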
CN201611197679.3A 2016-12-22 2016-12-22 Unstructured data processing method and device Active CN108228664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611197679.3A CN108228664B (en) 2016-12-22 2016-12-22 Unstructured data processing method and device

Publications (2)

Publication Number Publication Date
CN108228664A CN108228664A (en) 2018-06-29
CN108228664B true CN108228664B (en) 2021-02-09

Family

ID=62656840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611197679.3A Active CN108228664B (en) 2016-12-22 2016-12-22 Unstructured data processing method and device

Country Status (1)

Country Link
CN (1) CN108228664B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109167724B (en) * 2018-09-10 2021-09-17 四川虹微技术有限公司 Method for realizing data preprocessing in API gateway
CN110275966B (en) * 2019-07-01 2021-10-01 科大讯飞(苏州)科技有限公司 Knowledge extraction method and device
CN111488333B (en) * 2020-05-18 2023-07-11 网易(杭州)网络有限公司 Data processing method and device, storage medium and electronic equipment
CN112883096B (en) * 2021-03-11 2024-04-30 广东工业大学 Data preprocessing method
CN114116935B (en) * 2021-11-17 2023-03-17 北京中知智慧科技有限公司 Method and system for retrieving geographic marker

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908178A (en) * 2010-08-13 2010-12-08 广州联奕信息科技有限公司 Middleware applied to data switching and data switching method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279506A (en) * 2013-05-15 2013-09-04 云南电力试验研究院(集团)有限公司电力研究院 Method for extracting journal paper unstructured data based on electric power technology
CN104239506A (en) * 2014-09-12 2014-12-24 北京优特捷信息技术有限公司 Unstructured data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
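Research and Implementation of Unstructured-to-Structured Data Conversion; Wan Lipeng (万里鹏); Wanfang Dissertation Database; 2013-10-30; abstract, pp. 7-10, pp. 57-66 *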

Also Published As

Publication number Publication date
CN108228664A (en) 2018-06-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant