CN108228664B - Unstructured data processing method and device

Info

Publication number: CN108228664B
Application number: CN201611197679.3A
Authority: CN
Inventors: 陈毅
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd
Priority date: 2016-12-22
Filing date: 2016-12-22
Publication date: 2021-02-09
Anticipated expiration: 2036-12-22
Also published as: CN108228664A

Abstract

The invention relates to an unstructured data processing method and device. The method comprises: processing acquired unstructured data to obtain data in a target format; identifying the content of the data in the target format, and converting the data in the target format into structured data according to the identification result; and generating a corresponding index for the structured data by calling an index generating tool of a preset server. The method and device provide unified access to a variety of complex unstructured data sources, improve the handling of data entities within those sources, and keep data reading effective and efficient. They also support hot-pluggable data processing, which makes system configuration more flexible, and provide a fault-tolerant mechanism that keeps the processing of large-scale unstructured data sources efficient and effective.

Description

Unstructured data processing method and device

Technical Field

The present invention relates to the field of data service technologies, and in particular, to a method and an apparatus for processing unstructured data.

Background

Currently, data processing mainly relies on application programming interfaces (APIs), extract-transform-load (ETL) data warehouse tools, database (DB) interfaces, and message queues (MQ). Specifically, the API approach develops an interface program between applications, extracts the original data according to business logic over a communication protocol (such as SOAP or HTTP), and writes the extracted data into a target database. ETL uses existing, mature tools to establish a data channel between a data source and a data target, and imports data from the source library into the target library through a data engine. The DB interface operates between relational databases: it configures a database connection, reads the original data according to database table entries, and inserts the results into a target database. A message queue (MQ) encapsulates data as messages and sends the message data to the target through a queue.

However, each of these data processing methods has disadvantages when extracting data. Because the API approach is hard-coded, the code is inflexible and tightly coupled, changes are costly, and handling resources such as rich text objects and XML/HTTP is complex. Existing ETL tools have a high threshold for development and use, high development complexity and limited development flexibility, and they support resources such as rich text objects and XML/HTTP poorly. The DB interface is implemented at the bottom layer of the database, so it extends poorly to complex problems, is mostly limited to a single database, is difficult to use across heterogeneous databases, and supports resources such as rich text objects and XML/HTTP poorly. A message queue (MQ) processes data asynchronously, so it cannot meet strict real-time requirements, and it likewise supports resources such as rich text objects and XML/HTTP poorly.

Therefore, existing data processing methods offer poor compatibility and flexibility when extracting data from different types of data sources.

Disclosure of Invention

To address the poor compatibility and flexibility of existing data processing methods when extracting data from different types of data sources, the invention provides the following technical solution:

One aspect of the present invention provides an unstructured data processing method, including:

processing the acquired unstructured data to acquire data in a target format;

identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

and generating a corresponding index for the structured data by calling an index generating tool of a preset server.

Optionally, the processing the acquired unstructured data to acquire data in a target format includes:

processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.

Optionally, the method further comprises:

and uniformly reading the data in the target formats, and storing the read data locally.

Optionally, the preset server comprises an enterprise-level search application server SOLR.

Optionally, the method further comprises:

and processing each unstructured data in the acquired multiple types of unstructured data by adopting a synchronous or asynchronous remote procedure call protocol (RPC) calling method.

Optionally, the processing the acquired unstructured data further includes:

and if the file of the unstructured data is judged to be damaged or unreadable, repeatedly executing the operation of processing the acquired unstructured data after preset time.

Optionally, the method further comprises:

after the unstructured data are obtained, fault tolerance processing is carried out on the obtained unstructured data according to a constructed fault tolerance library;

the fault-tolerant library comprises all known processing rules and methods of unstructured data.

In another aspect, the present invention further provides an unstructured data processing apparatus, comprising:

the data processing unit is used for processing the acquired unstructured data to acquire data in a target format;

the content identification unit is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

and the index generating unit is used for generating a corresponding index for the structured data by calling an index generating tool of a preset server.

Optionally, the data processing unit is specifically configured to process each of the obtained multiple types of unstructured data to obtain data in the target format corresponding to each of the unstructured data.

Optionally, the apparatus further comprises:

and the data storage unit is used for uniformly reading the data in the target formats and storing the read data locally.

According to the unstructured data processing method and device, the acquired unstructured data are processed to acquire data in a target format for content identification, the content of the data in the target format is identified, the data in the target format is converted into structured data according to an identification result, and then a corresponding index is generated for the structured data by calling an index generation tool of a preset server, so that unified access to various complex unstructured data sources can be achieved, and the efficiency and effectiveness of processing large-scale unstructured data can be improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram illustrating an unstructured data processing method according to an embodiment of the invention;

FIG. 2 is a flow chart of an unstructured data processing method according to another embodiment of the invention;

FIG. 3 is a flow chart illustrating an unstructured data acquisition method according to an embodiment of the invention;

FIG. 4 is a flowchart illustrating a fault tolerance and retry method for an unstructured data acquisition process according to an embodiment of the present invention;

FIG. 5 is a block diagram of an unstructured data processing apparatus according to an embodiment of the invention;

FIG. 6 is a schematic structural diagram of an electronic device implementing the unstructured data processing method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of an unstructured data processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

S1: processing the acquired unstructured data to acquire data in a target format;

Specifically, the processor processes the acquired unstructured data, that is, performs format conversion on the unstructured data, to acquire data in a target format that can be used for content recognition.

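Further, as an optional implementation manner of this embodiment, step S1 may include:

processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.

Further, as an optional implementation manner of this embodiment, after the acquiring the data in the target format corresponding to each piece of unstructured data, the method may further include:

uniformly reading the data in the target formats, and storing the read data locally.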

For example, the processor may process multiple types (e.g., Word, Domino, Pdf, Excel, etc.) of unstructured data sources at the same time to convert the multiple types of unstructured data sources into a uniform target format (e.g., XML format) capable of performing content identification, so as to mask differences between multiple types of unstructured data sources for acquisition, i.e., achieve low coupling for acquisition of multiple types of unstructured data sources.
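The conversion into a uniform target format can be sketched as follows. This is only an illustration under stated assumptions: the function name `to_target_xml`, the element names `document` and `content`, and the sample inputs are hypothetical and do not come from the patent, and the text is assumed to have already been extracted from each source by a type-specific reader.

```python
import xml.etree.ElementTree as ET


def to_target_xml(source_type: str, name: str, text: str) -> str:
    """Wrap text extracted from one source in a uniform XML carrier.

    The element and attribute names are illustrative assumptions.
    """
    root = ET.Element("document", {"source": source_type, "name": name})
    body = ET.SubElement(root, "content")
    body.text = text  # ElementTree escapes <, > and & on serialization
    return ET.tostring(root, encoding="unicode")


# Hypothetical text already extracted from two differently typed sources.
samples = [
    ("word", "report.docx", "Quarterly revenue rose."),
    ("pdf", "manual.pdf", "Install <before> use & test."),
]
xml_docs = [to_target_xml(t, n, s) for t, n, s in samples]
```

Because every source ends up in the same carrier, the later content identification step only needs to understand one format, whichever reader produced the text.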

S2: identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

Specifically, the processor identifies the content of the acquired data in the target format, so as to convert the data in the target format into corresponding structured data according to the identification result.

For example, the processor intelligently interprets the acquired data in the target format to determine what information each text in the data relates to, then classifies, quantifies, and further analyzes the text to discover the characteristics of the data itself, and converts the data into structured data in preparation for the subsequent index generation.

S3: and generating a corresponding index for the structured data by calling an index generating tool of a preset server.

As an optional implementation manner of this embodiment, the preset server may include, but is not limited to, an enterprise-level search application server SOLR.

Specifically, the processor generates a corresponding index in the SOLR by calling an index generation tool of the SOLR according to the structured data.

As an optional implementation manner of this embodiment, the SOLR in this embodiment may expose an API similar to a Web service, so that a user submits an XML file in a specified format to the search engine server through an HTTP request to generate an index; the user can then issue a search request through an HTTP GET operation and obtain the returned result in XML format.
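As a rough illustration of this exchange, the sketch below builds an XML update message in the `<add><doc><field name="...">` form accepted by Solr's update handler, together with a query URL for the select handler. The host, port, core name `docs`, and the field names are assumptions made for the example, and no request is actually sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed deployment details, not taken from the patent.
SOLR_CORE_URL = "http://localhost:8983/solr/docs"


def build_add_message(record: dict) -> bytes:
    """Serialize one structured record as a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def build_query_url(base_url: str, query: str) -> str:
    """Build an HTTP GET search URL that asks for an XML response."""
    return f"{base_url}/select?" + urlencode({"q": query, "wt": "xml"})


# Hypothetical structured record produced by the content identification step.
record = {"id": "doc-1", "title": "Quarterly report", "category": "finance"}
payload = build_add_message(record)  # body for a POST to the update handler
update_url = f"{SOLR_CORE_URL}/update?commit=true"
query_url = build_query_url(SOLR_CORE_URL, "category:finance")
```

In a real deployment the payload would be posted to `update_url` with an XML content type, and `query_url` would be fetched with an HTTP GET; both steps are left out here so the sketch runs without a server.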

In the unstructured data processing method of this embodiment, the obtained unstructured data is processed to obtain data in a target format for content identification, the content of the data in the target format is identified, the data in the target format is converted into structured data according to an identification result, and then a corresponding index is generated for the structured data by calling an index generation tool of a preset server, so that unified access to various complex unstructured data sources can be achieved, and efficiency and effectiveness of processing large-scale unstructured data can be improved.

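Further, as an optional implementation manner of the foregoing embodiment, the method further includes:

processing each unstructured data in the acquired multiple types of unstructured data by adopting a synchronous or asynchronous remote procedure call (RPC) calling method.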

Specifically, in this embodiment, the processor may process each of the acquired multiple types of unstructured data by using a synchronous or asynchronous RPC calling method. For example, the processor automatically generates synchronous or asynchronous RPC calling method code of the unstructured data collection task according to the actual type of the unstructured data, and may automatically issue the code as a service to externally provide a service function for collecting the multiple types of unstructured data according to the SOA (service oriented architecture) principle.
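The difference between the synchronous and asynchronous calling styles can be illustrated with a small sketch. No real RPC framework is used here: the local function `collect_sync` is a hypothetical stand-in for a remote collection procedure, and all names are assumptions. The asynchronous variant relies on `asyncio.to_thread`, which requires Python 3.9 or later.

```python
import asyncio


def collect_sync(source_type: str, name: str) -> dict:
    """Stand-in for a blocking remote call that collects one source."""
    return {"source": source_type, "name": name, "status": "collected"}


async def collect_async(source_type: str, name: str) -> dict:
    # Run the blocking call in a worker thread so several calls can overlap.
    return await asyncio.to_thread(collect_sync, source_type, name)


async def collect_all(tasks):
    # gather() returns the results in the same order as the submitted tasks.
    return await asyncio.gather(*(collect_async(t, n) for t, n in tasks))


tasks = [("word", "a.docx"), ("pdf", "b.pdf"), ("excel", "c.xlsx")]
sync_result = collect_sync(*tasks[0])            # caller waits for the result
async_results = asyncio.run(collect_all(tasks))  # calls are issued concurrently
```

The synchronous form suits a single source whose result is needed immediately, while the asynchronous form lets one collection task cover many sources without waiting on each in turn.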

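Further, as an optional implementation manner of the foregoing embodiment, the processing the acquired unstructured data in step S1 may further include:

if the file of the unstructured data is judged to be damaged or unreadable, repeatedly executing the operation of processing the acquired unstructured data after a preset time.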

Specifically, after the unstructured data is acquired, the processor determines the integrity, readability and the like of the acquired unstructured data, and if it is determined that the file itself of the unstructured data is damaged or unreadable, the processor repeatedly executes the operation of processing the acquired unstructured data after a preset time (for example, 5 minutes).

It should be noted that when the processor judges the file of the unstructured data to be damaged or unreadable, the file may not actually be damaged or unreadable; the judgment may be a misjudgment caused by the processor itself. Repeating the operation of processing the acquired unstructured data after a preset time therefore improves the quality of unstructured data processing and helps ensure the effectiveness and efficiency of data processing.
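A delayed retry of this kind might look like the sketch below. The exception type, the function names, and the limit of three attempts are assumptions for illustration (the text above only specifies a preset waiting time, with 5 minutes as an example); the `sleep` parameter is injectable so that the demonstration does not actually wait.

```python
import time


class UnreadableFileError(Exception):
    """Raised when a file is judged damaged or unreadable (hypothetical)."""


def process_with_retry(process, path, delay_seconds=300.0, max_attempts=3,
                       sleep=time.sleep):
    """Run process(path); on an unreadable file, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return process(path)
        except UnreadableFileError:
            if attempt == max_attempts:
                raise  # give up after the assumed attempt limit
            sleep(delay_seconds)  # preset waiting time before the retry


# Demonstration: the first read is misjudged as unreadable, the second works.
attempts = []


def flaky_process(path):
    attempts.append(path)
    if len(attempts) < 2:
        raise UnreadableFileError(path)
    return f"converted:{path}"


waits = []  # records the requested delays instead of sleeping
result = process_with_retry(flaky_process, "report.docx", sleep=waits.append)
```

Here the transient failure is absorbed by a single retry, which matches the case described above where the first judgment was a misjudgment rather than real file damage.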

Further, as an optional implementation manner of the foregoing embodiment, the method may further include:

After the unstructured data are obtained, fault tolerance processing is carried out on the obtained unstructured data according to a constructed fault-tolerant library; the fault-tolerant library comprises all known processing rules and methods of unstructured data.

Specifically, after the processor acquires the unstructured data, the processor may perform fault-tolerant processing on the acquired unstructured data according to a pre-constructed fault-tolerant library.

It will be appreciated that many complex problems may be encountered when reading unstructured data, such as encoding problems, format problems, and special symbol handling problems. In this embodiment, a fault tolerance mechanism is therefore provided: for example, a preset fault-tolerant library configures the processing rules and corresponding processing methods for encoding, format, special symbol, and similar problems in unstructured data of all formats. This ensures that, when large-scale unstructured data is processed, the whole task is not suspended because of a minor problem in a single file.
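One way to picture such a mechanism is a small table of handling rules applied to every file, with failures recorded per file rather than aborting the batch. The rules below (trying UTF-8 and then GB18030, and stripping control characters) are illustrative assumptions and not the rule set of the fault-tolerant library described here.

```python
import re


def fix_encoding(raw: bytes) -> str:
    """Encoding rule: try a list of candidate encodings in order."""
    for enc in ("utf-8", "gb18030"):  # assumed candidate encodings
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort, lossy


def strip_control_chars(text: str) -> str:
    """Special-symbol rule: drop control characters that break parsers."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)


# Hypothetical rule table; more text rules can be appended here.
TEXT_RULES = [strip_control_chars]


def read_batch(files: dict):
    """Apply the rules to every file and never let one file stop the batch."""
    ok, failed = {}, {}
    for name, raw in files.items():
        try:
            text = fix_encoding(raw)
            for rule in TEXT_RULES:
                text = rule(text)
            ok[name] = text
        except Exception as exc:  # record the failure and keep going
            failed[name] = repr(exc)
    return ok, failed


files = {
    "a.txt": "hello\x00 world".encode("utf-8"),
    "b.txt": "数据".encode("gb18030"),
    "c.txt": None,  # simulates a file whose content could not be loaded
}
ok, failed = read_batch(files)
```

The third entry fails, but the first two are still processed, which is the behavior the paragraph above asks for when handling large batches.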

Fig. 2 is a schematic flow chart of an unstructured data processing method according to another embodiment of the present invention. As shown in Fig. 2, the method includes:

A1: a data agent layer in the processor processes various unstructured data sources, such as Word, Domino, PDF and Excel, and handles the interfaces through which the data sources are accessed by means of corresponding adapters, thereby connecting them to a unified data acquisition platform in the processor;

It can be understood that the adapters shield the unified data acquisition platform from the differences between the various unstructured data sources, so that low coupling among the modules in the data agent layer is realized.

For example, the interface for reading Word is wrapped by the adapter, which then returns a data carrier in a uniform format (such as an XML file).

A2: a data acquisition layer in the processor uniformly reads the XML files returned by the data agent layer, parses them, and persists the parsed XML data on the platform (that is, stores it in local storage space);

It can be understood that the specific processes of data reading and parsing performed by the data acquisition layer may be flexibly configured according to actual needs, which is not limited by the present invention.

A3: a data analysis layer in the processor intelligently interprets the data collected by the data acquisition layer to determine what information each text in the data relates to, and then classifies, quantifies and further analyzes the text;

It can be understood that the interpretation, classification, quantification and further analysis performed by the data analysis layer help to discover the characteristics of the data itself, after which the data is converted into structured data in preparation for the subsequent index generation.

A4: an index generation layer in the processor calls the index generating tool of the SOLR according to the result provided by the data analysis layer to generate a corresponding index in the SOLR.

On this basis, Fig. 3 is a schematic flowchart of an unstructured data acquisition method according to an embodiment of the present invention. As shown in Fig. 3, the method includes:

C1: an interface adaptation controller in the adapter calls the corresponding interface to read data according to the actual type of the unstructured data;

Further, as an optional implementation manner of this embodiment, the interface adaptation controller is managed in a hot-pluggable manner, so that the corresponding interfaces can be freely extended according to actual needs.

C2: after the interface is called, the interface adaptation controller generates a call record indicating that the calling task is finished, and registers the call record in the scheduling module;

C3: after the registration, the scheduling module triggers the code of the calling task to connect with the corresponding unstructured file in the connection management module;

C4: after the connection is established, the adapter automatically generates synchronous or asynchronous RPC calling method code for the unstructured data acquisition task according to the actual condition of the interface;

C5: after the RPC calling method code is generated, the adapter automatically publishes the code of the task as a service in the management monitoring module, and provides service functions externally according to the SOA principle;
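The hot-pluggable dispatch of steps C1 and C2 can be sketched as a registry that maps each data type to a reader, with readers added or removed at run time. The class and method names and the lambda readers are hypothetical; the scheduling, connection management, and service publication modules of steps C3 to C5 are not modeled.

```python
from typing import Callable, Dict, List, Tuple


class InterfaceAdaptationController:
    """Dispatches reads to whichever interface is registered for a type."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[str], str]] = {}
        self.call_records: List[Tuple[str, str]] = []

    def register(self, data_type: str, reader: Callable[[str], str]) -> None:
        """Plug in (or replace) the interface for one data type."""
        self._readers[data_type] = reader

    def unregister(self, data_type: str) -> None:
        """Unplug an interface without touching the others."""
        self._readers.pop(data_type, None)

    def read(self, data_type: str, path: str) -> str:
        reader = self._readers.get(data_type)
        if reader is None:
            raise KeyError(f"no interface registered for {data_type!r}")
        data = reader(path)
        self.call_records.append((data_type, path))  # record the finished call
        return data


controller = InterfaceAdaptationController()
controller.register("word", lambda p: f"word-text:{p}")  # stand-in reader
word_text = controller.read("word", "report.docx")

controller.register("pdf", lambda p: f"pdf-text:{p}")    # added at run time
pdf_text = controller.read("pdf", "manual.pdf")
controller.unregister("word")                            # removed at run time
```

Callers only ever see `read`, so supporting a new source type means registering one more reader rather than changing the calling code.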

Further, as an optional implementation manner of the foregoing method embodiment, the method may further include:

C6: a retry mechanism is set in the adapter.

For example, when the interface adaptation controller calls the actual interface, if the unstructured data file itself is damaged or unreadable, the retry mechanism module calls the interface again to read the unstructured data file after 5 minutes.

C7: a fault tolerance mechanism is provided in the adapter.

Unstructured data may encounter various complex problems when it is read, such as encoding problems, format problems, and special symbol handling problems. By presetting a fault-tolerant library in the fault tolerance mechanism module, and configuring in it the processing rules and corresponding processing methods for encoding, format, special symbol, and similar problems in unstructured data of all formats, the whole task is not suspended because of a minor problem in a single file when large-scale unstructured data is processed.

Further, Fig. 4 shows a flow of a fault tolerance and retry method in the unstructured data acquisition process according to an embodiment of the present invention. As shown in Fig. 4, the method includes:

the unstructured data first enters a fault-tolerant layer and is verified against each specimen library in turn (such as an encoding specimen library, a format specimen library, a symbol specimen library, and other preset specimen libraries);

for example, when the unstructured data passes through the encoding specimen library without encountering an encoding problem, it enters the next specimen library (the format specimen library);

if the unstructured data does not pass the encoding specimen library, it enters an anomaly judgment module.

On this basis, if the anomaly is an encoding problem that has not been found before (that is, a new encoding problem) and the problem is representative, the logic and definition for handling the encoding anomaly may be generated in the encoding specimen library, and then an encoding retry module of a retry layer is called;

if the encoding anomaly is a special case and is not representative, the encoding retry module of the retry layer is called directly after the file is modified according to a preset rule.
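The flow above can be sketched as a chain of specimen libraries with an anomaly judgment step and a bounded retry loop. Everything in the sketch is an assumption made for illustration: the problem labels `bom` and `crlf`, the callbacks that decide representativeness and build handlers, and the retry limit. The flow itself does not state how a problem already known to a library is handled; the sketch assumes the stored handler is reused.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Fix = Callable[[str], str]


@dataclass
class SpecimenLibrary:
    name: str
    detect: Callable[[str], Optional[str]]  # problem label, or None on a pass
    handlers: Dict[str, Fix] = field(default_factory=dict)  # known fixes


def run_fault_tolerance(data: str, libraries: List[SpecimenLibrary],
                        is_representative: Callable[[str], bool],
                        make_handler: Callable[[str], Fix],
                        preset_fix: Fix, max_retries: int = 3) -> str:
    for lib in libraries:  # verify against each specimen library in turn
        retries = 0
        while (problem := lib.detect(data)) is not None:
            if retries == max_retries:
                raise ValueError(f"{lib.name}: unresolved after {max_retries} retries")
            # Anomaly judgment: a new, representative problem gets handling
            # logic stored in the library so later files can reuse it.
            if problem not in lib.handlers and is_representative(problem):
                lib.handlers[problem] = make_handler(problem)
            # A one-off problem falls back to the preset modification rule.
            fix = lib.handlers.get(problem, preset_fix)
            data = fix(data)
            retries += 1  # the next loop pass re-verifies, i.e. the retry
    return data


encoding_lib = SpecimenLibrary(
    "encoding", lambda d: "bom" if d.startswith("\ufeff") else None)
format_lib = SpecimenLibrary(
    "format", lambda d: "crlf" if "\r\n" in d else None)

cleaned = run_fault_tolerance(
    "\ufeffline1\r\nline2",
    [encoding_lib, format_lib],
    is_representative=lambda problem: problem == "bom",
    make_handler=lambda problem: (lambda d: d.lstrip("\ufeff")),
    preset_fix=lambda d: d.replace("\r\n", "\n"),
)
```

After this run the encoding library has learned a handler for the representative problem, while the one-off format problem was repaired by the preset rule without extending the format library.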

According to the above scheme, the method of this embodiment overcomes the limitations of the prior art in extracting various kinds of unstructured data. Various complex unstructured data sources are accessed uniformly through the adapter, and the method and the corresponding system have a clear architecture with distinct layers. The processing of unstructured data takes an optimized data source reading mode into account and emphasizes the handling of data entities within the unstructured data sources, which ensures the effectiveness and efficiency of data reading. In addition, the adapter realizes hot-pluggable data processing, which improves the flexibility of system configuration, and the fault-tolerant mechanism keeps the processing of large-scale unstructured data sources efficient and effective.

Fig. 5 is a schematic structural diagram of an unstructured data processing apparatus according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes a data processing unit 21, a content identification unit 22, and an index generating unit 23, where:

the data processing unit 21 is configured to process the acquired unstructured data to acquire data in a target format;

the content identification unit 22 is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to the identification result;

the index generating unit 23 is configured to generate a corresponding index for the structured data by invoking an index generating tool of a preset server.

Specifically, the process of the device of this embodiment for performing unstructured data processing includes: the data processing unit 21 processes the acquired unstructured data to acquire data in a target format; the content recognition unit 22 recognizes the content of the data in the target format to convert the data in the target format into structured data according to the recognition result; the index generating unit 23 generates a corresponding index for the structured data by calling an index generating tool of a preset server.

Further, as an optional implementation manner of the above apparatus embodiment, the data processing unit 21 may be specifically configured to process each of the obtained multiple types of unstructured data to obtain the data in the target format corresponding to each of the unstructured data, respectively.

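Further, as an optional implementation of the above device embodiment, the device may further include:

a data storage unit, configured to uniformly read the data in the target formats and store the read data locally.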

The unstructured data processing apparatus described in this embodiment may be used to execute the above-described unstructured data processing method embodiment, and the principle and technical effect are similar, which are not described herein again.

It should be noted that, for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.

Fig. 6 is a schematic structural diagram of an electronic device for implementing an unstructured data processing method according to an embodiment of the present invention. As shown in Fig. 6, the electronic device includes: a processor 31, a bus 32 and a memory 33, wherein the processor 31 and the memory 33 communicate with each other through the bus 32. The processor 31 may call program instructions in the memory 33 to perform the following method:

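processing the acquired unstructured data to acquire data in a target format;

identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

and generating a corresponding index for the structured data by calling an index generating tool of a preset server.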

The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising:

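processing the acquired unstructured data to acquire data in a target format;

identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

and generating a corresponding index for the structured data by calling an index generating tool of a preset server.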

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including:

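processing the acquired unstructured data to acquire data in a target format;

identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

and generating a corresponding index for the structured data by calling an index generating tool of a preset server.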

The invention realizes service priority identification through DSCP; compared with QCI levels, more services can be distinguished, and matching and interfacing with upper-layer network elements are more convenient. The scheme is proposed on the basis of China Mobile's existing network scheme and service development. Compared with existing schemes, it has clear advantages in feasibility, deployability, and the requirements it places on other network elements, and it can efficiently realize service-level differentiated scheduling. It fully considers existing schemes and the actual state of current LTE network development, requires only minor changes to the existing network, and achieves rapid and smooth evolution.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The above-described embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An unstructured data processing method, characterized by comprising:

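processing the acquired unstructured data to acquire data in a target format;

identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

generating a corresponding index for the structured data by calling an index generating tool of a preset server;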

the method further comprises an unstructured data acquisition step:

an interface adaptation controller in the adapter calls a corresponding interface to read data according to the type of the actual unstructured data;

after the interface is called, the interface adaptation controller generates a call record indicating that the calling task is finished, and registers the call record in the scheduling module;

after the registration is finished, the scheduling module triggers the code of the calling task to connect with the corresponding unstructured data in a connection management module;

after the connection is established, the adapter automatically generates synchronous or asynchronous RPC calling method code for the unstructured data acquisition task according to the actual condition of the interface;

after the RPC calling method code is generated, the adapter automatically publishes the code of the task as a service in the management monitoring module, and provides service functions externally according to the SOA principle;

the method further comprises the following steps:

after the unstructured data are obtained, fault tolerance processing is carried out on the obtained unstructured data according to a constructed fault-tolerant library;

the fault-tolerant library comprises all known processing rules and methods for unstructured data of all types, and the fault tolerance processing specifically comprises the following steps:

the unstructured data first enters a fault-tolerant layer and is verified against each specimen library in the fault-tolerant library in turn;

if the unstructured data does not pass a specimen library, it enters an anomaly judgment module;

if the anomaly is a new encoding problem and the new encoding problem is representative, generating the logic and definition for handling the new encoding problem in the specimen library, and then calling an encoding retry module of a retry layer; and if the new encoding problem is a special case and is not representative, calling the encoding retry module of the retry layer directly after the unstructured data is modified according to a preset rule.

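2. The method of claim 1, wherein processing the obtained unstructured data to obtain data in a target format comprises:

processing each unstructured data in the acquired multiple types of unstructured data to respectively acquire the data in the target format corresponding to each unstructured data.

3. The method of claim 2, further comprising:

uniformly reading the data in the target formats, and storing the read data locally.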

4. The method of claim 1, wherein the preset server comprises an enterprise-level search application server SOLR.

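5. The method of any of claims 1-4, wherein the processing the acquired unstructured data further comprises:

if the file of the unstructured data is judged to be damaged or unreadable, repeatedly executing the operation of processing the acquired unstructured data after a preset time.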

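6. An unstructured data processing apparatus, comprising:

the data processing unit is used for processing the acquired unstructured data to acquire data in a target format;

the content identification unit is used for identifying the content of the data in the target format so as to convert the data in the target format into structured data according to an identification result;

the index generating unit is used for generating a corresponding index for the structured data by calling an index generating tool of a preset server;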

the apparatus further comprises an unstructured data acquisition unit, configured so that: an interface adaptation controller in the adapter calls a corresponding interface to read data according to the type of the actual unstructured data; after the interface is called, the interface adaptation controller generates a call record indicating that the calling task is finished, and registers the call record in the scheduling module; after the registration is finished, the scheduling module triggers the code of the calling task to connect with the corresponding unstructured data in a connection management module; after the connection is established, the adapter automatically generates synchronous or asynchronous RPC calling method code for the unstructured data acquisition task according to the actual condition of the interface; after the RPC calling method code is generated, the adapter automatically publishes the code of the task as a service in the management monitoring module, and provides service functions externally according to the SOA principle;

the data processing unit is further configured to, after acquiring the unstructured data, perform fault-tolerant processing on the acquired unstructured data according to a constructed fault-tolerant library;

7. The apparatus according to claim 6, wherein the data processing unit is specifically configured to process each of the obtained multiple types of unstructured data to obtain the data in the target format corresponding to each of the unstructured data.

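8. The apparatus of claim 7, further comprising:

a data storage unit, configured to uniformly read the data in the target formats and store the read data locally.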