CN111352907A - Method and device for analyzing pipeline file, computer equipment and storage medium

Info

Publication number: CN111352907A
Application number: CN202010237463.5A
Authority: CN
Inventors: 何川; 杨庆; 王晓青
Original assignee: Seezhi Data Technology Shanghai Co ltd
Current assignee: Seezhi Data Technology Shanghai Co ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-06-30

Abstract

The present application relates to the field of computer technologies, and in particular to a method and an apparatus for parsing a pipeline file, a computer device, and a storage medium. The method comprises: acquiring a pipeline file downloaded from a server; reading a first preset number of preset rows of the pipeline file, and extracting a header row from the read preset rows according to preset extraction logic; extracting the text information corresponding to each field in the header row; matching the text information against non-standard fields in a pre-stored dictionary library to determine the standard field corresponding to the text information, wherein the standard fields and the non-standard fields are stored in the dictionary library in association with each other; and extracting the pipeline data of the pipeline file according to the standard fields. The method is compatible with, and can parse, a variety of pipeline files.

Description

Method and device for analyzing pipeline file, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a method and an apparatus for parsing a pipeline file, a computer device, and a storage medium.

Background

At present, the online banking systems of the various banks are mature. Each bank has its own online banking system, and one important function of such a system is downloading the pipeline file of an enterprise, that is, the enterprise's bank transaction statement. Because the banks' systems are not consistent with one another, the pipeline files that an enterprise obtains from different banks differ in file format, file content, field names, and so on.

However, conventional pipeline file recognition is programmed against a specific bank's pipeline file template and a specific file format, and is written into the software by hard coding or similar means. The computer device must therefore obtain the bank's pipeline file template in advance, read the fields at specific positions in the file, and then find the pipeline field information required by the business system according to the corresponding field relationships. Because the positions must be preset, flexibility is poor and compatibility across different banks is insufficient; supporting another bank requires coding again, and the iteration cost is high.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for parsing a pipeline file that are compatible with different pipeline files.

A method for parsing a pipeline file, the method comprising:

acquiring a pipeline file downloaded from a server;

reading a first preset number of preset rows in the pipeline file, and extracting a header row from the read preset rows according to preset extraction logic;

extracting text information corresponding to each field in the header row;

matching the text information with non-standard fields in a pre-stored dictionary library to determine standard fields corresponding to the text information, wherein the standard fields and the non-standard fields are stored in the dictionary library in an associated manner;

and extracting the pipeline data of the pipeline file according to the standard field.

In one embodiment, the method further comprises:

generating a parsing template according to the correspondence between the text information and the standard field, and storing the generated parsing template;

after reading the first preset number of preset rows in the pipeline file, the method further comprises:

matching the content of each preset row with the header in a pre-stored parsing template;

and if no preset row matching the header exists, continuing to extract the header row from the read preset rows according to the field quantity.

In one embodiment, the parsing template further includes predefined parsing templates respectively corresponding to a plurality of servers, and the method further includes:

and if a preset row matching the header exists, extracting the pipeline data of the pipeline file through the parsing template.

In one embodiment, after the extracting the pipeline data of the pipeline file according to the standard field, the method further includes:

extracting first to-be-processed data corresponding to a date field from the extracted pipeline data;

and acquiring the pre-stored numeric features and separator features of the date field, and processing the first to-be-processed data according to the numeric features and separator features of the date field to obtain the date.

In one embodiment, after the extracting the pipeline data of the pipeline file according to the standard field, the method further includes:

extracting second to-be-processed data corresponding to the balance-related fields from the extracted pipeline data;

grouping the second to-be-processed data according to time;

calculating whether the running balance field in each group matches the second to-be-processed data corresponding to the transaction amount field;

if so, calculating the initial balance and the final balance corresponding to each group according to the second to-be-processed data corresponding to the running balance field and the transaction amount field in each group;

and if not, outputting prompt information.

In one embodiment, after the extracting the pipeline data of the pipeline file according to the standard field, the method further includes:

extracting third to-be-processed data corresponding to multiple remark-related fields from the extracted pipeline data;

and merging the third to-be-processed data to obtain a transaction description remark field.

In one embodiment, after acquiring the pipeline file downloaded from the server, the method further includes:

identifying a format of the pipeline file;

acquiring a data extraction engine corresponding to the identified format;

and performing data extraction on the pipeline file through the acquired data extraction engine to obtain the pipeline file stored in a two-dimensional array data format.

In one embodiment, the identifying the format of the pipeline file includes:

reading a second preset number of characters in the pipeline file;

and judging whether the number of first preset characters among the read characters is greater than a first preset value, and if so, determining that the format of the pipeline file is a binary format.

In one embodiment, after judging whether the number of first preset characters among the read characters is greater than the first preset value, the method further includes:

if the number of first preset characters among the read characters is not greater than the first preset value, performing encoding prediction on the read characters to obtain a processing encoding;

performing preliminary parsing on the pipeline file using the processing encoding;

and judging whether the number of second preset characters in the preliminarily parsed pipeline file is greater than a second preset value; if so, determining that the pipeline file is in an HTML/XML format, and otherwise determining that the pipeline file is a CSV text file.

In one embodiment, the performing encoding prediction on the read characters to obtain the processing encoding includes:

analyzing the read characters against a plurality of candidate encodings in a preset component to obtain a confidence for each of the candidate encodings;

and selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, after the analyzing the read characters against the plurality of candidate encodings in the preset component to obtain the confidence for each of the candidate encodings, the method further includes:

judging whether a Chinese encoding exists among the candidate encodings whose confidence ranks within a preset number of top positions;

if so, taking the Chinese encoding whose confidence ranks within the preset number of top positions as the processing encoding, and otherwise still selecting the candidate encoding with the highest confidence as the processing encoding.

A pipeline file parsing apparatus, the apparatus comprising:

the downloading module is used for acquiring the pipeline file downloaded from the server;

the header determining module is used for reading a first preset number of preset rows in the pipeline file and extracting a header row from the read preset rows according to preset extraction logic;

the first extraction module is used for extracting text information corresponding to each field in the header row;

the field determining module is used for matching the text information with non-standard fields in a pre-stored dictionary library to determine standard fields corresponding to the text information, and the standard fields and the non-standard fields are stored in the dictionary library in an associated manner;

and the second extraction module is used for extracting the pipeline data of the pipeline file according to the standard field.

A computer device comprising a memory storing a computer program and a processor implementing the steps of the method of any of the above when the computer program is executed.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the above.

According to the above method, apparatus, computer device, and storage medium for parsing a pipeline file, after the pipeline file is acquired, only the first preset number of preset rows need to be read, and the header row is determined from those preset rows through the preset extraction logic. The attribute represented by each field in the header row, that is, its standard field, can then be determined from the text information of that field and the dictionary library, and the pipeline data of the pipeline file can be extracted according to the standard fields. Different pipeline files can thus be parsed without setting up additional templates, which provides compatibility and greater flexibility and requires no re-coding.

Drawings

FIG. 1 is a diagram of an application environment of a method for parsing a pipeline file in an embodiment;

FIG. 2 is a flowchart of a method for parsing a pipeline file in one embodiment;

FIG. 3 is a diagram of a parsing template in one embodiment;

FIG. 4 is a schematic diagram of a pipeline file parsing process in one embodiment;

FIG. 5 is a block diagram of an apparatus for parsing a pipeline file in one embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The method for parsing a pipeline file provided by the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 may download a pipeline file from the server 104, read a first preset number of preset rows of the pipeline file, and extract a header row from the read preset rows according to preset extraction logic. The terminal 102 thereby obtains the text information of each field in the header row and matches it against the non-standard fields in a pre-stored dictionary library to determine the standard field of each field, that is, its meaning. The terminal 102 can then extract the pipeline data of the pipeline file according to the standard fields. This parsing process can handle different pipeline files without setting up additional templates, so it is compatible and flexible and requires no re-coding. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster formed by a plurality of servers.

In an embodiment, as shown in fig. 2, a method for parsing a pipeline file is provided, which is described by taking the method as an example of being applied to the terminal in fig. 1, and includes the following steps:

S202: acquiring the pipeline file downloaded from the server.

Specifically, the pipeline file is a file downloaded by the terminal from the server, and the server may be the server of each target bank, so that the pipeline file is the corresponding pipeline file downloaded from each bank's server. Optionally, the terminal first sends a file download request carrying a user identifier to the server; the server queries the corresponding pipeline file according to the user identifier, for example the pipeline file for the date corresponding to the file download request, and then sends the pipeline file to the terminal. Optionally, after receiving the pipeline file, the terminal may cache it first, and read it from the cache when that pipeline file needs to be processed.

S204: reading a first preset number of preset rows in the pipeline file, and extracting a header row from the read preset rows according to preset extraction logic.

Specifically, the first preset number may be set by the user, for example 50, in which case the terminal reads the first 50 rows of the pipeline file for analysis. The preset extraction logic is preset logic capable of extracting the header row from the preset rows.

The terminal may first read the first 50 rows of the pipeline file and then examine those 50 rows, sequentially or in parallel, to pick out candidate rows. For example, a preset row is taken as a candidate row when it does not contain any data, does not contain a preset character such as a colon, and has more non-empty fields than a preset field count, for example more than 3. The terminal then selects the candidate row with the largest number of non-empty fields as the header row.
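The candidate-row rule described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and the test for "contains data" (a cell that parses as a number) is an assumption, since the text does not define it.

```python
from typing import List, Optional


def looks_like_data(cell: str) -> bool:
    """Assumed test for 'data': the cell parses as a number, e.g. an amount."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def find_header_row(rows: List[List[str]], scan_limit: int = 50,
                    min_fields: int = 3, banned_chars: str = ":：") -> Optional[int]:
    """Return the index of the most likely header row, or None if there is none."""
    best_index, best_count = None, 0
    for index, row in enumerate(rows[:scan_limit]):
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        if len(cells) <= min_fields:
            continue  # needs more non-empty fields than the preset count
        if any(ch in cell for cell in cells for ch in banned_chars):
            continue  # rows such as "Account: 6222..." are not headers
        if any(looks_like_data(cell) for cell in cells):
            continue  # rows that carry data are not headers
        if len(cells) > best_count:
            best_index, best_count = index, len(cells)
    return best_index
```

Keeping the candidate with the most non-empty fields means that a wide header row wins over narrower title or summary rows that also pass the filters.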

S206: extracting the text information corresponding to each field in the header row.

Specifically, the text information refers to the content of each field. The header row contains a plurality of fields, and the text content of each field is its text information; for example, the content of one field is "transaction date" and the content of another is "balance".

S208: matching the text information with non-standard fields in a pre-stored dictionary library to determine the standard fields corresponding to the text information, wherein the standard fields and the non-standard fields are stored in the dictionary library in association with each other.

Specifically, the dictionary library is a database, a data table, or the like that stores the association between standard fields and non-standard fields. The standard fields are the standard expressions of the key fields in a bank pipeline file, such as own bank, own account number, own account name, transaction date, transaction time, transaction amount, currency, income or expenditure identifier, counterparty bank, counterparty account number, balance, transaction description remark, and transaction type. The non-standard fields are the non-standard expressions of those key fields; for example, "own bank" is sometimes written as "transaction bank".

The terminal matches the text information against the non-standard fields in the pre-stored dictionary library to determine the standard field corresponding to each piece of text information. The data attribute of each piece of text information, and therefore of each column of data, can thus be determined, and the pipeline data in the pipeline file can be extracted accordingly.

S210: extracting the pipeline data of the pipeline file according to the standard fields.

Specifically, after the standard field corresponding to each piece of text information is determined, the terminal extracts the pipeline data from the pipeline file and stores it under the corresponding standard fields, which completes the extraction of the pipeline data.
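As an illustration of S208 and S210, the following sketch maps header text to standard fields through a small in-memory dictionary and then stores each data row under those fields. The dictionary contents, the field identifiers, and the case-insensitive exact matching are all assumptions made for the example; the text only states that standard and non-standard fields are stored in association.

```python
from typing import Dict, List

# Hypothetical dictionary library: standard field -> known non-standard spellings.
DICTIONARY: Dict[str, List[str]] = {
    "transaction_date": ["transaction date", "trade date", "date"],
    "own_bank": ["own bank", "transaction bank"],
    "balance": ["balance", "account balance"],
}


def build_lookup(dictionary: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the dictionary so each non-standard spelling points to its standard field."""
    return {alias.strip().lower(): standard
            for standard, aliases in dictionary.items()
            for alias in aliases}


def map_header(header: List[str], lookup: Dict[str, str]) -> Dict[int, str]:
    """Return {column index: standard field} for every recognised header cell."""
    column_map = {}
    for index, text in enumerate(header):
        key = text.strip().lower()
        if key in lookup:
            column_map[index] = lookup[key]
    return column_map


def extract_records(rows: List[List[str]], header_index: int,
                    column_map: Dict[int, str]) -> List[Dict[str, str]]:
    """Store the cells of each row after the header under their standard fields."""
    records = []
    for row in rows[header_index + 1:]:
        records.append({field: row[col].strip()
                        for col, field in column_map.items() if col < len(row)})
    return records
```

Columns whose header text is not in the dictionary are simply left out of the result, so an unknown column does not block extraction of the recognised ones.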

According to the above method for parsing a pipeline file, after the pipeline file is acquired, only the first preset number of preset rows need to be read, and the header row is determined from those preset rows through the preset extraction logic. The attribute represented by each field in the header row, that is, its standard field, can then be determined from the text information of that field and the dictionary library, and the pipeline data of the pipeline file can be extracted according to the standard fields. Different pipeline files can thus be parsed without setting up additional templates, which provides compatibility and high flexibility and requires no re-coding.

In one embodiment, the method for parsing the pipeline file may further include: generating a parsing template according to the correspondence between the text information and the standard field, and storing the generated parsing template. After reading the first preset number of preset rows in the pipeline file, the method further includes: matching the content of each preset row with the header in a pre-stored parsing template; and if no preset row matching the header exists, continuing to extract the header row from the read preset rows according to the field quantity.

In one embodiment, the parsing template further includes predefined parsing templates respectively corresponding to a plurality of servers, and the method for parsing the pipeline file may further include: if a preset row matching the header exists, extracting the pipeline data of the pipeline file through the parsing template.

Specifically, a parsing template may be one generated in real time when no corresponding parsing template is found during parsing, or one of the predefined parsing templates respectively corresponding to the plurality of servers.

The terminal may generate, in advance, predefined parsing templates with specific positions according to the data format characteristics of each bank's pipeline file. The content mainly stored in a parsing template is shown in FIG. 3: a prescribed format is defined, the corresponding data can be obtained flexibly by means of regular expressions, and the parsing template defines the bank, the header information of the file, and the position information of each characteristic field of the pipeline data. In addition, to locate or intercept the characteristic fields of the pipeline data, the present application adopts a six-segment format separated by vertical bars (|), as follows:

row | value of the preceding cell | position value | value of the following cell | regular expression | group list (multiple groups may be separated by commas), for example:

# # account |5| denotes the column-5 value of any row whose column-4 value is "account";

# # # |5| house name denotes the column-5 value of any row whose column-6 value is "account name";

# # #5| denotes the value in column 5 of row 5;

# # # # |7| account number: (\\d+) |1 denotes that the account number is obtained from text of the form "account number: ..." in column 7, using group 1 of the regular expression.

In addition, a parsing template that is generated in real time, when the terminal finds no corresponding parsing template during parsing, may also include the above content; for details, refer to the content required in the parsing templates generated in advance from each bank's pipeline file.

When the terminal starts to parse a pipeline file, it may first try the predefined parsing templates generated from the banks' pipeline files. For example, the terminal reads a first preset number of preset rows of the pipeline file, such as the first 50 rows, and determines whether the field content of any of those rows is consistent with the header of a parsing template. If so, a matching parsing template exists, and the pipeline data in the pipeline file is extracted through that parsing template. If no preset row is consistent with the header of any parsing template, template matching has failed; the terminal then continues to extract the header row from the read preset rows according to the preset extraction logic and performs intelligent parsing without a parsing template. After the intelligent parsing is completed, the terminal generates a new parsing template from the header row located during the intelligent parsing and the attribute and position of each pipeline field. When a pipeline file needs to be processed later, the terminal matches it against both the parsing templates generated in advance from the banks' pipeline files and the newly generated parsing templates to determine whether template-based parsing can be performed; otherwise, it continues with intelligent parsing.
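The template-first flow with a fallback to intelligent parsing could look roughly like the sketch below. The template layout (a dict with `header` and `columns` keys), the exact-match comparison, and the `infer` callback standing in for the template-free parsing are all hypothetical choices for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Template = Dict[str, object]  # assumed layout: {"header": [...], "columns": {...}}


def match_template(rows: List[List[str]], templates: List[Template],
                   scan_limit: int = 50) -> Optional[Tuple[int, Template]]:
    """Look for a stored template whose header equals one of the first rows."""
    for template in templates:
        for index, row in enumerate(rows[:scan_limit]):
            if [cell.strip() for cell in row] == template["header"]:
                return index, template
    return None


def resolve_template(rows: List[List[str]], templates: List[Template],
                     infer: Callable[[List[List[str]]], Tuple[int, Dict[int, str]]]
                     ) -> Tuple[int, Template]:
    """Use a stored template when one matches; otherwise infer one and store it."""
    hit = match_template(rows, templates)
    if hit is not None:
        return hit
    # Template matching failed: fall back to template-free parsing.
    header_index, columns = infer(rows)
    template = {"header": [cell.strip() for cell in rows[header_index]],
                "columns": columns}
    templates.append(template)  # reused the next time a similar file arrives
    return header_index, template
```

Because the inferred template is appended to the same list that is searched first, a second file with the same header skips the inference step entirely.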

In this embodiment, the result of intelligent parsing is stored as a new parsing template according to the analysis of each data characteristic. When similar files are encountered later, the template-free intelligent parsing process does not have to be run every time, which speeds up pipeline data parsing; new templates can also be iterated more quickly, which improves the flexibility of pipeline parsing.

In one embodiment, after the pipeline data of the pipeline file is extracted according to the standard field, the method further includes: extracting first to-be-processed data corresponding to a date field from the extracted pipeline data; and acquiring the pre-stored numeric features and separator features of the date field, and processing the first to-be-processed data according to the numeric features and separator features of the date field to obtain the date.

In one embodiment, after the pipeline data of the pipeline file is extracted according to the standard field, the method further includes: extracting second to-be-processed data corresponding to the balance-related fields from the extracted pipeline data; grouping the second to-be-processed data according to time; calculating whether the running balance field in each group matches the second to-be-processed data corresponding to the transaction amount field; if so, calculating the initial balance and the final balance corresponding to each group according to the second to-be-processed data corresponding to the running balance field and the transaction amount field in each group; and if not, outputting prompt information.

In one embodiment, after the pipeline data of the pipeline file is extracted according to the standard field, the method further includes: extracting third to-be-processed data corresponding to multiple remark-related fields from the extracted pipeline data; and merging the third to-be-processed data to obtain a transaction description remark field.

Specifically, after extracting the pipeline data, the terminal needs to perform business processing on it, including correct extraction of date data, intelligent balance detection, storage of business-related information, and the like.

Specifically, for the correct extraction of date data, that is, the processing of the first to-be-processed data described above: since the field has been determined to be a date field, the values in the corresponding column may be written with different separators, for example with hyphens or with slashes between the numbers. The terminal therefore first obtains the numeric features and the separators in that column of the pipeline data, and then processes the first to-be-processed data according to the numeric features and separator features of the date field to obtain the date. For example, the terminal first reads the numbers, including the year, month, and day, and then reads the separators, so that the date format can be determined and the corresponding date stored.
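A minimal sketch of date normalization from numeric and separator features is given below. It assumes a year-first layout with a consistent separator (hyphen, slash, dot, or none); the text does not list the supported layouts, so this pattern and the function name are illustrative only.

```python
import re
from datetime import date
from typing import Optional

# Year, an optional separator, month, the same separator again, day.
DATE_PATTERN = re.compile(r"(\d{4})([-/.]?)(\d{1,2})\2(\d{1,2})")


def parse_date(text: str) -> Optional[date]:
    """Return the date found in text, or None if no valid year-first date is present."""
    match = DATE_PATTERN.search(text.strip())
    if match is None:
        return None
    year, _separator, month, day = match.groups()
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        return None  # digits were present but do not form a real calendar date
```

The backreference `\2` requires the second separator to equal the first, so a mixed form such as `2020-03/30` is rejected instead of being guessed at.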

Intelligent balance detection helps the user notice possibly falsified statements and serves as a reminder. To this end, the terminal extracts the second to-be-processed data corresponding to the balance-related fields from the extracted pipeline data and groups it by time; for example, if reminders are given per day, the data is grouped by day. The terminal then calculates whether the running balance field in each group matches the second to-be-processed data corresponding to the transaction amount field, for example by checking whether the running balance of each entry is consistent with its transaction amount. If they match, the terminal calculates the initial balance and the final balance of each group from the second to-be-processed data corresponding to the running balance field and the transaction amount field in that group, for example the initial and final balance of each day. If they do not match, an error exists, and the terminal outputs prompt information directly.
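The balance check can be illustrated as follows. The sketch assumes that entries are in chronological order, that amounts are signed (income positive, expenditure negative), and that each balance is the balance after the entry; none of these conventions is stated in the text, and the function name is hypothetical.

```python
from decimal import Decimal
from itertools import groupby
from typing import Dict, List, Tuple

Entry = Tuple[str, Decimal, Decimal]  # (day, signed amount, balance after the entry)


def check_balances(entries: List[Entry]) -> Tuple[Dict[str, Tuple[Decimal, Decimal]], List[str]]:
    """Return ({day: (initial balance, final balance)}, [days with inconsistencies])."""
    summary: Dict[str, Tuple[Decimal, Decimal]] = {}
    warnings: List[str] = []
    for day, group in groupby(entries, key=lambda entry: entry[0]):
        items = list(group)
        # The initial balance is the first balance with the first amount undone.
        initial = items[0][2] - items[0][1]
        previous, consistent = initial, True
        for _, amount, balance in items:
            if previous + amount != balance:
                consistent = False
            previous = balance
        if consistent:
            summary[day] = (initial, items[-1][2])
        else:
            warnings.append(day)  # the caller would show prompt information here
    return summary, warnings
```

`Decimal` is used instead of `float` so that amounts such as 0.10 compare exactly. Note that this check is only within a group; comparing one day's final balance with the next day's initial balance would be a natural extension.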

As for the storage of business-related information: a bank's pipeline file may contain several pieces of business-related information, such as remarks, summaries, and user messages, for example a note left by the user during a transfer, and these are presented in several fields. The terminal therefore extracts the third to-be-processed data corresponding to the multiple remark-related fields and merges it to obtain the transaction description remark field.
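Merging several remark-related fields into one transaction description might be sketched as below. The field names, the separator, and the removal of repeated text are assumptions for the example; the text only says that the fields are merged.

```python
from typing import Dict, Sequence


def merge_remarks(record: Dict[str, str],
                  remark_fields: Sequence[str] = ("remark", "summary", "user_message"),
                  separator: str = "; ") -> str:
    """Join the non-empty remark-related fields, keeping each distinct text once."""
    parts = []
    for field in remark_fields:
        value = (record.get(field) or "").strip()
        if value and value not in parts:
            parts.append(value)
    return separator.join(parts)
```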

In the above embodiment, after the pipeline file is processed to extract the pipeline data, such as own bank, own account number, own account name, transaction date, transaction time, transaction amount, currency, income or expenditure identifier, counterparty bank, counterparty account number, balance, transaction description remark, and transaction type, these data are further processed in different ways to lay the foundation for subsequent business processing.

Specifically, referring to FIG. 4, FIG. 4 is a schematic diagram of the pipeline file parsing process in one embodiment, which mainly shows the data parsing flow for a pipeline file. After receiving the pipeline file, the terminal needs to read it to obtain the pipeline file stored in a two-dimensional array data format, which lays the foundation for the subsequent data parsing. That is, in the file data reading stage, the pipeline file is input, its data format is first determined from the content of the file, and the pipeline data is then extracted row by row, so that the pipeline data parsing stage can analyze the specific pipeline characteristic information. In the pipeline data parsing stage, parsing is performed either through the keyword position information of a preset parsing template or intelligently through the different characteristics of the data, thereby obtaining the key characteristic field information of each pipeline entry in the pipeline file.

In one embodiment, after acquiring the pipeline file downloaded from the server, the method further includes: identifying a format of the pipeline file; acquiring a data extraction engine corresponding to the identified format; and performing data extraction on the pipeline file through the acquired data extraction engine to obtain the pipeline file stored in a two-dimensional array data format.

Specifically, the format of the pipeline file may be identified by first extracting a second preset number of characters from the pipeline file, for example the first 1024 characters, and determining the format from those characters. A data extraction engine corresponding to the format can then be acquired, and the terminal performs data extraction on the pipeline file through that engine to obtain the pipeline file stored in the two-dimensional array data format.

The format of the pipeline file may be a binary format, a text CSV format, or a text HTML/XML format. For a binary file, the terminal may call the POI component, an Excel parsing engine, to extract the data; for the text CSV format, the terminal may call the Apache CSV file parsing engine to extract the data; and for the text HTML/XML format, the terminal may call standalone Microsoft Excel software to convert the text into a table and then extract the data.

It should be noted that in this embodiment the extracted data is stored in a two-dimensional array data format: a table-like storage structure similar to an Excel sheet is used to assemble the data, and the extracted data is encapsulated according to the rows and columns in which it appears in the file. The two-dimensional array table makes it convenient to analyze the data row by row and field by field, and to move and position up, down, left, and right, which facilitates the subsequent parsing of the pipeline data.

In one embodiment, identifying the format of the pipeline file includes: reading a second preset number of characters in the stream file; and judging whether the number of first preset characters in the characters is greater than a first preset value or not, if so, the format of the streaming file is a binary format.

In one embodiment, after determining whether the number of the preset characters in the characters is greater than a preset value, the method further includes: if the number of first preset characters in the characters is not larger than a first preset value, processing codes are obtained by performing coding prediction on the read characters; carrying out preliminary analysis on the running file by processing the codes; and judging whether the number of second preset characters in the primarily analyzed flow file is greater than a second preset value or not, if so, determining that the flow file is in an HTML/XML format, otherwise, determining that the flow file is in a CSV text file.

Specifically, the terminal may first read a second preset number of characters in the running file, for example, read the top 1024 bytes of the running file to perform a judgment, and the terminal sequentially judges whether the 1024 bytes are the first preset characters, for example, whether the 1024 bytes are the \0 character, and if yes, the number of occurrences of the \0 character is greater than the first preset value, for example, 5, the format of the running file is determined to be the binary format. Otherwise, the file is a text file.

When the pipeline file is a text file, it may be a CSV text file or an HTML/XML file, and different bank systems may use different encodings, such as UTF8 or GBK. The terminal therefore first performs encoding prediction on the read characters to obtain the processing encoding, that is, the encoding used by the pipeline file, and preliminarily parses the pipeline file with it. The terminal then judges whether the number of second preset characters, which may be the angle brackets < or >, in the preliminarily parsed pipeline file is greater than the second preset value, for example 5. If so, the pipeline file is in an HTML/XML format; otherwise, it is a CSV text file.
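A minimal sketch of the text-format decision follows. It assumes that the text has already been decoded with the processing encoding and that both < and > are counted together against the threshold; the embodiment does not say whether the two brackets are counted jointly or separately, so that choice, like the function name, is an assumption.

```python
def classify_text_format(text: str, bracket_threshold: int = 5) -> str:
    """Classify decoded text as "html/xml" or "csv" by counting angle brackets.

    The threshold of 5 is the example value from the embodiment; counting
    "<" and ">" together is an assumption of this sketch.
    """
    brackets = text.count("<") + text.count(">")
    return "html/xml" if brackets > bracket_threshold else "csv"
```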

In addition, because files in the CSV format may use different separators, this embodiment presets 5 common separators, counts and ranks how often each separator occurs in the text, takes the most frequent one as the separator of the CSV text file, and extracts the data fields accordingly.
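The frequency-based separator choice might look like the sketch below. The embodiment does not name its five preset separators, so the candidate list here (comma, tab, semicolon, vertical bar, caret) is purely an assumption, as are the function names. The row split is deliberately naive and ignores quoted fields.

```python
from typing import List, Sequence

# Assumed candidate separators; the embodiment only says that five are preset.
CANDIDATE_SEPARATORS: Sequence[str] = (",", "\t", ";", "|", "^")


def detect_separator(text: str, candidates: Sequence[str] = CANDIDATE_SEPARATORS) -> str:
    """Return the candidate separator that occurs most often in the text.

    On a tie, the candidate listed first wins, because max() keeps the first maximum.
    """
    return max(candidates, key=text.count)


def split_rows(text: str) -> List[List[str]]:
    """Split the text into rows and fields using the detected separator."""
    separator = detect_separator(text)
    return [line.split(separator) for line in text.splitlines()]
```

For files with quoted fields that contain the separator, a real parser such as Python's `csv` module would be a safer choice than `str.split`.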

In this embodiment, the format is judged from only the second preset number of characters rather than from the whole file, so the judgment is both fast and accurate.

In one embodiment, obtaining the processing encoding by performing encoding prediction on the read characters includes: analyzing the read characters against a plurality of candidate encodings in a preset component to obtain a confidence for each candidate encoding; and selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, after analyzing the read characters against the plurality of candidate encodings in the preset component to obtain the confidence for each candidate encoding, the method further includes: judging whether a Chinese encoding exists among the candidate encodings whose confidence ranks within a preset number of top positions; if so, taking that Chinese encoding as the processing encoding, and otherwise still selecting the candidate encoding with the highest confidence as the processing encoding.

Specifically, the preset component may be the ICU component of IBM. The terminal analyzes the read characters against the different candidate encodings in the ICU component to obtain a confidence for each candidate encoding, and in general selects the candidate encoding with the highest confidence as the processing encoding. However, to prevent Korean or Japanese encodings from interfering with the prediction of the ICU component, the terminal first sorts the candidate encodings by confidence and judges whether a Chinese encoding exists among the candidates ranked within a preset number of top positions. If so, that Chinese encoding is taken as the processing encoding; otherwise, the candidate encoding with the highest confidence is still selected. The content of the text file is then read with the selected encoding for the subsequent format analysis.
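The selection rule above can be sketched independently of the detection component. The sketch assumes that the (name, confidence) pairs have already been produced by that component; the set of names treated as Chinese encodings, the default of three top positions, and the function name are all assumptions, since the embodiment specifies none of them.

```python
from typing import List, Tuple

# Assumed set of encoding names treated as "Chinese"; not specified in the embodiment.
CHINESE_ENCODINGS = {"GB18030", "GBK", "GB2312", "BIG5"}


def choose_encoding(candidates: List[Tuple[str, int]], top_n: int = 3) -> str:
    """Pick the processing encoding from (name, confidence) pairs.

    Prefer a Chinese encoding when one ranks within the top `top_n`
    candidates by confidence; otherwise fall back to the best candidate.
    """
    if not candidates:
        raise ValueError("no candidate encodings supplied")
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    for name, _confidence in ranked[:top_n]:
        if name.upper() in CHINESE_ENCODINGS:
            return name
    return ranked[0][0]
```

With `top_n=1` the function reduces to plain highest-confidence selection, which corresponds to the simpler embodiment described earlier.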

In this embodiment, the ICU component is introduced to determine the encoding of the pipeline file, and the format of the pipeline file is determined only after the file has been parsed with that encoding, which improves accuracy.

It should be understood that although the steps in the flowcharts of fig. 2 and 4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 4 may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, but may be performed at different times, in turn, or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 5, there is provided a pipelined file parsing apparatus, including: a download module 100, a header determination module 200, a first extraction module 300, a field determination module 400, and a second extraction module 500, wherein:

a downloading module 100, configured to obtain a streaming file downloaded from a server;

the header determining module 200 is configured to read preset rows of a first preset number in the stream file, and extract header rows from the read preset rows according to preset extraction logic;

the first extraction module 300 is configured to extract text information corresponding to each field in a header line;

a field determining module 400, configured to match the text information with a non-standard field in a pre-stored dictionary library to determine a standard field corresponding to the text information, where the standard field and the non-standard field are stored in the dictionary library in an associated manner;

and a second extraction module 500, configured to perform pipeline data extraction on the pipeline file according to the standard field.

In one embodiment, the pipelined file parsing apparatus may further include:

the template generating module is used for generating an analysis template according to the corresponding relation between the text information and the standard field and storing the generated analysis template;

the template matching module is used for matching the content in each preset row with a table header in a prestored analysis template after reading the first preset number of preset rows in the pipeline file;

the header determining module 200 is further configured to continue to extract the header row from the read preset rows according to the number of the fields if there is no preset row matching the header.

In one embodiment, the parsing template further includes predefined parsing templates respectively corresponding to the plurality of servers, and the apparatus for parsing the streaming file further includes:

and the third extraction module is used for extracting the pipeline data of the pipeline file through the analysis template if the preset line matched with the header exists.

In one embodiment, the pipelined file parsing apparatus may further include:

the date field extraction module is used for extracting first to-be-processed data corresponding to the date field from the extracted running data;

and the date acquisition module is used for acquiring the digital characteristics and the separator characteristics of the date field which are stored in advance, and processing the first data to be processed according to the digital characteristics and the separator characteristics of the date field to obtain the date.
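By way of a hedged example, date processing based on digit and separator features could be sketched as follows. The embodiment does not list the stored features, so this sketch assumes year-month-day order, either as one compact 8-digit run or as digit groups split by arbitrary separators; the function name is likewise an assumption.

```python
import re
from datetime import date


def parse_date(raw: str) -> date:
    """Parse a year-month-day date from raw cell text.

    Digit runs are the "digit feature" and anything between them is treated
    as a separator. Any time portion after the date is ignored.
    """
    groups = re.findall(r"\d+", raw)
    if groups and len(groups[0]) == 8:
        # Compact form such as 20200330.
        parts = [groups[0][:4], groups[0][4:6], groups[0][6:]]
    elif len(groups) >= 3 and len(groups[0]) == 4:
        # Separated form such as 2020-03-30 or 2020/3/30.
        parts = groups[:3]
    else:
        raise ValueError(f"unrecognized date text: {raw!r}")
    year, month, day = (int(part) for part in parts)
    return date(year, month, day)
```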

In one embodiment, the pipelined file parsing apparatus may further include:

the balance field extraction module is used for extracting second data to be processed corresponding to the balance related field from the extracted running data;

the grouping module is used for grouping the second data to be processed according to time;

the matching judgment module is used for calculating whether the running balance field in each group is matched with the second data to be processed corresponding to the transaction amount field;

the balance calculation module is used for calculating, if they match, the initial balance and the final balance corresponding to each group according to the second data to be processed corresponding to the running balance field and the transaction amount field in each group;

and the output module is used for outputting prompt information if they do not match.
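The cooperation of the grouping, matching judgment, and balance calculation modules might be sketched as below. This is only one plausible reading: it assumes rows are (date, amount, balance) strings in chronological order, that amounts are signed (negative for outflows), that groups are formed per date, and that a group matches when each balance equals the previous balance plus the current amount. The function name and the use of None to signal a mismatch are assumptions.

```python
from decimal import Decimal
from itertools import groupby
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str, str]  # (date, signed amount, running balance), assumed layout


def check_balances(rows: List[Row]) -> Dict[str, Optional[Tuple[Decimal, Decimal]]]:
    """Return {date: (initial balance, final balance)} or {date: None} on a mismatch."""
    result: Dict[str, Optional[Tuple[Decimal, Decimal]]] = {}
    for day, group in groupby(rows, key=lambda row: row[0]):
        items = [(Decimal(amount), Decimal(balance)) for _, amount, balance in group]
        # The initial balance is the first running balance minus the first amount.
        initial = items[0][1] - items[0][0]
        running = initial
        matched = True
        for amount, balance in items:
            running += amount
            if running != balance:
                matched = False  # here the output module would emit prompt information
                break
        result[day] = (initial, items[-1][1]) if matched else None
    return result
```

`Decimal` is used instead of floating point so that amounts such as 0.10 compare exactly.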

In one embodiment, the pipelined file parsing apparatus may further include:

the remark field extraction module is used for extracting third to-be-processed data corresponding to a plurality of remark related fields from the extracted running data;

and the merging module is used for merging the third data to be processed to obtain the transaction description remark field.

In one embodiment, the pipelined file parsing apparatus may further include:

the format identification module is used for identifying the format of the stream file;

the data extraction engine acquisition module is used for acquiring a data extraction engine corresponding to the identified format;

and the running water file storage module is used for extracting data of the running water file through the acquired data extraction engine to obtain the running water file stored in a two-dimensional array data format.

In one embodiment, the format recognition module includes:

the first character reading unit is used for reading a second preset number of characters in the stream file;

and the format judging unit is used for judging whether the number of first preset characters in the characters is greater than a first preset value or not, and if so, the format of the streaming file is a binary format.

In one embodiment, the format recognition module further includes:

the encoding prediction unit is used for performing encoding prediction on the read characters to obtain a processing encoding if the number of first preset characters in the characters is not greater than the first preset value;

the preliminary analysis unit is used for preliminarily parsing the pipeline file using the processing encoding;

the format judging unit is further used for judging whether the number of second preset characters in the preliminarily parsed pipeline file is greater than a second preset value, and if so, determining that the pipeline file is in an HTML/XML format, and otherwise determining that the pipeline file is a CSV text file.

In one embodiment, the encoding prediction unit may include:

the confidence calculation subunit is used for analyzing the read characters against a plurality of candidate encodings in the preset component to obtain a confidence for each candidate encoding;

and the selecting subunit is used for selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, the format recognition module further includes:

the Chinese encoding judging unit is used for judging whether a Chinese encoding exists among the candidate encodings whose confidence ranks within a preset number of top positions;

the encoding prediction unit is further used for taking that Chinese encoding as the processing encoding if so, and otherwise still selecting the candidate encoding with the highest confidence as the processing encoding.

For specific limitations of the pipeline file parsing apparatus, reference may be made to the above limitations of the pipeline file parsing method, which are not repeated here. All or part of the modules in the pipeline file parsing apparatus can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a pipelined file parsing method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring a streaming file downloaded from a server; reading a first preset number of preset rows in the stream file, and extracting header rows from the read preset rows according to preset extraction logic; extracting text information corresponding to each field in a header line; matching the text information with non-standard fields in a pre-stored dictionary library to determine standard fields corresponding to the text information, wherein the standard fields and the non-standard fields are stored in the dictionary library in an associated manner; and performing running data extraction on the running file according to the standard field.

In one embodiment, the processor, when executing the computer program, further performs the steps of: generating an analysis template according to the corresponding relation between the text information and the standard field, and storing the generated analysis template; after the processor reads the preset lines of the first preset number in the stream file when executing the computer program, the method further comprises the following steps: matching the content in each preset row with a table header in a prestored analysis template; and if the preset row matched with the header does not exist, continuously extracting the header row from the read preset row according to the field quantity.

In one embodiment, the parsing templates implemented when the processor executes the computer program further include predefined parsing templates respectively corresponding to the plurality of servers, and the processor executes the computer program further implements the following steps: and if the preset line matched with the header exists, performing running data extraction on the running file through the analysis template.

In one embodiment, after the pipeline data extraction of the pipeline file according to the standard field is implemented when the processor executes the computer program, the method further includes: extracting first to-be-processed data corresponding to a date field from the extracted running data; and acquiring the digital characteristic and the separator characteristic of the date field which are stored in advance, and processing the first data to be processed according to the digital characteristic and the separator characteristic of the date field to obtain the date.

In one embodiment, after the pipeline data extraction of the pipeline file according to the standard field is implemented when the processor executes the computer program, the method further includes: extracting second data to be processed corresponding to the balance related field from the extracted running data; grouping the second data to be processed according to time; calculating whether the running balance field in each group is matched with second data to be processed corresponding to the transaction amount field; if the balance is matched with the transaction amount field, calculating the initial balance and the final balance corresponding to each group according to the second data to be processed corresponding to the running balance field and the transaction amount field in each group; and if not, outputting prompt information.

In one embodiment, after the pipeline data extraction of the pipeline file according to the standard field is implemented when the processor executes the computer program, the method further includes: extracting third data to be processed corresponding to the multiple remark related fields from the extracted running water data; and merging the third data to be processed to obtain a transaction description remark field.

In one embodiment, the obtaining of the streaming file downloaded from the server when the processor executes the computer program further comprises: identifying a format of the pipeline file; acquiring a data extraction engine corresponding to the identified format; and performing data extraction on the running water file through the acquired data extraction engine to obtain the running water file stored in a two-dimensional array data format.

In one embodiment, identifying the format of the pipeline file as implemented by a processor executing the computer program comprises: reading a second preset number of characters in the stream file; and judging whether the number of first preset characters in the characters is greater than a first preset value or not, if so, the format of the streaming file is a binary format.

In one embodiment, after the determining whether the number of the preset characters in the characters is greater than the preset value is implemented when the processor executes the computer program, the method further includes: if the number of first preset characters in the characters is not greater than the first preset value, performing encoding prediction on the read characters to obtain a processing encoding; preliminarily parsing the pipeline file using the processing encoding; and judging whether the number of second preset characters in the preliminarily parsed pipeline file is greater than a second preset value, and if so, determining that the pipeline file is in an HTML/XML format, and otherwise determining that the pipeline file is a CSV text file.

In one embodiment, obtaining the processing encoding by performing encoding prediction on the read characters, as implemented when the processor executes the computer program, includes: analyzing the read characters against a plurality of candidate encodings in a preset component to obtain a confidence for each candidate encoding; and selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, after analyzing the read characters against the plurality of candidate encodings in the preset component to obtain the confidence for each candidate encoding, the processor, when executing the computer program, further implements the following steps: judging whether a Chinese encoding exists among the candidate encodings whose confidence ranks within a preset number of top positions; if so, taking that Chinese encoding as the processing encoding, and otherwise still selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a streaming file downloaded from a server; reading a first preset number of preset rows in the stream file, and extracting header rows from the read preset rows according to preset extraction logic; extracting text information corresponding to each field in a header line; matching the text information with non-standard fields in a pre-stored dictionary library to determine standard fields corresponding to the text information, wherein the standard fields and the non-standard fields are stored in the dictionary library in an associated manner; and performing running data extraction on the running file according to the standard field.

In one embodiment, the computer program when executed by the processor further performs the steps of: generating an analysis template according to the corresponding relation between the text information and the standard field, and storing the generated analysis template; after the computer program is executed by the processor to read a first preset number of preset lines in the stream file, the method further includes: matching the content in each preset row with a table header in a prestored analysis template; and if the preset row matched with the header does not exist, continuously extracting the header row from the read preset row according to the field quantity.

In one embodiment, the parsing templates implemented when the computer program is executed by the processor further include predefined parsing templates corresponding to the plurality of servers, respectively, and the processor further implements the following steps when executing the computer program: and if the preset line matched with the header exists, performing running data extraction on the running file through the analysis template.

In one embodiment, the computer program, when executed by a processor, further performs the following steps after performing the pipelined data extraction on the pipelined file according to the standard field: extracting first to-be-processed data corresponding to a date field from the extracted running data; and acquiring the digital characteristic and the separator characteristic of the date field which are stored in advance, and processing the first data to be processed according to the digital characteristic and the separator characteristic of the date field to obtain the date.

In one embodiment, the computer program, when executed by a processor, further performs the following steps after performing the pipelined data extraction on the pipelined file according to the standard field: extracting second data to be processed corresponding to the balance related field from the extracted running data; grouping the second data to be processed according to time; calculating whether the running balance field in each group is matched with second data to be processed corresponding to the transaction amount field; if the balance is matched with the transaction amount field, calculating the initial balance and the final balance corresponding to each group according to the second data to be processed corresponding to the running balance field and the transaction amount field in each group; and if not, outputting prompt information.

In one embodiment, the computer program, when executed by a processor, further performs the following steps after performing the pipelined data extraction on the pipelined file according to the standard field: extracting third data to be processed corresponding to the multiple remark related fields from the extracted running water data; and merging the third data to be processed to obtain a transaction description remark field.

In one embodiment, the computer program, when executed by the processor, further performs the following steps after obtaining the streaming file downloaded from the server: identifying a format of the pipeline file; acquiring a data extraction engine corresponding to the identified format; and performing data extraction on the running water file through the acquired data extraction engine to obtain the running water file stored in a two-dimensional array data format.

In one embodiment, identifying the format of the streamed file, as implemented by a computer program when executed by a processor, comprises: reading a second preset number of characters in the stream file; and judging whether the number of first preset characters in the characters is greater than a first preset value or not, if so, the format of the streaming file is a binary format.

In one embodiment, after the determining whether the number of the preset characters in the characters is greater than the preset value is implemented when the computer program is executed by the processor, the method further includes: if the number of first preset characters in the characters is not greater than the first preset value, performing encoding prediction on the read characters to obtain a processing encoding; preliminarily parsing the pipeline file using the processing encoding; and judging whether the number of second preset characters in the preliminarily parsed pipeline file is greater than a second preset value, and if so, determining that the pipeline file is in an HTML/XML format, and otherwise determining that the pipeline file is a CSV text file.

In one embodiment, obtaining the processing encoding by performing encoding prediction on the read characters, as implemented when the computer program is executed by the processor, includes: analyzing the read characters against a plurality of candidate encodings in a preset component to obtain a confidence for each candidate encoding; and selecting the candidate encoding with the highest confidence as the processing encoding.

In one embodiment, after analyzing the read characters against the plurality of candidate encodings in the preset component to obtain the confidence for each candidate encoding, the computer program, when executed by the processor, further implements the following steps: judging whether a Chinese encoding exists among the candidate encodings whose confidence ranks within a preset number of top positions; if so, taking that Chinese encoding as the processing encoding, and otherwise still selecting the candidate encoding with the highest confidence as the processing encoding.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for parsing a pipeline file, the method comprising:

acquiring a streaming file downloaded from a server;

extracting text information corresponding to each field in the header line;

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the parsing template further comprises predefined parsing templates respectively corresponding to a plurality of servers, the method further comprising:

4. The method according to any one of claims 1 to 3, wherein after the extracting the pipeline data from the pipeline file according to the standard field, the method further comprises:

5. The method according to any one of claims 1 to 3, wherein after the extracting the pipeline data from the pipeline file according to the standard field, the method further comprises:

grouping the second data to be processed according to time;

and if not, outputting prompt information.

6. The method according to any one of claims 1 to 3, wherein after the extracting the pipeline data from the pipeline file according to the standard field, the method further comprises:

7. The method according to any one of claims 1 to 3, wherein after the obtaining the streaming file downloaded from the server, further comprising:

identifying a format of the pipeline file;

acquiring a data extraction engine corresponding to the identified format;

8. The method of claim 7, wherein the identifying the format of the pipeline file comprises:

reading a second preset number of characters in the stream file;

9. The method according to claim 8, wherein after determining whether the number of the preset characters in the characters is greater than a preset value, the method further comprises:

10. The method of claim 9, wherein said processing encoding by encoding prediction of said read character comprises:

11. The method according to claim 10, wherein after parsing the read character through a plurality of codes to be selected in a preset component to obtain confidence levels corresponding to the plurality of codes to be selected, the method further comprises:

12. A pipelined file parsing apparatus, the apparatus comprising:

13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.

14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.