CN107943763A - A large text data processing method - Google Patents

Info

Publication number
CN107943763A
Authority
CN
China
Prior art keywords
file
big
big text
processing method
data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711222445.4A
Other languages
Chinese (zh)
Inventor
江山
吴志勇
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Mian Information Technology Co Ltd
Original Assignee
Guangzhou Mian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Mian Information Technology Co Ltd
Priority to CN201711222445.4A
Publication of CN107943763A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/149 Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large text data processing method comprising the following steps: parsing the large text file into a stream; establishing a cursor mechanism on the file stream; and reading the file data and storing it in a database. The invention addresses the difficulty of loading large files and the heavy memory consumption during parsing that can cause a memory overflow. By parsing the file while reading it, the method strictly controls how much data is loaded into memory, so the file is parsed efficiently with a small memory footprint. Extracting the file data in segments through the cursor mechanism makes processing highly efficient.

Description

A large text data processing method
Technical field
The present invention relates to a processing method, and in particular to a large text data processing method.
Background
As a company's business volume grows, the amount of data it must process internally each day also increases sharply, so the files stored when data is archived grow considerably, and some files exceed 4 GB. It is well known that the FAT32 file system format used for disk partitions does not support files larger than 4 GB, and that reading a large file directly into memory requires a very large amount of memory, which can quickly exhaust the free memory of a computer or server or directly cause a memory overflow. Even when the computer has enough memory, filtering the required data out of memory is very slow. Developing an efficient way to process large text data is therefore of practical importance.
Summary of the invention
To overcome the shortcomings of the above technology, the present invention provides a large text data processing method.
To solve the above technical problem, the present invention adopts the following technical solution: a large text data processing method whose overall steps are:
Step 1: parse the large text file into a stream;
Step 2: establish a cursor mechanism on the file stream;
Step 3: read the file data and store it in a database.
In step 1, the large text file is parsed into a stream by parsing while reading.
In step 3, when the file is read, the file data is read in segments through the cursor mechanism established in step 2 and saved to the database.
The large text data includes data files stored in txt, excel, svg, and xml formats.
The present invention addresses the difficulty of loading large files and the heavy memory consumption during parsing that can cause a memory overflow. By parsing the file while reading it, the method strictly controls how much data is loaded into memory, so the file is parsed efficiently with a small memory footprint. Extracting the file data in segments through the cursor mechanism makes processing highly efficient.
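For the XML-based formats named above (xml and svg), parsing while reading can be sketched with `xml.etree.ElementTree.iterparse` from the Python standard library. This is only an illustration under stated assumptions: the patent does not name a programming language or parser, and the function name `stream_records` and the `row` tag used in the usage note are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_records(source, tag):
    """Yield (attributes, text) for each <tag> element as soon as it is parsed.

    `source` is a file name or a binary file object. `tag` is a hypothetical
    element name chosen for illustration; the patent does not specify one.
    """
    # iterparse reads the input incrementally instead of building the whole
    # tree first; an "end" event fires when an element's closing tag is seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy what we need before clearing the element.
            record = (dict(elem.attrib), (elem.text or "").strip())
            yield record
            # Release the element's text, attributes, and children so that
            # memory use stays small however large the file is.
            elem.clear()
```

A caller iterates over the generator, for example `for attrs, text in stream_records("data.xml", "row"): ...`. Cleared elements remain attached to the root as empty shells, so a production version would probably also detach them from their parent.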
Brief description of the drawings
Fig. 1 is the overall flow schematic diagram of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawing and specific embodiments.
In the large text data processing method shown in Fig. 1, the large text data includes large text data files stored in txt, excel, svg, and xml formats. The main steps of the method are as follows:
Step 1: Parse the large text file into a stream by parsing while reading. The file is first scanned, traversing every row in it; each row can be processed without keeping a reference to it, which effectively limits the memory used while parsing and avoids repeated reads, significantly improving read efficiency. Strictly controlling memory use ensures both that parsing a large file is feasible and that it is efficient.
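A minimal sketch of step 1 for a plain-text (txt) file, assuming Python and a tab-delimited layout; neither is specified by the patent, and the name `parse_rows` is hypothetical. Iterating over the file object reads one row at a time, and because the generator hands each row to the caller without storing it, no reference to earlier rows is kept.

```python
def parse_rows(path, encoding="utf-8", delimiter="\t"):
    """Yield (line_number, fields) for each row, parsing while reading.

    The tab delimiter is an assumption made for illustration only.
    """
    with open(path, "r", encoding=encoding) as stream:
        # The file object is a lazy iterator: only the current row (plus a
        # small internal buffer) is held in memory at any moment.
        for line_number, line in enumerate(stream, start=1):
            yield line_number, line.rstrip("\r\n").split(delimiter)
```

Because the file is traversed once from start to end, no row is read twice, which matches the "avoids repeated reads" point above.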
Step 2: Establish a cursor mechanism on the file stream, which greatly improves the efficiency of reading data. A cursor is a way of handling data: to inspect or process the data in a result set, a cursor provides the ability to move forward or backward through the result set one or more rows at a time. A cursor can be thought of as a pointer that can point to any position in the result set and then lets the user process the data at that position.
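One way to picture a cursor over a file stream is a thin wrapper around the stream's byte offset. The sketch below is an assumption for illustration, not the patent's implementation: the class name `FileCursor` and its methods are hypothetical, and it relies only on the standard `tell`, `seek`, and `readline` methods of a binary file object.

```python
class FileCursor:
    """A pointer into a file that fetches rows in segments.

    Hypothetical helper; the patent describes the idea, not this API.
    """

    def __init__(self, path, encoding="utf-8"):
        # Binary mode keeps tell()/seek() offsets simple byte counts.
        self._stream = open(path, "rb")
        self._encoding = encoding

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stream.close()

    def position(self):
        """Return the current byte offset of the cursor."""
        return self._stream.tell()

    def move_to(self, offset):
        """Move forward or backward to an offset previously returned by position()."""
        self._stream.seek(offset)

    def fetch(self, max_rows):
        """Read up to max_rows rows from the current position and advance the cursor."""
        rows = []
        for _ in range(max_rows):
            raw = self._stream.readline()
            if not raw:  # empty bytes means end of file
                break
            rows.append(raw.decode(self._encoding).rstrip("\r\n"))
        return rows
```

Offsets passed to `move_to` should come from `position()`, so that they always fall on a row boundary. An empty list from `fetch` signals that the file has been read completely.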
Step 3: Use the cursor mechanism to read the data in segments, one after another, and store in the database the portion of the data read that is selected by the filter conditions. Finally, determine whether reading has finished: if it has, the process ends; if not, reading continues in a loop.
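The read, filter, store, and loop cycle of step 3 might look like the following sketch. It assumes Python with the standard-library `sqlite3` module as a stand-in database; the table name `records`, the function name `load_filtered`, and the `keep` predicate are hypothetical, since the patent names neither a database nor a filter condition.

```python
import sqlite3
from itertools import islice


def load_filtered(lines, connection, keep, batch_size=1000):
    """Read `lines` in segments, keep the rows that satisfy `keep`, and store them.

    Returns the number of rows stored.
    """
    connection.execute("CREATE TABLE IF NOT EXISTS records (line TEXT)")
    iterator = iter(lines)
    stored = 0
    while True:
        # Pull at most batch_size rows; only this segment is held in memory.
        segment = list(islice(iterator, batch_size))
        if not segment:
            break  # nothing left to read, so the process ends
        wanted = [(line,) for line in segment if keep(line)]
        connection.executemany("INSERT INTO records (line) VALUES (?)", wanted)
        connection.commit()
        stored += len(wanted)
    return stored
```

Any lazy source of rows can be passed as `lines`, including an open file object, so the whole file never has to be loaded at once.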
A specific embodiment of the present invention is as follows:
1. Obtain the file through the DPS data-reduction system, then load the file to form a file object.
2. Call the file parsing function to parse the large text file into a stream by parsing while reading.
3. Establish a cursor mechanism on the file stream to guarantee efficiency during reading.
4. Call the sequence table function to process the parsed file stream into a sequence table. This processing reads the data in segments; using the cursor mechanism here ensures a small memory footprint and high efficiency while reading.
5. Store the sequence table by calling the database-insert function. The database can use an automatically generated database table or a manually created one (creating the table manually according to business needs is recommended).
6. Use the cursor mechanism to determine whether the file has been read completely. If it has, end the current process; if not, repeat the operations above.
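The six steps above can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: Python, the `sqlite3` standard-library database, a comma-delimited two-column file of well-formed rows, the table name `items`, and the function name `import_file`. The DPS system, parsing function, sequence table function, and database-insert function of the patent are not public APIs, so plain Python stands in for them.

```python
import sqlite3


def import_file(path, connection, segment_rows=1000, delimiter=","):
    """Load a `name,amount` text file into the database segment by segment."""
    # Step 5 (assumption): the table is created up front with a known schema.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS items (name TEXT, amount INTEGER)"
    )
    total = 0
    # Steps 1-3: open the file as a stream; its read position acts as the cursor.
    with open(path, "rb") as stream:
        while True:
            # Step 4: build one "sequence table" (a list of rows) per segment.
            sequence_table = []
            for _ in range(segment_rows):
                raw = stream.readline()  # advances the cursor by one row
                if not raw:
                    break
                name, amount = raw.decode("utf-8").rstrip("\r\n").split(delimiter)
                sequence_table.append((name, int(amount)))
            # Step 6: an empty segment means the file has been read completely.
            if not sequence_table:
                break
            # Step 5: store the segment, then loop back for the next one.
            connection.executemany("INSERT INTO items VALUES (?, ?)", sequence_table)
            connection.commit()
            total += len(sequence_table)
    return total
```

Memory use is bounded by `segment_rows` rather than by the file size, which is the property the embodiment relies on.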
By parsing while reading, the present invention effectively solves the problem that large files are difficult to process. A cursor is also established when the file stream is read, and extracting the file data in segments through the cursor makes processing highly efficient.
The above embodiment does not limit the present invention, and the present invention is not limited to the above example. Variations, modifications, additions, or substitutions made by those skilled in the art within the scope of the technical solution of the present invention also fall within the scope of protection of the present invention.

Claims (4)

  1. A large text data processing method, characterized in that the overall steps of the method are:
    Step 1: parse the large text file into a stream;
    Step 2: establish a cursor mechanism on the file stream;
    Step 3: read the file data and store it in a database.
  2. The large text data processing method according to claim 1, characterized in that in step 1 the large text file is parsed into a stream by parsing while reading.
  3. The large text data processing method according to claim 2, characterized in that in step 3, when the file is read, the file data is read in segments through the cursor mechanism established in step 2 and saved to the database.
  4. The large text data processing method according to claim 1 or 3, characterized in that the large text data includes data files stored in txt, excel, svg, and xml formats.
CN201711222445.4A 2017-11-29 2017-11-29 A kind of big text data processing method Pending CN107943763A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711222445.4A CN107943763A (en) 2017-11-29 2017-11-29 A kind of big text data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711222445.4A CN107943763A (en) 2017-11-29 2017-11-29 A kind of big text data processing method

Publications (1)

Publication Number Publication Date
CN107943763A true CN107943763A (en) 2018-04-20

Family

ID=61949458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711222445.4A Pending CN107943763A (en) 2017-11-29 2017-11-29 A kind of big text data processing method

Country Status (1)

Country Link
CN (1) CN107943763A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128208A (en) * 2021-04-26 2021-07-16 浙江百应科技有限公司 JSON file parsing method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101057234A (en) * 2004-11-15 2007-10-17 先进设计株式会社 Text data structure, text data processing method
CN102456053A (en) * 2010-11-02 2012-05-16 江苏大学 Method for mapping XML document to database
US8332209B2 (en) * 2007-04-24 2012-12-11 Zinovy D. Grinblat Method and system for text compression and decompression
CN105550176A (en) * 2014-10-29 2016-05-04 镇江华扬信息科技有限公司 Basic mapping method for relational database and XML
CN107368610A (en) * 2017-08-11 2017-11-21 北明智通(北京)科技有限公司 Big text CRF and rule classification method and system based on full text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASDFSX: "Python多进程分块读取超大文件的方法", 《HTTPS://WWW.JB51.NET/ARTICLE/82316.HTM》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128208A (en) * 2021-04-26 2021-07-16 浙江百应科技有限公司 JSON file parsing method and device and electronic equipment
CN113128208B (en) * 2021-04-26 2024-01-05 浙江百应科技有限公司 JSON file analysis method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180420