CN107943763A - A kind of big text data processing method - Google Patents
- Publication number
- CN107943763A (application number CN201711222445.4A)
- Authority
- CN
- China
- Prior art keywords
- file
- big
- big text
- processing method
- data processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/149—Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a big text data processing method comprising the following steps: parsing the big text into a stream; establishing a cursor mechanism in the file stream; and reading the file data and storing it in a database. The invention addresses the problems that a big file is difficult to load and that parsing it occupies so much memory that memory overflows. Parsing while reading strictly limits how much is loaded into memory, so the file is parsed efficiently with a small memory footprint. Extracting the data in segments through the cursor mechanism makes processing highly efficient.
Description
Technical field
The present invention relates to a processing method, and more particularly to a big text data processing method.
Background
As a company's business volume grows, the amount of data it must process internally each day grows sharply as well, so the files stored when data is archived become much larger, and some already exceed 4 GB. It is well known that the FAT32 file system format for disk partitions does not support files larger than 4 GB. Moreover, reading a big file directly into memory requires a very large amount of memory, which can quickly exhaust the free memory of the computer or server or directly cause a memory overflow. Even when the computer has enough memory, filtering the required data out of memory is very slow. Developing an efficient way of processing big text data is therefore of practical importance.
Summary of the invention
To overcome the shortcomings of the prior art described above, the present invention provides a big text data processing method.
To solve the above technical problem, the invention adopts the following technical solution: a big text data processing method whose overall steps are:
Step 1: parse the big text into a stream;
Step 2: establish a cursor mechanism in the file stream;
Step 3: read the file data and store it in a database.
In step 1, the big text is parsed into a stream by parsing while reading.
In step 3, the file data is read in segments through the cursor mechanism established in step 2 and saved to the database.
The big text data includes data files stored in txt, excel, svg, or xml format.
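For the XML-based formats in this list (xml and svg), parsing while reading can be illustrated with an incremental parser. The following is only a minimal sketch, assuming Python's standard `xml.etree.ElementTree.iterparse`; the patent does not name a parser, and the function name `stream_elements`, the `order` tag, and the sample data are hypothetical. Excel files would need a different reader and are not covered here.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, tag):
    """Yield (attributes, text) for each `tag` element as soon as it is parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Copy what is needed before the element is cleared.
            yield dict(elem.attrib), (elem.text or "").strip()
            # Release the element's content; an empty stub stays under the root.
            elem.clear()


# Hypothetical sample standing in for a very large XML file.
sample = io.BytesIO(
    b"<orders>"
    b"<order id='1'>pen</order>"
    b"<order id='2'>ink</order>"
    b"</orders>"
)
records = list(stream_elements(sample, "order"))
```

Because each element is cleared once it has been handled, the parser does not hold the content of the whole document at once, which is the property the method relies on.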
The invention addresses the problems that a big file is difficult to load and that parsing it occupies so much memory that memory overflows. Parsing while reading strictly limits how much is loaded into memory, so the file is parsed efficiently with a small memory footprint. Extracting the data in segments through the cursor mechanism makes processing highly efficient.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall flow of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 shows a big text data processing method. The big text data includes big text data files stored in txt, excel, svg, or xml format. The main steps of the method are as follows:
Step 1: Parse the big text into a stream by parsing while reading. The file is scanned so that every line is traversed in turn, and each line can be processed without keeping a reference to it. This keeps the memory occupied while parsing under control and avoids repeated reads, which significantly improves reading efficiency. Strict control of memory use ensures both that parsing a big file is feasible and that it is efficient.
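Step 1 can be sketched for a plain txt file as a generator that hands out one line at a time. This is an illustrative sketch rather than the patent's implementation; `iter_rows` and the temporary demonstration file are hypothetical.

```python
import os
import tempfile


def iter_rows(path, encoding="utf-8"):
    """Yield one line at a time; no list of all lines is ever built."""
    with open(path, "r", encoding=encoding) as stream:
        for line in stream:
            yield line.rstrip("\n")


# Small demonstration file standing in for a multi-gigabyte one.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a,1\nb,2\nc,3\n")
    demo_path = tmp.name

row_count = 0
longest = ""
for row in iter_rows(demo_path):
    # Each row is processed and then dropped; only the running results are kept.
    row_count += 1
    if len(row) > len(longest):
        longest = row
os.remove(demo_path)
```

Only the current line and the running results are referenced at any moment, so memory use stays roughly constant regardless of file size.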
Step 2: Establish a cursor mechanism in the file stream, which greatly improves the efficiency of reading data. A cursor is a method of handling data: to view or process the data in a result set, the cursor provides the ability to move forward or backward through the result set one row or several rows at a time. A cursor can be regarded as a pointer: it can point to any position in the result set and lets the user process the data at that position.
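One way to picture such a cursor over a file stream is a small class that remembers the byte offset at which each row starts. The patent does not specify how its cursor is built, so this is a sketch under assumptions: the class `StreamCursor` and its methods `fetch` and `move_to` are hypothetical, the stream must be binary and seekable, and backward movement is limited to rows that have already been read.

```python
import io


class StreamCursor:
    """Hypothetical cursor over a binary, seekable stream of lines."""

    def __init__(self, stream):
        self._stream = stream
        self._offsets = [stream.tell()]  # byte offset where each seen row starts
        self._row = 0                    # index of the row the cursor points at

    def fetch(self, count=1):
        """Read up to `count` rows forward and advance the cursor."""
        rows = []
        for _ in range(count):
            line = self._stream.readline()
            if not line:  # end of stream
                break
            rows.append(line.rstrip(b"\r\n").decode("utf-8"))
            self._row += 1
            if self._row == len(self._offsets):
                # First visit to this position: remember where the next row starts.
                self._offsets.append(self._stream.tell())
        return rows

    def move_to(self, row):
        """Jump to a row whose starting offset has already been recorded."""
        self._stream.seek(self._offsets[row])
        self._row = row


cursor = StreamCursor(io.BytesIO(b"r0\nr1\nr2\nr3\n"))
first = cursor.fetch(2)   # rows 0 and 1
cursor.move_to(1)         # step backward to row 1
again = cursor.fetch(2)   # rows 1 and 2
```

Storing one integer per row is far cheaper than storing the rows themselves, which keeps the cursor compatible with the small-memory goal of step 1.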
Step 3: Use the cursor mechanism to read the data in successive segments, filter each segment by the given conditions, and store the required data in the database. Finally, determine whether reading has finished: if so, end; if not, continue reading in a loop.
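The read, filter, store, and check-for-end loop of step 3 might look like the following sketch. It assumes SQLite through Python's standard `sqlite3` module as a stand-in database, since the patent does not name one; the function `load_in_segments`, the `records` table, and the filter condition are hypothetical.

```python
import io
import sqlite3
from itertools import islice


def load_in_segments(stream, conn, keep, segment_size=1000):
    """Read `segment_size` lines at a time, filter them, and store the matches."""
    stored = 0
    while True:
        segment = list(islice(stream, segment_size))
        if not segment:  # nothing left to read: reading has finished
            break
        rows = [(line.rstrip("\n"),) for line in segment if keep(line)]
        conn.executemany("INSERT INTO records (content) VALUES (?)", rows)
        conn.commit()
        stored += len(rows)
    return stored


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (content TEXT)")
source = io.StringIO("ok 1\nskip 2\nok 3\nok 4\nskip 5\n")
stored = load_in_segments(
    source, conn, keep=lambda s: s.startswith("ok"), segment_size=2
)
```

At most one segment is held in memory at a time, and committing per segment bounds how much work is lost if the load is interrupted.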
A specific embodiment of the present invention is as follows:
1. A file is obtained through the DPS data-reduction system and then loaded to form a file object.
2. The file parsing function is called to parse the big text into a stream by parsing while reading.
3. A cursor mechanism is established in the file stream to guarantee reading efficiency.
4. The sequence table function is called to turn the parsed file stream into a sequence table. This is done by reading in segments; using the cursor mechanism here keeps memory use small and reading efficient.
5. The sequence table is stored in the database by calling the database-loading function. The target can be an automatically generated database table or a manually created one (creating the table manually to suit the business is recommended).
6. The cursor mechanism is used to determine whether the file has been read completely. If so, the current process ends; if not, the operations above are repeated.
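The embodiment's option of an automatically generated database table can be illustrated as follows. This is a sketch under several assumptions that are not in the patent: the input is comma-separated text with a header row, a sequence table is modeled as a plain list of rows, every column is typed as TEXT, and SQLite stands in for the database. The names `load_csv` and `quote_ident` are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice


def quote_ident(name):
    """Quote a name for use as an SQLite identifier."""
    return '"' + name.replace('"', '""') + '"'


def load_csv(stream, conn, table, segment_size=500):
    """Create `table` from the header row, then load the rows segment by segment."""
    reader = csv.reader(stream)
    header = next(reader)  # the first row names the columns
    columns = ", ".join(f"{quote_ident(c)} TEXT" for c in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    insert = f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})"
    while True:
        seq_table = list(islice(reader, segment_size))  # one segment of rows
        if not seq_table:  # the file has been read completely
            return
        conn.executemany(insert, seq_table)
        conn.commit()


conn = sqlite3.connect(":memory:")
load_csv(io.StringIO("id,name\n1,pen\n2,ink\n3,pad\n"), conn, "items", segment_size=2)
loaded = conn.execute('SELECT id, name FROM "items" ORDER BY id').fetchall()
```

A generated table like this cannot know the intended column types or constraints, which is one plausible reason the embodiment recommends creating the table manually to suit the business.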
The present invention parses while reading, which effectively solves the problem that big files are difficult to process. A cursor is formed while the file stream is read, and the data in the file is extracted in segments through the cursor, which makes processing highly efficient.
The above embodiment does not limit the present invention, and the present invention is not limited to the example above. Variations, modifications, additions, or substitutions made by those skilled in the art within the scope of the technical solution of the present invention also fall within the protection scope of the present invention.
Claims (4)
- 1. A big text data processing method, characterized in that the overall steps of the method are: Step 1: parse the big text into a stream; Step 2: establish a cursor mechanism in the file stream; Step 3: read the file data and store it in a database.
- 2. The big text data processing method according to claim 1, characterized in that in step 1 the big text is parsed into a stream by parsing while reading.
- 3. The big text data processing method according to claim 2, characterized in that in step 3 the file data is read in segments through the cursor mechanism established in step 2 while the file is being read and is saved to the database.
- 4. The big text data processing method according to claim 1 or 3, characterized in that the big text data includes data files stored in txt, excel, svg, or xml format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711222445.4A CN107943763A (en) | 2017-11-29 | 2017-11-29 | A kind of big text data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711222445.4A CN107943763A (en) | 2017-11-29 | 2017-11-29 | A kind of big text data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107943763A true CN107943763A (en) | 2018-04-20 |
Family
ID=61949458
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711222445.4A Pending CN107943763A (en) | 2017-11-29 | 2017-11-29 | A kind of big text data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107943763A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113128208A (en) * | 2021-04-26 | 2021-07-16 | 浙江百应科技有限公司 | JSON file parsing method and device and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101057234A (en) * | 2004-11-15 | 2007-10-17 | 先进设计株式会社 | Text data structure, text data processing method |
CN102456053A (en) * | 2010-11-02 | 2012-05-16 | 江苏大学 | Method for mapping XML document to database |
US8332209B2 (en) * | 2007-04-24 | 2012-12-11 | Zinovy D. Grinblat | Method and system for text compression and decompression |
CN105550176A (en) * | 2014-10-29 | 2016-05-04 | 镇江华扬信息科技有限公司 | Basic mapping method for relational database and XML |
CN107368610A (en) * | 2017-08-11 | 2017-11-21 | 北明智通(北京)科技有限公司 | Big text CRF and rule classification method and system based on full text |
Non-Patent Citations (1)
Title |
---|
ASDFSX: "Python多进程分块读取超大文件的方法" (A method for reading very large files in chunks with Python multiprocessing), 《HTTPS://WWW.JB51.NET/ARTICLE/82316.HTM》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113128208A (en) * | 2021-04-26 | 2021-07-16 | 浙江百应科技有限公司 | JSON file parsing method and device and electronic equipment |
CN113128208B (en) * | 2021-04-26 | 2024-01-05 | 浙江百应科技有限公司 | JSON file analysis method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
RU2579899C1 (en) | Document processing using multiple processing flows | |
CN103455475B (en) | Composition method, equipment and system | |
US20150379026A1 (en) | Content fabric for a distributed file system | |
CN108536745B (en) | Shell-based data table extraction method, terminal, equipment and storage medium | |
CN105787012B (en) | A kind of method and storage system improving storage system processing small documents | |
CN105677742A (en) | Method and apparatus for storing files | |
US7370060B2 (en) | System and method for user edit merging with preservation of unrepresented data | |
WO2019041442A1 (en) | Method and system for structural extraction of figure data, electronic device, and computer readable storage medium | |
CN101996252A (en) | Expression method of indexing information for node element in XML (Extensive Makeup Language) file | |
JP2006202297A (en) | System and method for storing document in serial binary format | |
CN109783810A (en) | A kind of text handling method, device and computer readable storage medium | |
CN107943763A (en) | A kind of big text data processing method | |
CN102937948A (en) | Image-text data editing method for mobile terminal | |
CN107741968A (en) | A kind of method of document retrieval, system, device and computer-readable recording medium | |
CN110457264A (en) | Committee paper processing method, device, equipment and computer readable storage medium | |
CN107643892B (en) | Interface processing method, device, storage medium and processor | |
CN111427854B (en) | Stack structure realizing method, device, equipment and medium for supporting storage batch data | |
CN112446373B (en) | Method, system, computer device and storage medium for identifying converted image file | |
CN111401032B (en) | Text processing method, device, computer equipment and storage medium | |
CN104484174A (en) | Processing method and processing device for compressed file with RAR (Roshal A Rchive) format | |
CN103942186A (en) | Method and system for managing documents | |
CN104199894A (en) | Method and device for scanning files | |
CN113626420A (en) | Data preprocessing method and device and readable storage medium | |
CN107562452A (en) | Terminal preset application update method, intelligent terminal and the device with store function | |
CN113704214A (en) | Electronic file type conversion method and device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180420 |