CN111898340A - File processing method and device and readable storage medium - Google Patents


Info

Publication number
CN111898340A
Authority
CN
China
Prior art keywords
file
determining
csv file
csv
byte array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010750284.1A
Other languages
Chinese (zh)
Inventor
江国洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010750284.1A priority Critical patent/CN111898340A/en
Publication of CN111898340A publication Critical patent/CN111898340A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting

Abstract

The file processing method, device, and readable storage medium provided by the embodiments of the present disclosure comprise: receiving a character-separated values (CSV) file and converting the file content of the CSV file into a byte array; determining the encoding format of the CSV file according to the byte array; and determining the file data in the CSV file according to the encoding format. Because the CSV file is converted into a byte array and its encoding format is determined from that byte array, the CSV file can be parsed accurately, which avoids the garbled data and encoding errors caused by an unknown file encoding.

Description

File processing method and device and readable storage medium
Technical Field
The present disclosure relates to file processing technologies, and in particular, to a file processing method, device, and readable storage medium.
Background
A Character-Separated Values (CSV) file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters and contains no data that must be interpreted as binary numbers. A CSV file consists of any number of records separated by some line-break character; each record is made up of fields, and the separators between fields are other characters or strings, most commonly commas or tabs.
CSV files are often used as a format for exchanging data between different programs, so they frequently need to be read. However, a CSV file may use any of many encoding formats, so problems such as garbled text and encoding errors are likely to occur when the file is read.
Disclosure of Invention
The embodiments of the present disclosure provide a file processing method, a file processing device, and a readable storage medium, which are used to solve problems such as garbled data and encoding errors that arise when a CSV file is processed.
In a first aspect, an embodiment of the present disclosure provides a file processing method, including:
receiving a character-separated values (CSV) file, and converting file contents in the CSV file into a byte array;
determining the encoding format of the CSV file according to the byte array;
and determining file data in the CSV file according to the coding format.
In one possible design, the determining file data in the CSV file according to the encoding format includes:
determining an interpreter according to the encoding format, and reading the separators included in the byte array through the interpreter;
and determining file data in the CSV file according to the read delimiters.
In one possible design, the determining an encoding format of the CSV file from the byte array includes:
and identifying a character distribution mode in the byte array, and determining the encoding format of the CSV file according to the character distribution mode.
In one possible design, the determining file data in the CSV file according to the read delimiters includes:
determining target separators according to the number of the read various separators;
and determining file data in the CSV file according to the target separator.
In a possible design, when converting the file content in the CSV file into a byte array, if the file content includes a preset character, the encoding format of the CSV file is converted from a first type to a second type.
In one possible design, the converting the encoding format of the CSV file from the first type to the second type includes:
deleting the preset character, and determining the byte array according to the file content after the preset character is deleted.
In one possible design, the preset character is \ufeff.
In one possible design, further comprising:
packaging the file data into a document conforming to a preset format;
and displaying the packaged file data in the document according to the preset format.
In a second aspect, an embodiment of the present disclosure provides a file processing device, including:
the conversion module is used for receiving a character-separated values (CSV) file and converting file contents in the CSV file into a byte array;
the format determining module is used for determining the encoding format of the CSV file according to the byte array;
and the data extraction module is used for determining the file data in the CSV file according to the coding format.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of processing the file as described above in the first aspect and in various possible designs of the first aspect.
In a fourth aspect, the embodiments of the present disclosure provide a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the processing method of the file according to the first aspect and various possible designs of the first aspect is implemented.
The file processing method, device, and readable storage medium provided by the embodiments of the present disclosure comprise: receiving a character-separated values (CSV) file and converting the file content of the CSV file into a byte array; determining the encoding format of the CSV file according to the byte array; and determining the file data in the CSV file according to the encoding format. Because the CSV file is converted into a byte array and its encoding format is determined from that byte array, the CSV file can be parsed accurately, which avoids the garbled data and encoding errors caused by an unknown file encoding.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a method of processing a file according to an exemplary embodiment of the present disclosure;
FIG. 2A is a system architecture diagram illustrating an exemplary embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a user interface shown in an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a method of processing a file according to another exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of a file processing device shown in an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of a file processing device according to another exemplary embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device shown in an exemplary embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
CSV (Comma-Separated Values) files come in many common encoding formats, and when a CSV file is read, garbled data and encoding errors are likely to occur if the encoding format is unknown.
Common encoding formats for CSV files include:
GB2312: Simplified Chinese encoding;
BIG5: Traditional Chinese encoding;
GBK: supports both Simplified and Traditional Chinese;
UTF-8: one of the Unicode encodings;
GB18030: Chinese coded character set.
In the scheme provided by the disclosure, the CSV file is converted into the byte array, the coding format of the CSV file is determined according to the byte array, and the CSV file is read based on the determined coding format, so that the problem of file reading failure can be effectively avoided.
Fig. 1 is a flowchart illustrating a file processing method according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the method for processing a file provided in this embodiment includes:
step 101, receiving a CSV file, and converting file contents in the CSV file into a byte array.
Fig. 2A is a system architecture diagram illustrating an exemplary embodiment of the present disclosure.
As shown in fig. 2A, the system architecture may specifically include a user terminal 21 and a server 22. The user may operate on the user terminal 21 to upload the CSV file, the user terminal 21 may send the CSV file to the server 22, and the server 22 may execute the method provided in this embodiment to further process the CSV file.
Optionally, the method provided in this embodiment may also be executed by the user terminal 21, for example, the user may specify a CSV file, and the user terminal 21 processes the CSV file based on the method provided in this embodiment.
Fig. 2B is a schematic diagram of a user interface shown in an exemplary embodiment of the present disclosure.
As shown in fig. 2B, the user terminal 21 may display the interface, and the user may click a button for uploading the CSV file in the interface and select a file desired to be processed, thereby transmitting the file to the server 22 side so that the server 22 can process the file.
Wherein, the CSV (Comma-Separated Values) file has the following characteristics:
1. plain text, using some character set, such as ASCII, Unicode, EBCDIC, or GB 2312;
2. consisting of records (typically one record per line);
3. each record is separated into fields by separators (typical separators are commas, semicolons or tabs; sometimes separators may include optional spaces);
4. each record has the same field sequence.
Specifically, an electronic device may receive the CSV file; for example, the server 22 shown in fig. 2A, or the user terminal 21, may receive it.
Further, the electronic device may read the file content of the CSV file; for example, it may open the CSV file as a plain-text (txt) file and read its content. The content that is read may be stored in the form of a byte array.
A byte is a unit of information transmitted over a network (or stored on a hard disk or in memory). In ASCII, an English letter (regardless of case) or an English punctuation mark occupies one byte; in double-byte Chinese encodings such as GB2312 or GBK, a Chinese character or a Chinese punctuation mark occupies two bytes. In a high-level programming language, the byte is the smallest unit of data, and essentially all basic data types can be converted into byte arrays.
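As a rough illustration of the byte counts described above, the following Python sketch reads a file as raw bytes and measures how many bytes a character occupies under a given encoding. The function names `read_as_byte_array` and `byte_length` are hypothetical and do not come from the disclosure.

```python
def read_as_byte_array(path):
    """Read a file without decoding it, so no encoding is assumed yet."""
    with open(path, "rb") as f:
        return bytearray(f.read())


def byte_length(text, encoding):
    """Number of bytes that `text` occupies in the given encoding."""
    return len(text.encode(encoding))


if __name__ == "__main__":
    print(byte_length("A", "ascii"))   # an English letter
    print(byte_length("中", "gbk"))     # a Chinese character in GBK
    print(byte_length("中", "utf-8"))   # the same character in UTF-8
```

Opening the file in binary mode is the design choice that matters here: decoding is postponed until the encoding format has been determined.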
Step 102, determining the encoding format of the CSV file according to the byte array.
Specifically, the encoding format of the CSV file may be determined according to the byte array corresponding to the content of the CSV file. Because the byte array is obtained based on the content conversion of the CSV file, the byte array can directly embody the characteristics of the content of the CSV file.
For example, the encoding format may be determined according to the type of data in the byte array; as another example, the encoding format may be determined based on the distribution of characters in the byte array.
In an optional implementation manner, characteristics corresponding to each encoding format may also be collected in advance, and by comparing the byte array with the characteristics, the encoding format of the CSV file may also be determined.
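One concrete kind of pre-collected characteristic is the byte-order mark that some encodings place at the start of a file. The sketch below compares the head of the byte array against a small signature table. The table contents and the name `sniff_signature` are assumptions made for illustration; encodings such as GBK or BIG5 carry no such signature, so this check alone cannot identify them.

```python
import codecs

# Longer signatures come first: the UTF-32 LE mark begins with the UTF-16 LE mark.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_signature(buff):
    """Return a codec name if the byte array starts with a known signature, else None."""
    head = bytes(buff[:4])
    for signature, encoding in SIGNATURES:
        if head.startswith(signature):
            return encoding
    return None
```

A result of `None` simply means that no signature matched, and another detection method is still needed.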
Further, common coding formats for CSV files include:
GB2312, the Chinese coded character set for information interchange, is a Simplified Chinese encoding suitable for information exchange between systems such as Chinese character processing and Chinese character communication;
BIG5, also known as the Big Five code, is a Traditional Chinese encoding and the most common computer character set standard in Traditional Chinese communities;
GBK, the Chinese character internal code extension specification, supports both Simplified and Traditional Chinese;
UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode;
GB18030, the information technology Chinese coded character set.
Optionally, the byte array may be first subjected to validity detection, so as to determine whether the data is a legal representation. If legal, the encoding format can be further determined.
Optionally, if the data is illegal or the corresponding encoding format is not determined, the encoding format may be determined as UTF-8 by default.
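The validity check and the UTF-8 default can be sketched as follows, assuming that "legal" means the byte array decodes without error under a candidate encoding. The candidate list, its order, and the name `detect_encoding` are illustrative assumptions, not the procedure specified by the disclosure.

```python
def detect_encoding(buff, candidates=("utf-8", "gb18030", "big5"), default="utf-8"):
    """Return the first candidate under which the bytes decode cleanly.

    Falls back to `default` when no candidate accepts the data.
    """
    data = bytes(buff)
    for encoding in candidates:
        try:
            data.decode(encoding)  # strict decoding acts as the validity check
        except UnicodeDecodeError:
            continue
        return encoding
    return default
```

The order of candidates matters: many BIG5 byte sequences are also valid GB18030, so a strict-decoding check cannot reliably tell those two apart, and a real detector would need additional evidence.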
Step 103, determining file data in the CSV file according to the encoding format.
In practical application, the content of the CSV file can be traversed according to the determined encoding format, and then the file data in the CSV file can be obtained.
Wherein, a separator is arranged in the CSV file, and each record is separated into fields by the separator. Therefore, the electronic equipment can identify the CSV file content according to the determined coding format, determine the separators in the CSV file content and read the file data in the CSV file content based on the separators.
Specifically, the CSV file content may include punctuation marks, and punctuation marks may also serve as separators. Therefore, each kind of delimiter appearing in the CSV file can be read, and the delimiter that occurs most often can be taken as the target delimiter. The file data is then determined from the content separated by the target delimiter; for example, the content between two target delimiters may be regarded as one item of file data.
Further, in the method provided in this embodiment, the file data that is read may be packaged according to a preset format; for example, the file data may be filled into a table of the preset format, so that a table is generated from the CSV file.
The method provided by the present embodiment is used for processing a CSV file, and is performed by a device provided with the method provided by the present embodiment, and the device is generally implemented in a hardware and/or software manner.
The file processing method provided by this embodiment comprises: receiving a character-separated values (CSV) file and converting the file content of the CSV file into a byte array; determining the encoding format of the CSV file according to the byte array; and determining the file data in the CSV file according to the encoding format. Because the CSV file is converted into a byte array and its encoding format is determined from that byte array, the CSV file can be parsed accurately, which avoids the garbled data and encoding errors caused by an unknown file encoding.
Fig. 3 is a flowchart illustrating a file processing method according to another exemplary embodiment of the present disclosure.
As shown in fig. 3, the method for processing a file provided by the present disclosure includes:
step 301, receiving the CSV file, and converting the file content in the CSV file into a byte array.
The detailed principle and implementation of step 301 are similar to those of step 101, and are not described herein again.
When the file content in the CSV file is converted into a byte array, if the file content comprises preset characters, the coding format of the CSV file is converted from a first type to a second type.
If the file content comprises the preset characters, the CSV file is characterized to be of a first type, and when the CSV file is of the first type, the CSV file can be converted into a second type.
Specifically, the CSV file may be converted from the first type to the second type by deleting the preset characters.
In the method provided by this embodiment, when the file content is converted into the byte array buff, if the file content is found to include the preset character, the preset character may be deleted, so as to implement the conversion of the file type. For example, if a preset character is found when the file content is read, the character in the file content can be directly deleted.
Specifically, the preset character may be \ufeff. If the file content includes \ufeff, the encoding format of the CSV file can be considered to be of the UTF-8+BOM type; in this case, the preset character can be deleted from the CSV file content, so that the UTF-8+BOM type is converted into the UTF-8 type.
Further, when the byte array buff is generated, the byte array buff can be generated according to the file content from which the preset characters are removed.
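A minimal sketch of this conversion, assuming the preset character appears at the very start of the file as the UTF-8 byte-order mark (the bytes EF BB BF, which decode to \ufeff). The name `strip_utf8_bom` is hypothetical.

```python
import codecs


def strip_utf8_bom(content):
    """Return (bytes without a leading UTF-8 BOM, whether a BOM was removed)."""
    data = bytes(content)
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


if __name__ == "__main__":
    buff, was_bom_type = strip_utf8_bom(b"\xef\xbb\xbfname,age\n")
    print(buff, was_bom_type)
```

Python's built-in `utf-8-sig` codec performs the same removal while decoding, which is an alternative when the bytes are decoded immediately.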
Step 302, determining the encoding format of the CSV file according to the byte array.
Specifically, the character distribution mode in the byte array can be identified, and the encoding format of the CSV file can be determined according to the character distribution mode.
Furthermore, the byte array comprises a plurality of characters, and the characters in the byte array can be read and the distribution mode of the characters can be identified.
In practical application, the characters in the CSV files with different encoding formats are distributed in different ways. Therefore, the character distribution mode corresponding to each encoding format can be preset, so that the encoding format of the CSV file can be determined based on the character distribution mode after the character distribution mode of the byte array is identified.
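The disclosure does not spell out what a character distribution mode looks like. One simple possibility, sketched here purely as an assumption, is to measure how the non-ASCII bytes are arranged: in UTF-8 they form lead bytes followed by continuation bytes in the range 0x80 to 0xBF, whereas double-byte encodings such as GBK rarely follow that pattern. The names `utf8_sequence_ratio` and `guess_by_distribution`, the 0.9 threshold, and the two-way choice are all illustrative.

```python
def utf8_sequence_ratio(buff):
    """Fraction of non-ASCII bytes that sit inside well-formed UTF-8 sequences.

    Simplified check: overlong forms and surrogates are not rejected.
    """
    data = bytes(buff)
    high = well_formed = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # ASCII byte, identical in all listed encodings
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:
            need = 0            # stray continuation byte or invalid lead byte
        tail = data[i + 1:i + 1 + need]
        if need and len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            high += need + 1
            well_formed += need + 1
            i += need + 1
        else:
            high += 1
            i += 1
    return well_formed / high if high else 1.0


def guess_by_distribution(buff, threshold=0.9):
    """Guess 'utf-8' when the byte pattern looks like UTF-8, else 'gb18030'."""
    return "utf-8" if utf8_sequence_ratio(buff) >= threshold else "gb18030"
```

Production detectors usually go further and compare character frequencies against per-language statistics; this sketch only looks at byte structure.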
Step 303, determining an interpreter according to the encoding format, and reading the delimiters included in the byte array through the interpreter.
In actual application, the interpreter corresponding to the format can be determined according to the coding format. The interpreter is capable of interpreting the array of bytes for the corresponding encoding format.
The corresponding interpreter corresponding to each encoding format can be preset, and the corresponding interpreter is directly used after the encoding format is determined. Or directly generating a corresponding interpreter according to the determined encoding format, for example, determining a character division rule in the byte array according to the encoding format, and then generating the interpreter capable of performing character division according to the rule based on the rule.
Specifically, the interpreter can process the byte array buff according to the corresponding encoding format. For example, the characters in the byte array buff are divided into lines according to the encoding format, so that each line is a record.
Furthermore, the byte array buff can be read by the interpreter, and the interpreter can interpret the byte array corresponding to the determined encoding format, so that the interpreter can accurately analyze the buff.
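The two ways of obtaining an interpreter, preset or generated on demand, might be sketched as below. Here an "interpreter" is modeled as a function that decodes the byte array with one encoding and splits it into records; that modeling, the preset list, and the names `make_interpreter` and `get_interpreter` are assumptions for illustration.

```python
def make_interpreter(encoding):
    """Build a function that decodes a byte array and splits it into records."""
    def interpret(buff):
        text = bytes(buff).decode(encoding)
        return text.splitlines()  # one record per line
    return interpret


# Interpreters prepared in advance for the encodings expected most often.
PRESET_INTERPRETERS = {
    enc: make_interpreter(enc) for enc in ("utf-8", "gb18030", "gbk", "big5")
}


def get_interpreter(encoding):
    """Use a preset interpreter if one exists, otherwise generate one."""
    return PRESET_INTERPRETERS.get(encoding) or make_interpreter(encoding)
```

Note that `splitlines` also breaks on line separators other than `\n` and `\r\n`, and it does not handle quoted fields that contain line breaks, so a fuller parser would be needed for such files.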
In practice, the separators included in the byte array may be scanned by the interpreter. In the CSV file, delimiters are provided for dividing data.
Wherein the delimiter types may include:
Comma ,
Semicolon ;
Tab \t
Colon :
Vertical bar |
It is also possible to preset the priority of each delimiter so that the delimiters are scanned one by one in the byte array buff based on the priority.
Step 304, determining file data in the CSV file according to the read delimiters.
Specifically, if punctuation marks such as commas and colons also appear in the file data of the CSV file, dividing the file data directly by every delimiter that is read would produce errors. Therefore, the target delimiter can be determined according to the number of delimiters of each kind that are read, and the file data in the CSV file can be determined according to the target delimiter.
Further, the number of scanned delimiters of each kind may be counted, and the kind with the largest count may be determined as the target delimiter, that is, the delimiter used in the current CSV file. The characters in the byte array are then divided by this delimiter to obtain the file data; for example, the characters between two delimiters may be determined as one item of file data.
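The counting rule can be sketched as follows, working on the already decoded text. The priority order of `DELIMITERS` and the use of that order to break ties are assumptions; the disclosure only says that priorities may be preset. All function names are hypothetical.

```python
from collections import Counter

# Candidate delimiters in an assumed priority order.
DELIMITERS = [",", ";", "\t", ":", "|"]


def count_delimiters(text):
    """Count how often each candidate delimiter occurs in the text."""
    counts = Counter(ch for ch in text if ch in DELIMITERS)
    return {d: counts[d] for d in DELIMITERS}


def pick_target_delimiter(text):
    """Most frequent delimiter; earlier entries in DELIMITERS win ties."""
    counts = count_delimiters(text)
    return max(DELIMITERS, key=lambda d: counts[d])


def split_records(text):
    """Split each non-empty line into fields using the target delimiter."""
    delimiter = pick_target_delimiter(text)
    return [line.split(delimiter) for line in text.splitlines() if line]
```

In the sample `"time;note\n10:30;a,b\n11:45;c\n"`, the semicolon occurs three times, the colon twice, and the comma once, so the semicolon is chosen and the colons and comma stay inside the field values. The standard library's `csv.Sniffer` offers a more thorough delimiter guess when quoting must be taken into account.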
Step 305, packaging the file data into a document conforming to a preset format.
In practical application, the file data can be packaged to obtain a document conforming to a preset format. For example, the file data may be packaged into a snapshot of an online document. The file data can also be recorded in the document according to a preset data format, that is, the packaged file data is displayed in the document according to the preset data format, for example preserving formats such as hyperlinks, scientific notation, and dates.
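A hedged sketch of the packaging step: each field is tagged with a data format so that a document renderer could preserve hyperlinks, scientific notation, and dates. The dictionary layout, the recognition rules (including the single `YYYY-MM-DD` date pattern), and the names `classify_cell` and `package_rows` are invented for illustration; the disclosure does not define the document format.

```python
import re
from datetime import datetime


def classify_cell(value):
    """Tag a field as 'hyperlink', 'scientific', 'date', or 'text'."""
    if re.fullmatch(r"https?://\S+", value):
        return "hyperlink"
    if re.fullmatch(r"[+-]?\d+(\.\d+)?[eE][+-]?\d+", value):
        return "scientific"
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date layout
        return "date"
    except ValueError:
        return "text"


def package_rows(rows):
    """Wrap parsed rows in a hypothetical table-shaped document."""
    return {
        "format": "table",
        "rows": [
            [{"value": v, "type": classify_cell(v)} for v in row]
            for row in rows
        ],
    }
```

Keeping the original string in `value` alongside the detected `type` means the display layer can format the cell without losing what was in the CSV file.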
Based on the method provided by the embodiment, the CSV file can be automatically converted into the document with the preset format, and the encoding format of the CSV file can be identified in the conversion process, so that the condition of messy codes is avoided.
Fig. 4 is a block diagram of a file processing device according to an exemplary embodiment of the present disclosure.
As shown in fig. 4, the file processing apparatus 40 provided in the present embodiment includes:
a conversion module 41, configured to receive a character-separated values (CSV) file, and convert file contents in the CSV file into a byte array;
a format determining module 42, configured to determine an encoding format of the CSV file according to the byte array;
and a data extraction module 43, configured to determine file data in the CSV file according to the encoding format.
The file processing device provided by this embodiment comprises: a conversion module, used for receiving a character-separated values (CSV) file and converting the file content of the CSV file into a byte array; a format determining module, used for determining the encoding format of the CSV file according to the byte array; and a data extraction module, used for determining the file data in the CSV file according to the encoding format. Because the device converts the CSV file into a byte array and determines its encoding format from that byte array, the CSV file can be parsed accurately, which avoids the garbled data and encoding errors caused by an unknown file encoding.
Fig. 5 is a block diagram of a file processing apparatus according to another exemplary embodiment of the present disclosure.
As shown in fig. 5, on the basis of the embodiment shown in fig. 4, in the file processing device 50 provided in this embodiment, the data extraction module 43 includes:
a reading unit 431, configured to determine an interpreter according to the encoding format, and read the delimiters included in the byte array through the interpreter;
a data reading unit 432, configured to determine file data in the CSV file according to the read delimiter.
Optionally, the format determining module 42 is specifically configured to:
and identifying a character distribution mode in the byte array, and determining the encoding format of the CSV file according to the character distribution mode.
Optionally, the data reading unit 432 is specifically configured to:
determining target separators according to the number of the read various separators;
and determining file data in the CSV file according to the target separator.
Optionally, when the file content in the CSV file is converted into the byte array, if the file content includes the preset characters, the conversion module 41 converts the encoding format of the CSV file from the first type to the second type.
The conversion module 41 is specifically configured to:
deleting the preset character, and determining the byte array according to the file content after the preset character is deleted.
Optionally, the preset character is \ufeff.
Optionally, the apparatus further comprises a storage module 44 configured to:
packaging the file data into a document conforming to a preset format;
and displaying the packaged file data in the document according to a preset data format.
Referring to fig. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the embodiment of the present disclosure is shown, where the electronic device 600 may be a terminal device or a server. Among them, the terminal Device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a Digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP), a car terminal (e.g., car navigation terminal), etc., and a fixed terminal such as a Digital TV, a desktop computer, etc. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing means 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a file processing method, including:
receiving a character-separated values (CSV) file, and converting the file content of the CSV file into a byte array;
determining the encoding format of the CSV file according to the byte array;
and determining file data in the CSV file according to the encoding format.
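The three steps of the first aspect can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the candidate encoding list, and the trial-decoding strategy are all assumptions, since the disclosure does not name specific encodings or libraries.

```python
import csv
import io
from typing import List

# Assumption for illustration: the disclosure does not list candidate encodings.
CANDIDATE_ENCODINGS = ("utf-8", "gbk")


def detect_encoding(data: bytes) -> str:
    """Return the first candidate encoding that decodes the byte array cleanly."""
    for name in CANDIDATE_ENCODINGS:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so it always succeeds.
    return "latin-1"


def read_csv_bytes(data: bytes) -> List[List[str]]:
    """Take the file content as a byte array, determine its encoding, then parse it."""
    encoding = detect_encoding(data)      # step 2: encoding from the byte array
    text = data.decode(encoding)          # decode according to that encoding
    reader = csv.reader(io.StringIO(text, newline=""))
    return list(reader)                   # step 3: file data as rows of fields


rows = read_csv_bytes("你好,世界\n1,2\n".encode("gbk"))
```

Here the caller performs step one by supplying the file content as `bytes`; the decoding and parsing follow from the detected encoding, so a GBK file and a UTF-8 file with the same content yield the same rows.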
According to one or more embodiments of the present disclosure, the determining file data in the CSV file according to the encoding format includes:
determining an interpreter according to the encoding format, and reading, through the interpreter, the delimiters included in the byte array;
and determining file data in the CSV file according to the read delimiters.
According to one or more embodiments of the present disclosure, the determining an encoding format of the CSV file according to the byte array includes:
identifying a character distribution pattern in the byte array, and determining the encoding format of the CSV file according to the character distribution pattern.
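One way to picture detection based on character distribution is a plausibility score computed over each candidate decoding. The sketch below is a deliberately crude stand-in and not the disclosed algorithm: the scoring rule (share of ASCII or CJK ideograph characters), the candidate list, and all names are assumptions. Statistical detectors such as the chardet library use far richer frequency models.

```python
from typing import Optional, Sequence


def score_text(text: str) -> float:
    """Share of characters that are ASCII or CJK unified ideographs."""
    if not text:
        return 0.0
    plausible = sum(
        1 for ch in text if ch.isascii() or "\u4e00" <= ch <= "\u9fff"
    )
    return plausible / len(text)


def guess_encoding(
    data: bytes, candidates: Sequence[str] = ("utf-8", "gbk", "big5")
) -> Optional[str]:
    """Pick the candidate whose decoded text has the most plausible distribution."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        score = score_text(text)
        if score > best_score:  # earlier candidates win ties
            best_name, best_score = name, score
    return best_name
```

Because a byte sequence is often valid in more than one multi-byte encoding, validity alone is not enough; the score is what separates a decoding that yields ordinary text from one that yields rarely used symbols.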
According to one or more embodiments of the present disclosure, the determining file data in the CSV file according to the read delimiter includes:
determining a target delimiter according to the number of occurrences of each kind of delimiter read;
and determining file data in the CSV file according to the target delimiter.
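Selecting the delimiter by count can be sketched as below. The candidate set and the function names are assumptions made for illustration; the disclosure only states that the choice depends on how many of each kind of delimiter were read.

```python
import csv
import io
from collections import Counter
from typing import List

# Assumption: the disclosure does not enumerate the candidate delimiters.
CANDIDATE_DELIMITERS = (",", ";", "\t", "|")


def pick_delimiter(text: str) -> str:
    """Return the candidate delimiter that occurs most often in the text."""
    counts = Counter(ch for ch in text if ch in CANDIDATE_DELIMITERS)
    if not counts:
        return ","  # fall back to a comma when no candidate appears
    return counts.most_common(1)[0][0]


def split_rows(text: str) -> List[List[str]]:
    """Parse the text into rows using the most frequent candidate delimiter."""
    delimiter = pick_delimiter(text)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


sample = "a;b;c\n1;2,5;3\n"
parsed = split_rows(sample)
```

In the sample, the semicolon appears four times and the comma once, so the semicolon is chosen and `2,5` stays in a single field. This simple count also includes delimiters inside quoted fields, which a production parser would need to exclude; Python's standard `csv.Sniffer` is one existing alternative.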
According to one or more embodiments of the present disclosure, when converting file content in the CSV file into a byte array, if the file content includes a preset character, the encoding format of the CSV file is converted from a first type to a second type.
According to one or more embodiments of the present disclosure, the converting the encoding format of the CSV file from the first type to the second type includes:
deleting the preset character, and determining the byte array according to the file content after the preset character is deleted.
In accordance with one or more embodiments of the present disclosure, the preset character is \ufeff.
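The character U+FEFF is the Unicode byte order mark (BOM). A minimal sketch of deleting it before building the byte array follows. Reading the "first type" as UTF-8 with a BOM and the "second type" as UTF-8 without one is an assumption, as are the function names and the choice to handle only a leading occurrence.

```python
BOM = "\ufeff"  # U+FEFF, the Unicode byte order mark


def to_byte_array(content: str) -> bytes:
    """Delete a leading U+FEFF, then encode the remaining content as UTF-8."""
    if content.startswith(BOM):
        content = content[len(BOM):]
    return content.encode("utf-8")


def decode_without_bom(data: bytes) -> str:
    """Byte-level equivalent: the 'utf-8-sig' codec skips a leading BOM."""
    return data.decode("utf-8-sig")
```

Removing the mark first keeps it from being treated as part of the first field of the first row when the delimiters are read later.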
According to one or more embodiments of the present disclosure, the method further includes:
packaging the file data into a document conforming to a preset format;
and displaying the packaged file data in the document according to a preset data format.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a file processing apparatus, including:
a conversion module, configured to receive a character-separated values (CSV) file and convert the file content of the CSV file into a byte array;
a format determining module, configured to determine the encoding format of the CSV file according to the byte array;
and a data extraction module, configured to determine the file data in the CSV file according to the encoding format.
According to one or more embodiments of the present disclosure, the data extraction module includes:
a reading unit, configured to determine an interpreter according to the encoding format and to read, through the interpreter, the delimiters included in the byte array;
and a data reading unit, configured to determine the file data in the CSV file according to the read delimiters.
According to one or more embodiments of the present disclosure, the format determination module is specifically configured to:
identify a character distribution pattern in the byte array, and determine the encoding format of the CSV file according to the character distribution pattern.
According to one or more embodiments of the present disclosure, the data reading unit is specifically configured to:
determine a target delimiter according to the number of occurrences of each kind of delimiter read;
and determine file data in the CSV file according to the target delimiter.
According to one or more embodiments of the present disclosure, when converting the file content of the CSV file into a byte array, if the file content includes a preset character, the conversion module converts the encoding format of the CSV file from a first type to a second type.
According to one or more embodiments of the present disclosure, the conversion module is specifically configured to delete the preset character, and determine the byte array according to the file content after the preset character is deleted.
In accordance with one or more embodiments of the present disclosure, the preset character is \ufeff.
According to one or more embodiments of the present disclosure, the apparatus further comprises a storage module configured to perform:
packaging the file data into a document conforming to a preset format;
and displaying the packaged file data in the document according to a preset data format.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the file processing method of the first aspect and of its various possible designs described above.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer-executable instructions are stored; when the computer-executable instructions are executed by a processor, the file processing method of the first aspect and of its various possible designs described above is implemented.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present disclosure, and not for limiting the same; while the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (11)

1. A method for processing a file, comprising:
receiving a character-separated values (CSV) file, and converting the file content of the CSV file into a byte array;
determining the encoding format of the CSV file according to the byte array;
and determining file data in the CSV file according to the encoding format.
2. The method of claim 1, wherein the determining file data in the CSV file according to the encoding format comprises:
determining an interpreter according to the encoding format, and reading, through the interpreter, the delimiters included in the byte array;
and determining file data in the CSV file according to the read delimiters.
3. The method of claim 1, wherein determining the encoding format of the CSV file according to the byte array comprises:
identifying a character distribution pattern in the byte array, and determining the encoding format of the CSV file according to the character distribution pattern.
4. The method according to claim 2, wherein said determining file data in the CSV file from the read delimiters comprises:
determining a target delimiter according to the number of occurrences of each kind of delimiter read;
and determining file data in the CSV file according to the target delimiter.
5. The method according to any of claims 1-4, wherein when converting the file content in the CSV file into a byte array, if the file content comprises a preset character, the encoding format of the CSV file is converted from a first type to a second type.
6. The method of claim 5, wherein converting the encoding format of the CSV file from a first type to a second type comprises:
deleting the preset character, and determining the byte array according to the file content after the preset character is deleted.
7. The method of any of claims 1-4, 6, further comprising:
packaging the file data into a document conforming to a preset format;
and displaying the packaged file data in the document according to a preset data format.
8. A file processing apparatus, comprising:
a conversion module, configured to receive a character-separated values (CSV) file and convert the file content of the CSV file into a byte array;
a format determining module, configured to determine the encoding format of the CSV file according to the byte array;
and a data extraction module, configured to determine the file data in the CSV file according to the encoding format.
9. The apparatus of claim 8, wherein the data extraction module comprises:
a reading unit, configured to determine an interpreter according to the encoding format and to read, through the interpreter, the delimiters included in the byte array;
and a data reading unit, configured to determine the file data in the CSV file according to the read delimiters.
10. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the file processing method as claimed in any one of claims 1 to 7.
11. A computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, the file processing method according to any one of claims 1 to 7 is implemented.
CN202010750284.1A 2020-07-30 2020-07-30 File processing method and device and readable storage medium Pending CN111898340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010750284.1A CN111898340A (en) 2020-07-30 2020-07-30 File processing method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010750284.1A CN111898340A (en) 2020-07-30 2020-07-30 File processing method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN111898340A true CN111898340A (en) 2020-11-06

Family

ID=73182721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010750284.1A Pending CN111898340A (en) 2020-07-30 2020-07-30 File processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111898340A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040254919A1 (en) * 2003-06-13 2004-12-16 Microsoft Corporation Log parser
CN102567293A (en) * 2010-12-13 2012-07-11 汉王科技股份有限公司 Coded format detection method and coded format detection device for text files
CN105260422A (en) * 2015-09-28 2016-01-20 西北核技术研究所 Multi-format waveform data file batch processing method
CN106534267A (en) * 2016-10-19 2017-03-22 中国银行股份有限公司 File uploading and resolving method and device
CN108763175A (en) * 2018-06-26 2018-11-06 中国银行股份有限公司 A kind of csv file processing method and system
CN109271425A (en) * 2018-09-30 2019-01-25 北京字节跳动网络技术有限公司 It constructs the method for rumour database, analyze the method and electronic equipment of rumour data
US10204119B1 (en) * 2017-07-20 2019-02-12 Palantir Technologies, Inc. Inferring a dataset schema from input files
US20190163684A1 (en) * 2017-11-30 2019-05-30 Craig Hurlbut Method and system for converting data into a software application compatible format
CN110147536A (en) * 2019-05-24 2019-08-20 深圳市多翼创新科技有限公司 A kind of data processing method based on File Mapping, device and equipment
CN110472434A (en) * 2019-07-12 2019-11-19 北京字节跳动网络技术有限公司 Data desensitization method, system, medium and electronic equipment
CN110674199A (en) * 2019-08-13 2020-01-10 中国电建集团贵阳勘测设计研究院有限公司 Method and device for converting csv format data into SEG-2 format data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
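Zhang Baihui: "Research on Format-Preserving Encryption Technology for Big Data Publishing", China Master's Theses Full-text Database, Information Science and Technology *
Li Ying: "Python Visual Data Analysis", Beijing: China Railway Publishing House, pages: 14 - 16 *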

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112540958A (en) * 2020-12-08 2021-03-23 北京百度网讯科技有限公司 File processing method, device, equipment and computer storage medium
CN112540958B (en) * 2020-12-08 2023-08-29 北京百度网讯科技有限公司 File processing method, device, equipment and computer storage medium
CN113595683A (en) * 2021-07-07 2021-11-02 西安震有信通科技有限公司 Conversion processing method, device, terminal and medium based on various encoding files

Similar Documents

Publication Publication Date Title
CN111314388B (en) Method and apparatus for detecting SQL injection
CN111898340A (en) File processing method and device and readable storage medium
CN111046135A (en) Unstructured text processing method and device, computer equipment and storage medium
CN113657113A (en) Text processing method and device and electronic equipment
CN111325096A (en) Live stream sampling method and device and electronic equipment
CN110008807B (en) Training method, device and equipment for contract content recognition model
US8930808B2 (en) Processing rich text data for storing as legacy data records in a data storage system
CN112487765B (en) Method and device for generating notification text
CN114547040A (en) Data processing method, device, equipment and medium
CN110780898B (en) Page data upgrading method and device and electronic equipment
CN111027281B (en) Word segmentation method, device, equipment and storage medium
CN112363693A (en) Code text processing method, device, equipment and storage medium
CN114239501A (en) Contract generation method, apparatus, device and medium
CN113807056A (en) Method, device and equipment for correcting error of document name sequence number
CN109426357B (en) Information input method and device
CN112364621A (en) Method and system for analyzing rule text based on RUTA rule language
CN110852042A (en) Character type conversion method and device
CN111079185A (en) Database information processing method and device, storage medium and electronic equipment
CN111581331B (en) Method, device, electronic equipment and computer readable medium for processing text
CN110287147B (en) Character string sorting method and device
US11379664B2 (en) Method for acquiring a parallel corpus, electronic device, and storage medium
CN110891010B (en) Method and apparatus for transmitting information
CN110706309B (en) Method and device for generating fishbone map
CN117707541A (en) Character string data collection method, device, electronic equipment and computer readable medium
CN116108809A (en) Message data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination