CN112749125A - Text processing method and device and text processing system - Google Patents

Text processing method and device and text processing system

Publication number
CN112749125A
Authority
CN
China
Prior art keywords
file
index
text
strategy
scanning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110045135.XA
Other languages
Chinese (zh)
Other versions
CN112749125B (en)
Inventor
王淇
赵晶
王志海
喻波
安鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd filed Critical Beijing Wondersoft Technology Co Ltd
Priority to CN202110045135.XA priority Critical patent/CN112749125B/en
Priority claimed from CN202110045135.XA external-priority patent/CN112749125B/en
Publication of CN112749125A publication Critical patent/CN112749125A/en
Application granted granted Critical
Publication of CN112749125B publication Critical patent/CN112749125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files

Abstract

The invention discloses a text processing method and device and a text processing system. The method comprises the following steps: receiving a text to be processed; scanning the text and storing it in fragments according to a fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises a plurality of fragment contents and an index file composed of the index information of each fragment content; loading the index file and locating files based on the index information in the index file to obtain the corresponding file content; and performing partition parsing on each located file content using multiple threads. The invention solves the technical problem in the prior art that breakpoint parsing scans file information and persists the scan results to a database in order to locate the working point before a failure, so that the database is operated on frequently during scanning and storage and system throughput is poor.

Description

Text processing method and device and text processing system
Technical Field
The invention relates to the technical field of data processing, in particular to a text processing method and device and a text processing system.
Background
After a system recovers from a failure, breakpoint techniques allow it to quickly locate the working point it had reached before the failure and to continue working from that point. That is, when file parsing is interrupted by a program failure or a similar cause, the interruption position can be located quickly after recovery, and parsing can continue from that position rather than from the beginning, which gives the system high availability. However, existing implementations of such techniques rely on database persistence: scanned file information is persisted to a database, and after recovery the working point before the failure is obtained by reading the database. Because the scanned information must be written to the database, the database is operated on frequently, its performance degrades, and the overall throughput of the system suffers.
No effective solution has yet been proposed for the problem that, in the prior art, breakpoint parsing scans file information and persists the scan results to a database in order to locate the working point before a failure, so that the database is operated on frequently during scanning and storage and system throughput is poor.
Disclosure of Invention
The embodiment of the invention provides a text processing method and device and a text processing system, which at least solve the technical problems that in the prior art, file information is scanned by adopting a breakpoint analysis technology, a fault working point is positioned, a scanning result is persisted to a database, and the system throughput is poor due to frequent operation of the database in a scanning and warehousing process.
According to an aspect of an embodiment of the present invention, there is provided a text processing method, including: receiving a text to be processed; scanning the text, and performing fragment storage on the text by using a fragment strategy to acquire at least one fragment file, wherein the fragment file comprises: a plurality of fragmented contents, an index file composed of index information of each fragmented content; loading the index file, and positioning the file based on the index information in the index file to obtain corresponding file content; and respectively carrying out partition analysis on the contents of each file obtained by positioning by adopting multiple threads.
Optionally, before scanning the text, the text processing method further includes: receiving an issued configuration file, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file.
Optionally, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is periodically updated.
Optionally, the index file includes at least the following two fields: a starting position and an offset of the storage address used to locate the file content, wherein the file content includes at least the following two fields: an index code and data meta information.
According to another aspect of the embodiments of the present invention, there is provided a text processing apparatus, including: the receiving unit is used for receiving the text to be processed; a scanning unit, configured to scan the text, perform fragment storage on the text using a fragment policy, and obtain at least one fragment file, where the fragment file includes: a plurality of fragmented contents, an index file composed of index information of each fragmented content; the acquisition unit is used for loading the index file, and positioning the file based on the index information in the index file to obtain the corresponding file content; and the analysis unit is used for respectively carrying out partition analysis on each file content obtained by positioning by adopting multithreading.
Optionally, in the text processing apparatus, the receiving unit is further configured to receive an issued configuration file before the text is scanned, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file.
Optionally, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is periodically updated.
Optionally, the index file includes at least the following two fields: a starting position and an offset of the storage address used to locate the file content, wherein the file content includes at least the following two fields: an index code and data meta information.
According to another aspect of the embodiments of the present invention, there is also provided a text processing system, including: a control subsystem for providing a configuration file, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file; a file scanning subsystem, in communication with the control subsystem, for scanning a text to be processed and storing the text in fragments according to the fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises: a plurality of fragment contents, and an index file composed of the index information of each fragment content; and a file parsing subsystem for loading the index file, locating files based on the index information in the index file to obtain the corresponding file content, and performing partition parsing on each located file content using multiple threads.
Optionally, the text processing system further comprises: a configuration center, in communication with the control subsystem, the file scanning subsystem, and the file parsing subsystem respectively, for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file parsing subsystem.
Optionally, the text processing system further comprises: a file server, in communication with the file scanning subsystem, for sending the files under a specified directory to the file scanning subsystem.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, which includes a stored computer program, wherein when the computer program is executed by a processor, the computer program controls an apparatus in which the computer storage medium is located to execute any one of the above text processing methods.
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a computer program, wherein the computer program, when run, performs the text processing method according to any one of the above.
In the embodiment of the invention, a text to be processed is received; the text is scanned and stored in fragments according to a fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises: a plurality of fragment contents, and an index file composed of the index information of each fragment content; the index file is loaded, and files are located based on the index information in the index file to obtain the corresponding file content; and partition parsing is performed on each located file content using multiple threads. By introducing the index file, the text processing method accelerates the locating of file information and achieves the technical effect of improving system throughput, thereby solving the technical problem in the prior art that breakpoint parsing scans file information and persists the scan results to a database in order to locate the working point before a failure, so that the database is operated on frequently during scanning and storage and system throughput is poor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method of processing text according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a system for processing text according to an embodiment of the invention;
FIG. 3 is a schematic diagram of scanning a document according to an embodiment of the invention;
FIG. 4 is a schematic diagram of file parsing according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a method of processing text according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a text processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a system for processing text in accordance with an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided a method embodiment of a method of processing text, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than presented herein.
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present invention, and as shown in fig. 1, the text processing method includes the following steps:
Step S102, receiving a text to be processed.
In this embodiment, the text to be processed is received first.
Step S104, scanning the text, and storing the text in fragments according to a fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises: a plurality of fragment contents, and an index file composed of the index information of each fragment content.
In this embodiment, the received text may be scanned, the scan result is stored in fragments according to the fragmentation policy, and at least one fragment file is thereby obtained.
Step S106, loading the index file, and locating files based on the index information in the index file to obtain the corresponding file content.
In this embodiment, the index file may be loaded, and files may be located based on the index information in the index file to obtain the corresponding file content.
Step S108, performing partition parsing on each located file content using multiple threads.
In this embodiment, multiple threads may be used to perform partition parsing on each file content obtained by locating.
As can be seen from the above, in the embodiment of the present invention, a text to be processed may be received; the text is scanned and stored in fragments according to a fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises: a plurality of fragment contents, and an index file composed of the index information of each fragment content; the index file is loaded, and files are located based on the index information in the index file to obtain the corresponding file content; and partition parsing is performed on each located file content using multiple threads. Introducing the index file thus accelerates the locating of file information and achieves the technical effect of improving system throughput.
Therefore, the text processing method provided by the embodiment of the invention solves the technical problem in the prior art that breakpoint parsing scans file information and persists the scan results to a database in order to locate the working point before a failure, so that the database is operated on frequently during scanning and storage and system throughput is poor.
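Steps S102 to S108 can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the fixed-size byte fragmentation policy, the function names (shard_text, locate, parse_fragment, process), and the placeholder parse step are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_text(text, shard_size):
    """S104: fragment the text and build index entries of (start, length).

    The fixed-size policy is an assumption; the source does not specify one.
    """
    data = text.encode("utf-8")
    index = []
    for start in range(0, len(data), shard_size):
        index.append((start, min(shard_size, len(data) - start)))
    return data, index


def locate(data, entry):
    """S106: use one index entry to locate the corresponding content."""
    start, length = entry
    return data[start:start + length]


def parse_fragment(fragment):
    """Placeholder parse step (hypothetical): report the fragment size."""
    return len(fragment)


def process(text, shard_size=8, workers=4):
    """S102 to S108: receive, fragment, locate, and parse with a thread pool."""
    data, index = shard_text(text, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: parse_fragment(locate(data, e)), index))
```

Because each worker only needs an index entry to find its content, no shared database is touched during parsing, which is the property the method relies on.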
Fig. 2 is a schematic diagram of a text processing system according to an embodiment of the present invention. As shown in fig. 2, the system is implemented mainly by three cooperating subsystems: a console subsystem, a file scanning subsystem, and a file parsing subsystem. The file server communicates with the file scanning subsystem; the file scanning subsystem stores the file information it reads in fragments and maintains the index file according to the index fragmentation policy issued by the console subsystem. The console subsystem provides a visual interface for issuing system configuration parameters (such as policy information, action types, and scan index file distribution policies) and for background management. The file parsing subsystem reads the index file according to the fragmentation policy, locates the specific file information according to the index, obtains the source file for parsing, and reports file information that hits a sensitive policy to the console.
Index fragmentation divides one large index file into a plurality of small index files according to a given policy, which makes the index easier to maintain and can improve data retrieval efficiency.
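As a minimal sketch of index fragmentation, the following splits one list of index entries into smaller shards and looks an entry up again. The count-based policy (a fixed number of entries per shard) and the names shard_index and lookup are assumptions; the source leaves the policy open.

```python
def shard_index(entries, entries_per_shard):
    """Split one large index into several small index shards."""
    if entries_per_shard <= 0:
        raise ValueError("entries_per_shard must be positive")
    return [
        entries[i:i + entries_per_shard]
        for i in range(0, len(entries), entries_per_shard)
    ]


def lookup(shards, entries_per_shard, entry_no):
    """Find entry number entry_no by computing its shard and slot directly."""
    shard_no, slot = divmod(entry_no, entries_per_shard)
    return shards[shard_no][slot]
```

With a count-based policy the shard that holds an entry can be computed arithmetically, so only that one small shard has to be loaded to answer a lookup.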
In an optional embodiment, before scanning the text, the text processing method may further include: receiving an issued configuration file, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file.
In this embodiment, the console subsystem may issue the configuration file to the configuration center, for example the policy file and index file fragmentation policy used to determine the fragmentation policy and the index file, together with system configuration information.
That is, in the embodiment of the present invention, the user may issue parameters such as system configuration information, policy information, and the index file fragmentation policy to the configuration center through the console.
In an alternative embodiment, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is periodically updated.
In this embodiment, each subsystem obtains its parameter information by interacting with the configuration center. For example, the file scanning subsystem may obtain the system configuration information and the index file fragmentation policy from the configuration center, and the file parsing subsystem may obtain the system configuration information and the policy information from the configuration center. That is, in the embodiment of the present invention, each subsystem (e.g., the file scanning subsystem and the file parsing subsystem) may obtain its corresponding configuration information from the configuration center and keep it updated.
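One way a subsystem might keep its configuration up to date is sketched below. The ConfigCache class, the fetch callable standing in for a request to the configuration center, and the refresh-on-read behavior are all assumptions; the source only states that the configuration information is periodically updated.

```python
import time


class ConfigCache:
    """Hypothetical client-side cache of configuration center data."""

    def __init__(self, fetch, refresh_seconds, clock=time.monotonic):
        self._fetch = fetch              # stands in for a config center call
        self._refresh = refresh_seconds  # assumed refresh period
        self._clock = clock              # injectable so it can be tested
        self._value = fetch()
        self._loaded_at = clock()

    def get(self):
        """Return the cached configuration, re-fetching once it is stale."""
        now = self._clock()
        if now - self._loaded_at >= self._refresh:
            self._value = self._fetch()
            self._loaded_at = now
        return self._value
```

A background timer thread would be an equally valid design; refreshing lazily on read is chosen here only because it keeps the sketch short and deterministic.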
In an alternative embodiment, the index file includes at least the following two fields: a starting position and an offset of the storage address used to locate the file content, wherein the file content includes at least the following two fields: an index code and data meta information.
In this embodiment, the file scanning subsystem may store the scanned files in fragments and maintain the corresponding index information according to the fragmentation policy issued by the console. The index file mainly contains two fields: the offset, and the starting position at which the record is stored in the data file. The data file mainly contains two fields: the index code (offset) and the data meta information (datainfo).
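The two-file layout can be sketched as follows. The binary encoding (big-endian fixed-width integers), the length prefix on datainfo, and the assumption that offsets are dense and start at 0 are illustrative choices that are not stated in the source; append_record and read_record are hypothetical names.

```python
import io
import struct

INDEX_ENTRY = struct.Struct(">QQ")    # (offset, position in data file)
RECORD_HEADER = struct.Struct(">QI")  # (offset, length of datainfo)


def append_record(data_file, index_file, offset, datainfo):
    """Write one record to the data file and its entry to the index file."""
    data_file.seek(0, io.SEEK_END)
    position = data_file.tell()
    data_file.write(RECORD_HEADER.pack(offset, len(datainfo)))
    data_file.write(datainfo)
    index_file.seek(0, io.SEEK_END)
    index_file.write(INDEX_ENTRY.pack(offset, position))


def read_record(data_file, index_file, offset):
    """Locate a record through the index, then read it from the data file.

    Assumes dense offsets 0..n-1, so entry i sits at byte i * entry size.
    """
    index_file.seek(offset * INDEX_ENTRY.size)
    raw = index_file.read(INDEX_ENTRY.size)
    if len(raw) < INDEX_ENTRY.size:
        raise KeyError(offset)
    _, position = INDEX_ENTRY.unpack(raw)
    data_file.seek(position)
    header = data_file.read(RECORD_HEADER.size)
    rec_offset, length = RECORD_HEADER.unpack(header)
    return rec_offset, data_file.read(length)
```

The functions accept any seekable binary file object, so the same code works with files opened in "r+b" mode or with io.BytesIO buffers.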
Fig. 3 is a schematic diagram of scanning a file according to an embodiment of the present invention. As shown in fig. 3, the file server obtains the file information under a specified directory and sends the obtained files to the file scanning subsystem; the file scanning subsystem writes the file information into the data file and maintains the corresponding index information according to the fragmentation policy. The policy information, the index file fragmentation policy, and the system configuration information are issued to the configuration center through the console.
Fig. 4 is a schematic diagram of file parsing according to an embodiment of the present invention. As shown in fig. 4, after a data file is located through the index file, it is sent to the file parsing subsystem, which determines whether the file hits a policy and reports any hit to the console. In this way, the file parsing subsystem quickly locates file information by loading the index file and parses it in partitions using multiple threads, which improves the overall throughput of the system.
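Multi-threaded partition parsing with hit detection might look like the sketch below. The round-robin partitioning, the keyword-matching notion of a policy hit, and all function names are assumptions; the source does not describe how policies are evaluated.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n):
    """Deal items round-robin into n partitions (an assumed policy)."""
    return [items[i::n] for i in range(n)]


def scan_partition(records, keywords):
    """Return (name, matched keywords) for each record that hits a keyword."""
    hits = []
    for name, content in records:
        matched = [k for k in keywords if k in content]
        if matched:
            hits.append((name, matched))
    return hits


def parse_in_partitions(records, keywords, workers=3):
    """Parse each partition on its own thread and merge the reported hits."""
    parts = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: scan_partition(p, keywords), parts))
    return sorted(hit for part in results for hit in part)
```

The merged hit list is what a parsing subsystem would report back to the console in this sketch.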
Fig. 5 is a schematic diagram of a text processing method according to an embodiment of the present invention. As shown in fig. 5, the DLP background management system in the data leakage prevention (DLP) console issues policy information, the index file fragmentation policy, and system configuration information to the configuration center; the configuration center sends the received information to the file scanning subsystem, which stores the file information it reads in fragments and maintains the index file according to the index fragmentation policy issued by the console.
The text processing method provided by the embodiment of the invention addresses the performance bottleneck that the prior art encounters when scanning and parsing massive numbers of files, and provides a way to locate the breakpoint quickly after recovery when parsing is interrupted during scanning and parsing by a program failure or another cause. Compared with traditional file scanning and parsing, the method improves the overall throughput of the system by adopting a partition scanning policy; it introduces index files, which accelerate the locating of file information and enable fast scanning; and when the system crashes during parsing, it can continue parsing from the offset parameter after restarting, which implements breakpoint parsing.
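Resuming from an offset after a crash can be sketched as follows. Storing the next offset in a small JSON checkpoint file, and the fail_at parameter used to simulate a crash, are assumptions made for illustration; the source says only that parsing continues from the offset parameter.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the next offset to parse, or 0 if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)["next_offset"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_offset):
    """Persist the offset atomically by writing a temp file and renaming."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"next_offset": next_offset}, fh)
    os.replace(tmp, path)


def parse_from_checkpoint(records, checkpoint_path, handle, fail_at=None):
    """Parse records from the saved offset; return the offset resumed from."""
    start = load_checkpoint(checkpoint_path)
    for offset in range(start, len(records)):
        if offset == fail_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        handle(records[offset])
        save_checkpoint(checkpoint_path, offset + 1)
    return start
```

A second call with the same checkpoint path skips every record that was already handled, so no database read is needed to find the working point.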
Example 2
According to another aspect of the embodiment of the present invention, there is also provided a text processing apparatus. Fig. 6 is a schematic diagram of a text processing apparatus according to an embodiment of the present invention. As shown in fig. 6, the text processing apparatus may include: a receiving unit 61, a scanning unit 63, an acquiring unit 65, and an analyzing unit 67. The text processing apparatus is described below.
A receiving unit 61, configured to receive a text to be processed.
The scanning unit 63 is configured to scan a text, perform fragment storage on the text by using a fragment policy, and acquire at least one fragment file, where the fragment file includes: a plurality of fragmented contents, and an index file composed of index information for each fragmented content.
The obtaining unit 65 is configured to load an index file, perform file location based on index information in the index file, and obtain corresponding file content.
And the analysis unit 67 is configured to perform partition analysis on each file content obtained by positioning by using multiple threads.
It should be noted that the receiving unit 61, the scanning unit 63, the acquiring unit 65, and the analyzing unit 67 correspond to steps S102 to S108 in embodiment 1; the examples and application scenarios implemented by these units are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should also be noted that the above units, as part of an apparatus, may be implemented in a computer system, for example as a set of computer-executable instructions.
As can be seen from the above, in the above embodiments of the present application, the receiving unit may be used to receive the text to be processed; the scanning unit then scans the text and stores it in fragments according to a fragmentation policy to obtain at least one fragment file, wherein the fragment file comprises: a plurality of fragment contents, and an index file composed of the index information of each fragment content; the acquiring unit then loads the index file and locates files based on the index information in the index file to obtain the corresponding file content; and the analyzing unit performs partition parsing on each located file content using multiple threads. By introducing the index file, the text processing apparatus provided by the embodiment of the invention accelerates the locating of file information and achieves the technical effect of improving system throughput, thereby solving the technical problem in the prior art that breakpoint parsing scans file information and persists the scan results to a database in order to locate the working point before a failure, so that the database is operated on frequently during scanning and storage and system throughput is poor.
In an optional embodiment, in the text processing apparatus, the receiving unit is further configured to receive an issued configuration file before the text is scanned, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file.
In an alternative embodiment, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is periodically updated.
In an alternative embodiment, the index file includes at least the following two fields: a starting position and an offset of the storage address used to locate the file content, wherein the file content includes at least the following two fields: an index code and data meta information.
Example 3
According to another aspect of the embodiment of the present invention, there is also provided a system for processing a text, and fig. 7 is a schematic diagram of a system for processing a text according to an embodiment of the present invention, as shown in fig. 7, the system for processing a text includes:
A control subsystem 71 for providing a configuration file, wherein the configuration file comprises: a policy file used to determine the fragmentation policy, and an index file fragmentation policy used to determine the index file.
The file scanning subsystem 73 is in communication with the control subsystem, and is configured to scan a text to be processed, perform fragment storage on the text using a fragment policy, and obtain at least one fragment file, where the fragment file includes: a plurality of fragmented contents, and an index file composed of index information for each fragmented content.
The file analysis subsystem 75 is configured to load an index file, perform file positioning based on index information in the index file to obtain corresponding file contents, and perform partition analysis on each file content obtained by positioning by using multiple threads.
Through the text processing system provided by the embodiment of the invention, the control subsystem can be used for providing the configuration file, wherein the configuration file comprises the following components: the method comprises the steps of determining a fragmentation strategy file of a fragmentation strategy and determining a strategy file fragmentation strategy of an index file; then, a file scanning subsystem communicated with the control subsystem is used for scanning the text to be processed, a fragmentation strategy is used for carrying out fragmentation storage on the text, and at least one fragmentation file is obtained, wherein the fragmentation file comprises: a plurality of fragmented contents, an index file composed of index information of each fragmented content; the file analysis subsystem is used for loading the index file, file positioning is carried out based on index information in the index file to obtain corresponding file contents, multithreading is adopted for carrying out partition analysis on each file content obtained through positioning, the aim of accelerating the file information positioning by introducing the index file is achieved, the technical effect of improving the system throughput is achieved, and the technical problem that the system throughput is poor due to the fact that the database is frequently operated in the process of scanning and warehousing due to the fact that the breakpoint analysis technology is adopted to scan the file information, the working point with faults is located and the scanning result is persisted to the database in the prior art is solved.
In an optional embodiment, the system for processing text further comprises: and the configuration center is respectively communicated with the control subsystem, the file scanning subsystem and the file analyzing subsystem, and is used for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file analyzing subsystem.
In an optional embodiment, the text processing system further comprises a file server, which communicates with the file scanning subsystem and is used for sending the files under a specified directory to the file scanning subsystem.
Example 4
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium comprising a stored computer program, wherein, when the computer program is run by a processor, the device on which the computer-readable storage medium is located is controlled to execute any one of the text processing methods described above.
Example 5
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a computer program, wherein the computer program, when running, executes any one of the text processing methods described above.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A method for processing text, comprising:
receiving a text to be processed;
scanning the text, and performing fragment storage on the text by using a fragment strategy to acquire at least one fragment file, wherein the fragment file comprises: a plurality of fragmented contents, an index file composed of index information of each fragmented content;
loading the index file, and positioning the file based on the index information in the index file to obtain corresponding file content;
and respectively carrying out partition analysis on the contents of each file obtained by positioning by adopting multiple threads.
2. The method of claim 1, wherein prior to scanning the text, the method further comprises: receiving an issued configuration file, wherein the configuration file comprises: a fragment strategy file used for determining the fragment strategy, and a strategy file used for determining the index file.
3. The method of claim 2, wherein the configuration file is obtained from a configuration center and the configuration information in the configuration file is periodically updated.
4. The method of claim 1, wherein the index file comprises at least the following two fields: a start position and an offset of a storage address, which are used for locating the file content, and wherein the file content comprises at least the following two fields: an index code and data meta information.
5. A text processing apparatus, comprising:
the receiving unit is used for receiving the text to be processed;
a scanning unit, configured to scan the text, perform fragment storage on the text using a fragment policy, and obtain at least one fragment file, where the fragment file includes: a plurality of fragmented contents, an index file composed of index information of each fragmented content;
the acquisition unit is used for loading the index file, and positioning the file based on the index information in the index file to obtain the corresponding file content;
and the analysis unit is used for respectively carrying out partition analysis on each file content obtained by positioning by adopting multithreading.
6. The apparatus of claim 5, wherein the receiving unit is further configured to receive an issued configuration file before the text is scanned, wherein the configuration file comprises: a fragment strategy file used for determining the fragment strategy, and a strategy file used for determining the index file.
7. The apparatus of claim 6, wherein the configuration file is obtained from a configuration center, and the configuration information in the configuration file is periodically updated.
8. The apparatus of claim 5, wherein the index file comprises at least the following two fields: a start position and an offset of a storage address, which are used for locating the file content, and wherein the file content comprises at least the following two fields: an index code and data meta information.
9. A system for processing text, comprising:
a control subsystem for providing a configuration file, wherein the configuration file comprises: a fragmentation strategy file used for determining a fragmentation strategy, and a strategy file used for determining an index file;
a file scanning subsystem, which is communicated with the control subsystem, and is used for scanning a text to be processed, and using the fragmentation strategy to perform fragmentation storage on the text to obtain at least one fragmentation file, wherein the fragmentation file comprises: a plurality of fragmented contents, an index file composed of index information of each fragmented content;
and the file analysis subsystem is used for loading the index file, positioning the file based on the index information in the index file to obtain corresponding file contents, and respectively carrying out partition analysis on each file content obtained by positioning by adopting multithreading.
10. The system for processing text according to claim 9, further comprising:
a configuration center, which communicates with the control subsystem, the file scanning subsystem and the file analysis subsystem respectively, and is used for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file analysis subsystem.
11. The system for processing text according to claim 9, further comprising:
a file server, which communicates with the file scanning subsystem and is used for sending the files under a specified directory to the file scanning subsystem.
12. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed by a processor, controls an apparatus in which the computer storage medium resides to perform the method of processing text as claimed in any one of claims 1 to 4.
13. A processor, characterized in that the processor is configured to run a computer program, wherein the computer program is configured to execute the method for processing text according to any one of claims 1 to 4 when running.
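The fields named in claims 4 and 8 can be made concrete with a small sketch: each index entry holds a start position and an offset, and each located file content holds an index code and data meta information. Everything beyond those four field names is an assumption made for illustration, including the fixed 16-byte binary entry layout, the reading of "offset" as a byte length measured from the start position, the JSON encoding of the content, and the helper names `write_records` and `read_record`.

```python
import io
import json
import struct
from typing import BinaryIO, Iterable, Tuple

# Assumed layout: two little-endian unsigned 64-bit integers per index entry.
# "offset" is interpreted here as the byte length counted from the start position.
INDEX_ENTRY = struct.Struct("<QQ")


def write_records(records: Iterable[Tuple[str, dict]],
                  data_buf: BinaryIO, index_buf: BinaryIO) -> None:
    """Append each (index code, meta information) record and its index entry."""
    data_buf.seek(0, io.SEEK_END)
    index_buf.seek(0, io.SEEK_END)
    for index_code, meta in records:
        payload = json.dumps({"index_code": index_code, "meta": meta}).encode("utf-8")
        start = data_buf.tell()
        data_buf.write(payload)
        index_buf.write(INDEX_ENTRY.pack(start, len(payload)))


def read_record(data_buf: BinaryIO, index_buf: BinaryIO, n: int) -> dict:
    """Locate the n-th record through its fixed-size index entry and decode it."""
    index_buf.seek(n * INDEX_ENTRY.size)
    start, offset = INDEX_ENTRY.unpack(index_buf.read(INDEX_ENTRY.size))
    data_buf.seek(start)
    return json.loads(data_buf.read(offset).decode("utf-8"))
```

With fixed-size index entries, the n-th entry sits at byte `n * INDEX_ENTRY.size`, so a record can be located with two seeks and without scanning the preceding content.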
CN202110045135.XA 2021-01-13 Text processing method and device and text processing system Active CN112749125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110045135.XA CN112749125B (en) 2021-01-13 Text processing method and device and text processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110045135.XA CN112749125B (en) 2021-01-13 Text processing method and device and text processing system

Publications (2)

Publication Number Publication Date
CN112749125A true CN112749125A (en) 2021-05-04
CN112749125B CN112749125B (en) 2024-05-03

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468119A (en) * 2021-05-31 2021-10-01 北京明朝万达科技股份有限公司 File scanning method and device
CN113836088A (en) * 2021-08-31 2021-12-24 北京明朝万达科技股份有限公司 File processing method, system and device based on depth scanning and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101193273A (en) * 2006-11-20 2008-06-04 中兴通讯股份有限公司 A storage and playing method for real time multimedia image information
US9483568B1 (en) * 2013-06-05 2016-11-01 Google Inc. Indexing system
CN108089977A (en) * 2017-11-28 2018-05-29 维沃移动通信有限公司 A kind of abnormality eliminating method of application program, device and mobile terminal
CN111367860A (en) * 2018-12-26 2020-07-03 北京奇虎科技有限公司 File refreshing method and device
CN111831622A (en) * 2020-03-31 2020-10-27 北京嘀嘀无限科技发展有限公司 Data index generation method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US9575984B2 (en) Similarity analysis method, apparatus, and system
US9619512B2 (en) Memory searching system and method, real-time searching system and method, and computer storage medium
CN105512283A (en) Data quality management and control method and device
CN111291079A (en) Data query method and device
CN103020521B (en) Wooden horse scan method and system
CN108376171B (en) Method and device for quickly importing big data, terminal equipment and storage medium
US20150066877A1 (en) Segment combining for deduplication
CN109271545B (en) Feature retrieval method and device, storage medium and computer equipment
CN105260639A (en) Face recognition system data update method and device
CN113297182A (en) Data migration method, device, storage medium and program product
CN110888837A (en) Object storage small file merging method and device
CN110795614A (en) Index automatic optimization method and device
CN111913925A (en) Data processing method and system in distributed storage system
US20190179804A1 (en) Tracking file movement in a network environment
CN114490554A (en) Data synchronization method and device, electronic equipment and storage medium
CN108255703B (en) SQL script fault repairing method and terminal thereof
CN112749125A (en) Text processing method and device and text processing system
CN112749125B (en) Text processing method and device and text processing system
CN110222046B (en) List data processing method, device, server and storage medium
US10430379B2 (en) Identifying common file-segment sequences
CN110851437A (en) Storage method, device and equipment
CN111385613A (en) Television system repairing method, storage medium and application server
US11132335B2 (en) Systems and methods for file fingerprinting
CN112783835A (en) Index management method and device and electronic equipment
CN105657473A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant