CN112749125B - Text processing method and device and text processing system - Google Patents

Info

Publication number
CN112749125B
Authority
CN
China
Prior art keywords
file
index
slicing
text
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110045135.XA
Other languages
Chinese (zh)
Other versions
CN112749125A (en)
Inventor
王淇
赵晶
王志海
喻波
安鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wondersoft Technology Co Ltd
Original Assignee
Beijing Wondersoft Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wondersoft Technology Co Ltd
Priority to CN202110045135.XA
Publication of CN112749125A
Application granted
Publication of CN112749125B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text processing method, a text processing device, and a text processing system. The method comprises the following steps: receiving text to be processed; scanning the text and storing it in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; loading the index file and locating files based on the index information in the index file to obtain the corresponding file content; and parsing each located file content in partitions using multiple threads. The invention addresses a technical problem of the prior art, in which breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.

Description

Text processing method and device and text processing system
Technical Field
The invention relates to the technical field of data processing, in particular to a text processing method and device and a text processing system.
Background
The core function of breakpoint analysis can be summarized as follows: after a system recovers from a fault, it can quickly locate the working point reached before the fault and continue working from that point. In other words, when file analysis is interrupted by a program failure or a similar cause, the interruption position can be located quickly after recovery and analysis can continue from that position instead of starting over, which makes the system highly available. However, comparable technologies currently on the market are based on database persistence: the scanned file information is persisted to a database, and after fault recovery the database is read to obtain the working point before the fault so that work can continue. In this mode the scanned information must be written to the database, the database is operated frequently, and performance degrades on massive data, which affects the overall throughput of the system.
No effective solution has yet been proposed for the problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
Disclosure of Invention
The embodiments of the invention provide a text processing method, a text processing device, and a text processing system, which at least solve the technical problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
According to an aspect of an embodiment of the present invention, there is provided a text processing method, including: receiving text to be processed; scanning the text and storing it in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; loading the index file and locating files based on the index information in the index file to obtain the corresponding file content; and parsing each located file content in partitions using multiple threads.
Optionally, before scanning the text, the text processing method further includes: receiving an issued configuration file, wherein the configuration file comprises: a slicing policy file for determining the slicing strategy, and an index file slicing policy for determining the index file.
Optionally, the configuration file is obtained from a configuration center, and configuration information in the configuration file is updated periodically.
Optionally, the index file includes at least two fields: an initial position and an offset, which are used for locating the storage address of the file content, wherein the file content includes at least two fields: an index code and data meta information.
According to another aspect of the embodiment of the present invention, there is also provided a text processing apparatus, including: a receiving unit for receiving text to be processed; a scanning unit for scanning the text and storing it in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; an obtaining unit for loading the index file and locating files based on the index information in the index file to obtain the corresponding file content; and a parsing unit for parsing each located file content in partitions using multiple threads.
Optionally, in the text processing device, the receiving unit is further configured to receive an issued configuration file before the text is scanned, where the configuration file includes: a slicing policy file for determining the slicing strategy, and an index file slicing policy for determining the index file.
Optionally, the configuration file is obtained from a configuration center, and configuration information in the configuration file is updated periodically.
Optionally, the index file includes at least two fields: an initial position and an offset, which are used for locating the storage address of the file content, wherein the file content includes at least two fields: an index code and data meta information.
According to another aspect of the embodiment of the present invention, there is also provided a text processing system, including: a control subsystem for providing a configuration file, wherein the configuration file comprises a slicing policy file for determining the slicing strategy and an index file slicing policy for determining the index file; a file scanning subsystem, in communication with the control subsystem, for scanning a text to be processed and storing the text in slices according to the slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; and a file parsing subsystem for loading the index file, locating files based on the index information in the index file to obtain the corresponding file content, and parsing each located file content in partitions using multiple threads.
Optionally, the text processing system further includes: a configuration center, which communicates with the control subsystem, the file scanning subsystem, and the file parsing subsystem respectively, and is used for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file parsing subsystem.
Optionally, the text processing system further includes: a file server, in communication with the file scanning subsystem, for sending the files under a specified directory to the file scanning subsystem.
According to another aspect of the embodiments of the present invention, there is provided a computer readable storage medium including a stored computer program, wherein the computer program, when executed by a processor, controls a device in which the computer storage medium is located to perform the method for processing text according to any one of the above.
According to another aspect of the embodiment of the present invention, there is provided a processor, configured to execute a computer program, where the computer program executes a method for processing text according to any one of the above methods.
In the embodiment of the invention, text to be processed is received; the text is scanned and stored in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; the index file is loaded and files are located based on the index information in the index file to obtain the corresponding file content; and each located file content is parsed in partitions using multiple threads. By introducing the index file, the text processing method provided by the embodiment of the invention speeds up the locating of file information and achieves the technical effect of improving system throughput, thereby solving the technical problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a method of processing text according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a text processing system according to an embodiment of the invention;
FIG. 3 is a schematic illustration of a scanned document according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of file parsing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a text processing method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a text processing device according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a text processing system according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, there is provided a method embodiment of a text processing method, it should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that herein.
Fig. 1 is a flowchart of a text processing method according to an embodiment of the present invention, as shown in fig. 1, the text processing method including the steps of:
Step S102, receiving the text to be processed.
In this embodiment, text to be processed may be received first.
Step S104, scanning the text and storing it in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content.
In this embodiment, the received text to be processed may be scanned, the scanning result may be stored in slices according to the slicing strategy, and at least one sliced file is thereby obtained.
Step S106, loading the index file and locating files based on the index information in the index file to obtain the corresponding file content.
In this embodiment, the index file may be loaded, and files may be located based on the index information in the index file, so as to obtain the corresponding file content.
Step S108, parsing each located file content in partitions using multiple threads.
In this embodiment, multiple threads may be used to parse the located file contents in partitions.
As can be seen from the above, in the embodiment of the present invention, text to be processed may be received; the text is scanned and stored in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; the index file is loaded and files are located based on the index information in the index file to obtain the corresponding file content; and each located file content is parsed in partitions using multiple threads. Introducing the index file thus speeds up the locating of file information and achieves the technical effect of improving system throughput.
Therefore, the text processing method provided by the embodiment of the invention solves the technical problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
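Steps S102 to S108 can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the fixed-size slicing strategy, the `(start, length)` index entries, and the names `slice_text`, `locate`, `parse_piece`, and `process` are all assumptions made for the example, and the "parsing" is a stand-in that only measures each piece.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_text(text, piece_size):
    """S104: slice the text and build an index of (start, length) entries.

    Fixed-size slicing is an assumed strategy chosen for illustration.
    """
    index = [
        (start, min(piece_size, len(text) - start))
        for start in range(0, len(text), piece_size)
    ]
    return text, index


def locate(data, entry):
    """S106: use one index entry to locate the corresponding content."""
    start, length = entry
    return data[start:start + length]


def parse_piece(piece):
    """S108: stand-in analysis; a real parser would inspect the content."""
    return len(piece)


def process(text, piece_size=16, workers=4):
    """Run S104, S106 and S108 on text that was received in S102."""
    data, index = slice_text(text, piece_size)
    pieces = [locate(data, entry) for entry in index]
    # Each located piece is handed to a worker thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_piece, pieces))
```

For instance, `process("a" * 50, piece_size=16)` slices the text into four pieces and parses each piece on the thread pool.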
FIG. 2 is a schematic diagram of a text processing system according to an embodiment of the present invention. As shown in FIG. 2, the system is mainly implemented by three cooperating subsystems: a console subsystem, a file scanning subsystem, and a file parsing subsystem. The console subsystem may deliver system configuration parameters (e.g., policy information, action type, scan index file allocation policy) and provide a visual interface for background management. The file scanning subsystem may store the file information it reads in slices according to the index slicing strategy issued by the console subsystem and maintain an index file. The file parsing subsystem may read the index file according to the slicing strategy, locate specific file information according to the index, obtain the source file for parsing, and report file information that hits a sensitive policy to the console.
Index slicing splits a large index file into a plurality of small index files according to a given strategy, which simplifies maintenance and can improve data retrieval efficiency.
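The idea of index slicing can be illustrated with a short sketch. The source does not specify the splitting strategy, so a fixed number of entries per small index is assumed here; `shard_index` and `find_entry` are hypothetical names.

```python
def shard_index(entries, entries_per_shard):
    """Split one large index into several small ones (assumed strategy:
    a fixed number of entries per small index)."""
    if entries_per_shard <= 0:
        raise ValueError("entries_per_shard must be positive")
    return [
        entries[i:i + entries_per_shard]
        for i in range(0, len(entries), entries_per_shard)
    ]


def find_entry(shards, entries_per_shard, record_no):
    """Look up a record by number, touching only the shard that holds it."""
    shard_no, slot = divmod(record_no, entries_per_shard)
    return shards[shard_no][slot]
```

With this layout a lookup only needs the one small index that contains the record, which is one way the retrieval benefit described above could arise.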
In an alternative embodiment, before scanning the text, the text processing method may further include: receiving an issued configuration file, wherein the configuration file comprises: a slicing policy file for determining the slicing strategy, and an index file slicing policy for determining the index file.
In this embodiment, the console subsystem may issue the configuration file (e.g., the slicing policy file, the index file slicing policy, and the system configuration information) to the configuration center.
That is, in the embodiment of the present invention, a user may send parameters such as system configuration information, policy information, and index file slicing policy to a configuration center through a console.
In an alternative embodiment, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is updated periodically.
In this embodiment, each subsystem may obtain parameter information by interacting with the configuration center; for example, the file scanning subsystem may obtain the system configuration information and the index file slicing policy from the configuration center, and the file parsing subsystem may obtain the system configuration information and the policy information from the configuration center. That is, in the embodiment of the present invention, each subsystem (e.g., the file scanning subsystem and the file parsing subsystem) may acquire its corresponding configuration information from the configuration center and update it.
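The periodic configuration update can be sketched as a subsystem polling the configuration center. Everything below is an illustrative assumption: the source names no protocol or API, so `ConfigCenter`, `SubsystemConfig`, the version counter, and the 30-second interval are hypothetical.

```python
import time


class ConfigCenter:
    """In-memory stand-in for the configuration center."""

    def __init__(self):
        self._config = {}
        self._version = 0

    def publish(self, config):
        """Called by the console to issue a new configuration."""
        self._config = dict(config)
        self._version += 1

    def fetch(self):
        """Called by a subsystem to obtain the current configuration."""
        return self._version, dict(self._config)


class SubsystemConfig:
    """Configuration held by one subsystem, refreshed at a fixed interval."""

    def __init__(self, center, interval_s=30.0):
        self.center = center
        self.interval_s = interval_s
        self.version = 0
        self.config = {}
        self._next_refresh = 0.0

    def refresh_if_due(self, now=None):
        """Fetch the configuration if the interval has elapsed.

        Returns True only when a newer configuration was picked up.
        """
        now = time.monotonic() if now is None else now
        if now < self._next_refresh:
            return False
        self._next_refresh = now + self.interval_s
        version, config = self.center.fetch()
        changed = version != self.version
        self.version, self.config = version, config
        return changed
```

A scanning or parsing loop would call `refresh_if_due()` between units of work, so a newly issued slicing policy takes effect without a restart.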
In an alternative embodiment, the index file includes at least two fields: an initial location and an offset for locating a storage address of a file content, wherein the file content includes at least two fields: index code and data meta information.
In this embodiment, the file scanning subsystem may store the scanned files in slices according to the slicing strategy issued by the console and maintain the corresponding index information. The index file mainly includes two fields: an offset, and the initial position at which the record is stored in the data file. The data file mainly contains two fields: the index code (offset) and the data meta-information (datainfo).
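One possible on-disk layout for these two files is sketched below. The field widths, the big-endian encoding, and the length prefix on `datainfo` are assumptions added so that the example is self-contained; the source only names the fields. The helper names are hypothetical.

```python
import io
import struct

# Index entry: (offset, position), where position is the byte address at
# which the record starts in the data file. Field widths are assumed.
INDEX_ENTRY = struct.Struct(">QQ")
# Data record header: (offset, length of datainfo). The length prefix is an
# assumption that makes the variable-size datainfo readable.
DATA_HEADER = struct.Struct(">QI")


def append_record(data_file, index_file, offset, datainfo):
    """Append one record to the data file and its entry to the index file."""
    position = data_file.seek(0, io.SEEK_END)
    data_file.write(DATA_HEADER.pack(offset, len(datainfo)) + datainfo)
    index_file.seek(0, io.SEEK_END)
    index_file.write(INDEX_ENTRY.pack(offset, position))


def load_index(index_file):
    """Load the whole index file into a dict mapping offset -> position."""
    index_file.seek(0)
    return dict(INDEX_ENTRY.iter_unpack(index_file.read()))


def read_record(data_file, index, offset):
    """Seek directly to a record using the loaded index."""
    data_file.seek(index[offset])
    stored_offset, length = DATA_HEADER.unpack(data_file.read(DATA_HEADER.size))
    if stored_offset != offset:
        raise ValueError("index entry does not match data record")
    return data_file.read(length)
```

The same functions work on `io.BytesIO` objects or on files opened in binary read/write mode. Because the index entries are fixed-width, the index can be loaded in one read and each lookup is a single seek into the data file.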
FIG. 3 is a schematic diagram of file scanning according to an embodiment of the present invention. As shown in FIG. 3, a file server obtains the file information under a specified directory and sends the obtained files to the file scanning subsystem; the file scanning subsystem may write the file information into data files according to the slicing strategy and maintain the corresponding index information. The policy information, the index file slicing policy, and the system configuration information may be issued to the configuration center through the console.
FIG. 4 is a schematic diagram of file parsing according to an embodiment of the present invention. As shown in FIG. 4, after a data file is located by means of the index file, the data file is sent to the file parsing subsystem, which determines whether a policy is hit and reports the result to the console. In this way, the file parsing subsystem quickly locates file information by loading the index file, performs multi-threaded partition parsing, and improves the overall throughput of the system.
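Multi-threaded partition parsing with hit reporting can be sketched as follows. The source does not define what a policy is or how records are partitioned, so the keyword-based policy, the round-robin partitioning, and the names `partition`, `parse_partition`, and `parse_all` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Round-robin split of the located records into n_parts partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def parse_partition(records, keywords):
    """Parse one partition and collect the records that hit a keyword."""
    hits = []
    for name, content in records:
        matched = sorted(k for k in keywords if k in content)
        if matched:
            hits.append((name, matched))
    return hits


def parse_all(records, keywords, n_threads=4):
    """Parse every partition on its own thread and merge the hits."""
    parts = partition(records, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(parse_partition, parts, [keywords] * len(parts))
        return sorted(hit for part in results for hit in part)
```

The merged hit list corresponds to what the parsing subsystem would report to the console. In CPython, threads help most when parsing is dominated by file I/O rather than computation.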
FIG. 5 is a schematic diagram of a text processing method according to an embodiment of the present invention. As shown in FIG. 5, the DLP background management system in a data leakage prevention (DLP) console issues the policy information, the index file slicing policy, and the system configuration information to the configuration center; the configuration center may send the received information to the file scanning subsystem, and the file scanning subsystem may store the file information it reads in slices according to the index slicing strategy issued by the console and maintain the index file.
The text processing method provided by the embodiment of the invention addresses the performance bottleneck caused by scanning and parsing a large number of files in the prior art, and provides a way to recover from parsing interruptions caused by program faults and similar causes by quickly locating the breakpoint after fault recovery. Compared with traditional file scanning and parsing, the method improves the overall throughput of the system by adopting a partition scanning strategy; it introduces index files, which speed up the locating of file information and enable fast scanning; and when the system crashes during parsing, parsing can continue from the offset parameter after a restart, which realizes breakpoint analysis.
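Resuming from the offset after a crash can be sketched as follows. The source does not say where the offset is kept between runs; a small checkpoint file, written atomically, is assumed here, and `load_checkpoint`, `save_checkpoint`, and `resume_parse` are hypothetical names.

```python
import os


def load_checkpoint(path):
    """Return the next offset to parse, or 0 if no usable checkpoint exists."""
    try:
        with open(path, "r", encoding="ascii") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return 0


def save_checkpoint(path, next_offset):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="ascii") as f:
        f.write(str(next_offset))
    os.replace(tmp, path)


def resume_parse(records, checkpoint_path, handle):
    """Parse records starting at the saved offset, checkpointing after each."""
    start = load_checkpoint(checkpoint_path)
    for offset in range(start, len(records)):
        handle(records[offset])
        save_checkpoint(checkpoint_path, offset + 1)
```

If `handle` raises partway through, the checkpoint still points at the record that failed, so calling `resume_parse` again after a restart continues from that record rather than from the beginning. Only a single small file is written per record, and no database is involved.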
Example 2
According to another aspect of the embodiment of the present invention, there is provided a text processing apparatus. FIG. 6 is a schematic diagram of the text processing apparatus according to an embodiment of the present invention. As shown in FIG. 6, the text processing apparatus may include: a receiving unit 61, a scanning unit 63, an obtaining unit 65, and a parsing unit 67. The text processing apparatus is described below.
A receiving unit 61 for receiving text to be processed.
The scanning unit 63 is configured to scan the text and store it in slices according to a slicing strategy to obtain at least one sliced file, where the sliced file includes a plurality of pieces of content and an index file composed of the index information of each piece of content.
The obtaining unit 65 is configured to load the index file and locate files based on the index information in the index file to obtain the corresponding file content.
The parsing unit 67 is configured to parse each located file content in partitions using multiple threads.
Here, the receiving unit 61, the scanning unit 63, the obtaining unit 65, and the parsing unit 67 correspond to steps S102 to S108 in embodiment 1; the examples and application scenarios implemented by these units are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should be noted that the above units may be implemented as part of an apparatus in a computer system, for example as a set of computer-executable instructions.
As can be seen from the above, in this embodiment of the present application, the receiving unit receives the text to be processed; the scanning unit then scans the text and stores it in slices according to a slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; the obtaining unit then loads the index file and locates files based on the index information in the index file to obtain the corresponding file content; and the parsing unit parses each located file content in partitions using multiple threads. By introducing the index file, the text processing device provided by the embodiment of the application speeds up the locating of file information and achieves the technical effect of improving system throughput, thereby solving the technical problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
In an alternative embodiment, the receiving unit of the text processing apparatus is further configured to receive an issued configuration file before the text is scanned, wherein the configuration file comprises: a slicing policy file for determining the slicing strategy, and an index file slicing policy for determining the index file.
In an alternative embodiment, the configuration file is obtained from a configuration center, and the configuration information in the configuration file is updated periodically.
In an alternative embodiment, the index file includes at least two fields: an initial location and an offset for locating a storage address of a file content, wherein the file content includes at least two fields: index code and data meta information.
Example 3
According to another aspect of the embodiment of the present invention, there is provided a text processing system, and fig. 7 is a schematic diagram of the text processing system according to an embodiment of the present invention, and as shown in fig. 7, the text processing system includes:
A control subsystem 71 for providing a configuration file, wherein the configuration file comprises: a slicing policy file for determining the slicing strategy, and an index file slicing policy for determining the index file.
A file scanning subsystem 73, in communication with the control subsystem, for scanning the text to be processed and storing the text in slices according to the slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content.
A file parsing subsystem 75 for loading the index file, locating files based on the index information in the index file to obtain the corresponding file content, and parsing each located file content in partitions using multiple threads.
In the text processing system provided by the embodiment of the invention, the control subsystem provides the configuration file, wherein the configuration file comprises a slicing policy file for determining the slicing strategy and an index file slicing policy for determining the index file; the file scanning subsystem, which communicates with the control subsystem, scans the text to be processed and stores it in slices according to the slicing strategy to obtain at least one sliced file, wherein the sliced file comprises a plurality of pieces of content and an index file composed of the index information of each piece of content; and the file parsing subsystem loads the index file, locates files based on the index information in the index file to obtain the corresponding file content, and parses each located file content in partitions using multiple threads. Introducing the index file thus speeds up the locating of file information and achieves the technical effect of improving system throughput, thereby solving the technical problem that, in the prior art, breakpoint analysis scans file information, locates the working point at which a fault occurred, and persists the scanning result to a database, so that the database is operated frequently during scanning and storage and system throughput is poor.
In an alternative embodiment, the text processing system further includes: a configuration center, which communicates with the control subsystem, the file scanning subsystem, and the file parsing subsystem respectively, and is used for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file parsing subsystem.
In an alternative embodiment, the text processing system further includes: a file server, in communication with the file scanning subsystem, for sending the files under a specified directory to the file scanning subsystem.
Example 4
According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium including a stored computer program, wherein the computer program when executed by a processor controls a device in which the computer storage medium is located to perform a method of processing text according to any one of the above.
Example 5
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a computer program, wherein the computer program, when run, performs the text processing method according to any one of the above.
The foregoing embodiment numbers of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be made through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware or in the form of software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, the software product including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and refinements without departing from the principles of the present invention, and such modifications and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A method for processing text, comprising:
receiving text to be processed;
receiving an issued configuration file, wherein the configuration file comprises: a slicing strategy file for determining a slicing strategy, and a strategy file for determining an index file; wherein the configuration file is obtained from a configuration center, and configuration information in the configuration file is updated periodically;
scanning the text, and storing the text in slices by using the slicing strategy to obtain at least one slicing file, wherein the slicing file comprises: a plurality of pieces of content, and an index file composed of index information of each piece of content;
loading the index file, and locating files based on the index information in the index file to obtain corresponding file content; and
performing partition analysis, by using multiple threads, on each file content obtained by the locating.
2. The method of claim 1, wherein the index file comprises at least the following two fields: a start position and an offset of a storage address, which are used for locating the file content; and wherein the file content comprises at least the following two fields: an index code and data meta information.
3. A text processing apparatus, comprising:
a receiving unit for receiving text to be processed;
the receiving unit being further configured to receive, before the text is scanned, an issued configuration file, wherein the configuration file comprises: a slicing strategy file for determining a slicing strategy, and a strategy file for determining an index file; wherein the configuration file is obtained from a configuration center, and configuration information in the configuration file is updated periodically;
a scanning unit for scanning the text, and storing the text in slices by using the slicing strategy to obtain at least one slicing file, wherein the slicing file comprises: a plurality of pieces of content, and an index file composed of index information of each piece of content;
an acquisition unit for loading the index file, and locating files based on the index information in the index file to obtain corresponding file content; and
an analysis unit for performing partition analysis, by using multiple threads, on each file content obtained by the locating.
4. The apparatus of claim 3, wherein the index file comprises at least the following two fields: a start position and an offset of a storage address, which are used for locating the file content; and wherein the file content comprises at least the following two fields: an index code and data meta information.
5. A text processing system, comprising:
a control subsystem for providing a configuration file, wherein the configuration file comprises: a slicing strategy file for determining a slicing strategy, and a strategy file for determining an index file;
a file scanning subsystem, which communicates with the control subsystem and is used for scanning a text to be processed and storing the text in slices by using the slicing strategy to obtain at least one slicing file, wherein the slicing file comprises: a plurality of pieces of content, and an index file composed of index information of each piece of content; and
a file analysis subsystem for loading the index file, locating files based on the index information in the index file to obtain corresponding file content, and performing partition analysis, by using multiple threads, on each file content obtained by the locating.
6. The text processing system of claim 5, further comprising:
a configuration center, which communicates with the control subsystem, the file scanning subsystem, and the file analysis subsystem respectively, and which is used for receiving and storing the configuration file issued by the control subsystem and providing the configuration file to the file scanning subsystem and the file analysis subsystem.
7. The text processing system of claim 5, further comprising:
a file server, which communicates with the file scanning subsystem and is used for sending the files under a specified directory to the file scanning subsystem.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein the computer program, when run by a processor, controls a device in which the computer-readable storage medium is located to perform the method for processing text according to any one of claims 1 to 2.
CN202110045135.XA 2021-01-13 2021-01-13 Text processing method and device and text processing system Active CN112749125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110045135.XA CN112749125B (en) 2021-01-13 2021-01-13 Text processing method and device and text processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110045135.XA CN112749125B (en) 2021-01-13 2021-01-13 Text processing method and device and text processing system

Publications (2)

Publication Number Publication Date
CN112749125A CN112749125A (en) 2021-05-04
CN112749125B true CN112749125B (en) 2024-05-03

Family

ID=75651755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110045135.XA Active CN112749125B (en) 2021-01-13 2021-01-13 Text processing method and device and text processing system

Country Status (1)

Country Link
CN (1) CN112749125B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468119A (en) * 2021-05-31 2021-10-01 北京明朝万达科技股份有限公司 File scanning method and device
CN113836088A (en) * 2021-08-31 2021-12-24 北京明朝万达科技股份有限公司 File processing method, system and device based on depth scanning and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101193273A (en) * 2006-11-20 2008-06-04 中兴通讯股份有限公司 A storage and playing method for real time multimedia image information
US9483568B1 (en) * 2013-06-05 2016-11-01 Google Inc. Indexing system
CN108089977A (en) * 2017-11-28 2018-05-29 维沃移动通信有限公司 A kind of abnormality eliminating method of application program, device and mobile terminal
CN111367860A (en) * 2018-12-26 2020-07-03 北京奇虎科技有限公司 File refreshing method and device
CN111831622A (en) * 2020-03-31 2020-10-27 北京嘀嘀无限科技发展有限公司 Data index generation method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN112749125A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
US9575984B2 (en) Similarity analysis method, apparatus, and system
CN112749125B (en) Text processing method and device and text processing system
US7721288B2 (en) Organizing transmission of repository data
CN109522316B (en) Log processing method, device, equipment and storage medium
CN109271545B (en) Feature retrieval method and device, storage medium and computer equipment
CN105512283A (en) Data quality management and control method and device
JP2020057416A (en) Method and device for processing data blocks in distributed database
CN111831625B (en) Data migration method, data migration device, and readable storage medium
CN110795614A (en) Index automatic optimization method and device
CN113297182A (en) Data migration method, device, storage medium and program product
CN114116762A (en) Offline data fuzzy search method, device, equipment and medium
CN111367926A (en) Data processing method and device for distributed system
CN110515895B (en) Method and system for carrying out associated storage on data files in big data storage system
CA3027220A1 (en) Tracking file movement in a network environment
CN114490554A (en) Data synchronization method and device, electronic equipment and storage medium
CN113672616B (en) Data indexing method, device, terminal and storage medium
CN110851437A (en) Storage method, device and equipment
CN114153378A (en) Database memory management system and method
CN105657473A (en) Data processing method and device
CN110543452A (en) data acquisition method and equipment
US11860678B2 (en) Optimized sampling of resource content data for session recording under communication constraints by independently capturing agents
US20060288340A1 (en) System for acquisition, representation and storage of streaming data
CN113111194B (en) Object metadata aggregation method, object metadata reading device, object metadata equipment and storage medium
CN112506877B (en) Data deduplication method, device and system based on deduplication domain and storage equipment
CN116361349A (en) Data processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant