CN117688350A - Method and device for identifying file content - Google Patents

Publication number
CN117688350A
Authority
CN
China
Prior art keywords
content
file
model
text
recognition
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311800044.8A
Other languages
Chinese (zh)
Inventor
申奥
林赞磊
商雷
宋阳
陈博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Great Wall Technology Co ltd
Original Assignee
New Great Wall Technology Co ltd
Application filed by New Great Wall Technology Co ltd

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method and a device for identifying file content. The device comprises: a configuration module, which defines the attributes of the file and of the acquisition items in the file, obtains the position of each acquisition item in the file, and configures the identification rules of the acquisition items; a task module, which creates a recognition task, performs a preliminary recognition of the file content, and divides the recognition task into a character recognition task or an OCR recognition task according to the preliminary recognition result; and an acquisition module, which acquires the content of the acquisition items in the file according to the recognition task and identifies the text in that content according to the rules defined in the configuration module. The invention automatically determines whether to use character recognition or OCR picture recognition and, by providing functions such as file content definition and acquisition item configuration, makes the processing of acquisition items flexible and configurable.

Description

Method and device for identifying file content
Technical Field
The present invention relates to the field of data identification, and more particularly, to a method and apparatus for identifying file content.
Background
As market competition intensifies, enterprises face growing demands for technical innovation, and more of them are introducing data management systems to manage their files. Managing and identifying the content of these files is a significant challenge, especially when handling large amounts of data with a fixed format or fixed item information. For example, enterprises have introduced patent management systems that manage the full lifecycle of patent applications, grants, use, and the like. An enterprise operator or an agency operator handles patent applications and other patent matters according to the approved proposal. When an official notice is received, the operator must promptly register the official document, or enter electronic official documents in batches, and collect specific data items from the notice. Collecting this data by hand takes considerable effort and time and wastes labor.
Accordingly, there is a need in the art for a solution that can effectively identify and process the content of a file.
The above information disclosed in the background section is only for a further understanding of the background of the invention and therefore it may contain information that does not form the prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention provides a method and a device for identifying file content. The invention addresses the problem that file content with various fixed formats or items cannot currently be identified and processed effectively.
A first aspect of the present invention provides a device for identifying file content, the device comprising: a configuration module, which defines the attributes of the file and of the acquisition items in the file, obtains the position of each acquisition item in the file, and configures the identification rules of the acquisition items; a task module, which creates a recognition task, performs a preliminary recognition of the file content, and divides the recognition task into a character recognition task or an OCR recognition task according to the preliminary recognition result; and an acquisition module, which acquires the content of the acquisition items in the file according to the recognition task and identifies the text in that content according to the rules defined in the configuration module.
A second aspect of the present invention provides a method for identifying file content, the method comprising: S1: defining the attributes of the file and of the acquisition items in the file, obtaining the position of each acquisition item in the file, and configuring the identification rules of the acquisition items; S2: creating a recognition task, performing a preliminary recognition of the file content, and dividing the recognition task into a character recognition task or an OCR recognition task according to the preliminary recognition result; S3: acquiring the content of the acquisition items in the file according to the recognition task, and identifying the text in that content according to the rules defined in the configuration module.
According to the solution provided by the invention for identifying and processing file content, after a file is uploaded the system automatically determines whether to use character recognition or OCR picture recognition. Combined with the needs of operators, it provides functions such as document definition, acquisition item configuration, and content formatting, so that the acquisition of acquisition items is flexible and configurable. The OCR picture recognition model is trained, evaluated, and used for inference, which improves the accuracy of OCR recognition of notice content. The solution is unified, convenient, and efficient, and requires no manual operation.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a block diagram of an identification device for file content according to an embodiment of the invention.
Fig. 2 shows a block diagram of an implementation of an identification device for file content according to an embodiment of the invention.
FIG. 3 illustrates a flow diagram for a configuration module implementation in accordance with one or more embodiments of the invention.
FIG. 4 shows a schematic diagram of a training process for OCR models in accordance with an exemplary embodiment of the present invention.
Fig. 5 shows a flow chart of a method for identifying file content according to an embodiment of the invention.
Fig. 6 shows a flowchart of an implementation of a method for identifying file content according to an embodiment of the invention.
Detailed Description of Embodiments
As used herein, the terms "first," "second," and the like may be used to describe elements in exemplary embodiments of the present invention. These terms are only used to distinguish one element from another element, and the inherent feature or sequence of the corresponding element, etc. is not limited by the terms. Unless defined otherwise, all terms (including technical or scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Those skilled in the art will understand that the devices and methods of the present invention described herein and illustrated in the accompanying drawings are non-limiting exemplary embodiments and that the scope of the present invention is defined solely by the claims. The features illustrated or described in connection with one exemplary embodiment may be combined with the features of other embodiments. Such modifications and variations are intended to be included within the scope of the present invention.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, detailed descriptions of related known functions or configurations are omitted so as not to unnecessarily obscure the technical gist of the present invention. In addition, throughout the description, the same reference numerals denote the same circuits, modules or units, and repetitive descriptions of the same circuits, modules or units are omitted for brevity.
Furthermore, it should be understood that one or more of the following methods or aspects thereof may be performed by at least one control unit or controller. The terms "control unit", "controller", "control module" or "master" may refer to a hardware device comprising a memory and a processor. The memory or computer-readable storage medium is configured to store program instructions, and the processor is specifically configured to execute the program instructions to perform one or more processes that will be described further below. Moreover, it should be appreciated that the following methods may be performed by including a processor in combination with one or more other components, as will be appreciated by those of ordinary skill in the art.
Because PDF files and picture files in the prior art may be in either picture format or text format, and the parsing methods and techniques differ between formats (for example, in patent documents, the notice content areas differ between documents and between countries), data items specific to the file content need to be collected.
In the prior art, the choice between character recognition and OCR picture recognition can only be made manually, and after a large number of files have been recognized, the required data items are selected manually and then copied and pasted. This process has the following problems: 1. the file format must be judged and the recognition technique selected manually, which is inefficient and complicated to operate; 2. OCR recognition is affected by factors such as picture definition, watermarks, and foreign languages, so the recognized content is inaccurate and needs cleaning; 3. finding the desired data items in the recognized content is cumbersome.
Therefore, the invention provides a method by which process operators at an enterprise or an agency upload PDF files or picture files from various countries and the system automatically acquires the PDF content. The solution of the present invention is applicable to files with a fixed format or fixed item information, such as patent documents, trademark documents, insurance policies, financial bills, government documents at all levels, and medical documents.
Fig. 1 shows a block diagram of an identification device for file content according to an embodiment of the invention.
As shown in fig. 1, the identification device for file content includes the following modules. The storage module reads the file uploaded by the user, parses it, and uploads the parsed file content to the configuration module. The configuration module defines the attributes of the file and of the acquisition items in the file, obtains the position of each acquisition item in the file, and configures the identification rules of the acquisition items. The task module creates a recognition task, performs a preliminary recognition of the file content, and divides the recognition task into a character recognition task or an OCR recognition task according to the preliminary recognition result. The acquisition module acquires the content of the acquisition items in the file according to the recognition task and identifies the text in that content according to the rules defined in the configuration module. The reading module performs format conversion on the text identified by the acquisition module, verifies that text according to the verification rules in the identification rules, assembles the identified text into data according to the attributes of the file's acquisition items, and provides an API for external access.
Fig. 2 shows a block diagram of an implementation of an identification device for file content according to an embodiment of the invention.
As shown in fig. 2, the storage module parses the PDF content out of the notice uploaded by the user, which may be a compressed package, a mail EML file, or a PDF file, and uploads the PDF content to a file server.
According to one or more embodiments of the present invention, taking a patent or trademark file as an example, the configuration module first defines the attributes of the notice documents to be collected, including name, naming matching rule, country or region, official document information, and the like, according to the user's needs and the different notice document types of different countries. Second, it defines the notice content to be acquired for each type of document, including acquisition item name, intellectual property type, general and special verification rules, returned field name, and the like. Third, it defines recognition configuration rules according to the position and the identifier of each acquisition item in the document, including the special identifier of the item, its position in the document, the regular-expression recognition rule, the verification rule, and the like. In this way, documents are flexibly matched to the content to be acquired.
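The storage step can be illustrated with a minimal Python sketch that uses only the standard library. The function name `extract_pdfs` and the decision to dispatch on the file extension are assumptions made for illustration; the patent does not specify how the storage module is implemented, and the upload to a file server is omitted.

```python
import io
import zipfile
from email import message_from_bytes, policy


def extract_pdfs(filename: str, data: bytes) -> list[tuple[str, bytes]]:
    """Return (name, content) pairs for every PDF found in an uploaded file.

    Hypothetical helper: handles a bare PDF, a ZIP package, or an EML mail.
    """
    lower = filename.lower()
    if lower.endswith(".pdf"):
        return [(filename, data)]
    if lower.endswith(".zip"):
        # Read PDF members directly from the in-memory archive.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [
                (name, archive.read(name))
                for name in archive.namelist()
                if name.lower().endswith(".pdf")
            ]
    if lower.endswith(".eml"):
        # policy.default yields EmailMessage objects with iter_attachments().
        message = message_from_bytes(data, policy=policy.default)
        return [
            (part.get_filename(), part.get_content())
            for part in message.iter_attachments()
            if (part.get_filename() or "").lower().endswith(".pdf")
        ]
    return []
```

Nested archives and mails attached to mails are not handled here; a production storage module would presumably need to recurse into them.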
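One way to picture the three layers of configuration (document definition, acquisition item, recognition rule) is as plain data classes. The sketch below is illustrative only: every class and field name is an assumption chosen to mirror the attributes listed above, not a structure taken from the patent.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RecognitionRule:
    marker: str           # special identifier that precedes the item in the document
    page: int             # page on which the item is expected (1-based)
    pattern: str          # regular expression whose first group captures the value
    validation: str = ""  # optional regular expression the value must fully match


@dataclass
class AcquisitionItem:
    name: str             # acquisition item name
    ip_type: str          # intellectual property type, e.g. "patent" or "trademark"
    field_name: str       # key under which the value is returned
    rule: RecognitionRule


@dataclass
class NoticeDefinition:
    name: str
    filename_pattern: str  # naming matching rule, written as a regular expression
    country: str
    items: list[AcquisitionItem] = field(default_factory=list)

    def matches(self, filename: str) -> bool:
        """Check whether an uploaded file name belongs to this notice type."""
        return re.search(self.filename_pattern, filename) is not None
```

Keeping the rule separate from the item means the same item (for example an application number) can be located differently in notices from different countries.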
FIG. 3 illustrates a flow diagram for a configuration module implementation in accordance with one or more embodiments of the invention.
Taking notice files of patents and trademarks as examples, firstly defining various notice files, and then carrying out configuration collection of notice file collection items; configuring rules for collecting item identification; and finally binding the notice document and the collection item, and further generating a document collection rule.
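The final binding step of this flow can be sketched as a function that joins a notice definition with its acquisition items and their identification rules. The function name `build_collection_rule` and the dictionary layout are hypothetical; the sketch only shows the idea that binding should fail early when an item has no rule configured.

```python
def build_collection_rule(notice: dict, items: list[dict], rules: dict[str, dict]) -> dict:
    """Bind a notice definition to its acquisition items and their rules.

    `rules` maps an acquisition item name to its identification rule.
    """
    missing = [item["name"] for item in items if item["name"] not in rules]
    if missing:
        # Refuse to generate an incomplete document collection rule.
        raise ValueError(f"no identification rule configured for: {', '.join(missing)}")
    return {
        "notice": notice["name"],
        "country": notice["country"],
        # Each bound item carries its own rule so the engine needs no further lookups.
        "items": [{**item, "rule": rules[item["name"]]} for item in items],
    }
```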
According to one or more embodiments of the present invention, taking notice files of patents and trademarks as an example, the task module creates recognition tasks in batches from the packages of notice files uploaded by users, performs a preliminary recognition of the file content, and subdivides the tasks by type into OCR recognition tasks and PDF character recognition tasks. The tasks in the task module periodically trigger the acquisition module to acquire data items from the PDF text or from OCR pictures.
According to one or more embodiments of the present invention, the configuration module may transmit a text content recognition flag to the task module. If the configuration module actively transmits the flag, the task module selects PDFBox character recognition or OCR recognition according to the flag. If the configuration module does not transmit the flag, PDFBox character recognition is tried first: whether PDFBox character recognition can be used (that is, whether a character recognition task is established) is judged preliminarily from the recognized content, based on the text length and on whether a special marker is included. If it cannot be used, OCR is adopted to recognize the content in the pictures (that is, an OCR recognition task is established). The text content recognition flag indicates either OCR recognition or direct text conversion.
According to one or more embodiments of the present invention, character recognition of PDF files uses a text conversion unit, which may use the PDFBox tool. Specifically, Apache PDFBox is an open-source Java library that supports the development and conversion of PDF documents. With this library, Java programs can be developed to create, convert, and manipulate PDF documents, including extracting Unicode text from PDFs, splitting files, and converting pages to images. OCR image recognition (or picture recognition) in the invention is implemented with an OCR model unit, which may use PaddleOCR, a text detection and recognition system provided by the PaddlePaddle platform. PaddleOCR provides text detection, text recognition, and model training and inference.
According to one or more embodiments of the present invention, taking notice files of patents and trademarks as an example, a rule recognition engine in the acquisition module receives the recognition task and parses the PDF file. The parsing flow is shown in fig. 3 (recognition process for acquisition items in notice PDF text). First, each notice PDF file is parsed and identified according to the document definitions in the configuration module. Second, a parsing task queue for notice PDF files is generated inside the engine and processed in first-in, first-out order. A task marked for PDFBox character recognition undergoes PDFBox text parsing; if parsing succeeds, the corresponding acquisition items are identified according to the recognition configuration rules of the configuration module. Otherwise, the task is handled as an OCR recognition task: the notice PDF file is converted page by page into notice picture files, a recognition task is generated and executed for those picture files, the trained PaddleOCR text detection and recognition system detects and recognizes the text in the pictures, the text recognized from all picture files of a single notice is merged and cleaned, and finally the corresponding acquisition items are identified according to the recognition configuration rules of the configuration module.
According to one or more embodiments of the present invention, the reading module performs format conversion of the acquisition items on the parsed data, verifies the data items according to the verification rules of the acquisition items in the configuration module, and assembles the data according to the field names of the acquisition items and the attribute information defined for the document, for example into JSON format, so as to provide a corresponding API that is accessible externally.
According to one or more embodiments of the present invention, the acquisition module first inputs the acquisition item content to the text conversion unit for text parsing. If text parsing succeeds, the text conversion unit is used to recognize the acquisition item content. If text parsing does not succeed, the acquisition item content is converted into a picture format and input to the OCR model unit, which uses a trained OCR model to recognize the acquisition item content and obtain the corresponding text.
According to one or more embodiments of the present invention, the behavior of the acquisition module depends on the business scene. If the scene can be handled by PDFBox character recognition, the text result is returned immediately and the subsequent business flow proceeds synchronously on the recognized text. If OCR model recognition is required, however, recognition takes a long time and a large amount of computation, so the acquisition module can only create an asynchronous task, and the subsequent business flow is triggered once the task has finished executing.
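The routing decision described above can be sketched as a small function. The threshold `min_length`, the `markers` parameter, and the reading that an expected marker must be present for the extracted text to count as usable are all assumptions; the patent names the two criteria (text length and a special marker) without giving values or the exact test.

```python
from enum import Enum
from typing import Optional


class TaskType(Enum):
    TEXT = "text"  # character recognition task (direct text conversion)
    OCR = "ocr"    # OCR recognition task


def choose_task(
    extracted_text: str,
    flag: Optional[TaskType] = None,
    min_length: int = 50,
    markers: tuple[str, ...] = (),
) -> TaskType:
    """Pick the recognition task type for one file."""
    # An explicit flag from the configuration module always wins.
    if flag is not None:
        return flag
    text = extracted_text.strip()
    # Too little extractable text suggests a scanned (picture) PDF.
    if len(text) < min_length:
        return TaskType.OCR
    # If markers are configured, at least one must appear in the text.
    if markers and not any(marker in text for marker in markers):
        return TaskType.OCR
    return TaskType.TEXT
```

Here `extracted_text` stands for the result of the preliminary text extraction, which the patent performs with PDFBox.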
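The first-in, first-out queue with an OCR fallback can be sketched as follows. The two callables `parse_text` and `run_ocr` are hypothetical stand-ins for PDFBox text parsing and the page-by-page PaddleOCR path; neither library is called here, and the rule that an empty result counts as a parsing failure is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def run_queue(
    files: Iterable[str],
    parse_text: Callable[[str], Optional[str]],
    run_ocr: Callable[[str], str],
) -> dict[str, tuple[str, str]]:
    """Process notice files in FIFO order, falling back to OCR when text parsing fails.

    Returns a mapping from file path to (method used, recognized text).
    """
    queue = deque(files)
    results: dict[str, tuple[str, str]] = {}
    while queue:
        path = queue.popleft()  # first in, first out
        text = parse_text(path)
        if text:
            results[path] = ("text", text)
        else:
            # Empty or missing text: treat the file as a picture PDF.
            results[path] = ("ocr", run_ocr(path))
    return results
```

Identification of the individual acquisition items would then run on the returned text regardless of which path produced it.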
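The conversion, verification, and assembly steps can be sketched in a few lines of Python. The item keys (`name`, `field_name`, `validation`, `date_format`), the choice of ISO dates as the target format, and the `errors` list in the output are illustrative assumptions, not details from the patent.

```python
import json
import re
from datetime import datetime


def assemble_record(notice: str, values: dict[str, str], items: list[dict]) -> str:
    """Convert, verify, and assemble recognized values into a JSON string."""
    fields: dict[str, str] = {}
    errors: list[str] = []
    for item in items:
        value = values.get(item["name"], "").strip()
        # Format conversion: normalize dates to ISO 8601 when a source format is given.
        if item.get("date_format") and value:
            try:
                value = datetime.strptime(value, item["date_format"]).date().isoformat()
            except ValueError:
                errors.append(item["name"])
                continue
        # Verification: the converted value must fully match the item's rule.
        if not re.fullmatch(item["validation"], value):
            errors.append(item["name"])
            continue
        # Assembly: store the value under the configured return field name.
        fields[item["field_name"]] = value
    return json.dumps({"notice": notice, "fields": fields, "errors": errors}, ensure_ascii=False)
```

An externally accessible API could return this string as its response body; the transport layer is outside the scope of the sketch.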
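The difference between the synchronous text path and the asynchronous OCR path can be sketched with the standard `concurrent.futures` module. The function name, the string task types, and the `on_done` callback that represents the subsequent business flow are assumptions for illustration.

```python
from concurrent.futures import Executor, Future
from typing import Callable, Optional


def submit_recognition(
    path: str,
    task_type: str,
    parse_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    on_done: Callable[[str, str], None],
    executor: Executor,
) -> Optional[Future]:
    """Run text recognition inline, or schedule OCR as an asynchronous task."""
    if task_type == "text":
        # Fast path: the follow-up flow runs synchronously on the result.
        on_done(path, parse_text(path))
        return None
    # Slow path: OCR runs in the background; the follow-up flow is
    # triggered by a callback once the task has finished.
    future = executor.submit(run_ocr, path)
    future.add_done_callback(lambda finished: on_done(path, finished.result()))
    return future
```

A caller would typically pass a `ThreadPoolExecutor` or a process pool; if OCR is compute-bound, a process pool or a separate worker service is likely the better fit.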
FIG. 4 shows a schematic diagram of a training process for OCR models in accordance with an exemplary embodiment of the present invention.
As shown in fig. 4, in accordance with one or more embodiments of the invention, the training process of the OCR model includes: making a data set corresponding to the acquisition item in the file content; training the acquisition item content by using the lightweight detection model of OCR to obtain a text detection model of the acquisition item content and generating an inference model corresponding to the detection model; training the acquisition item content by using the training model of OCR for a plurality of times to obtain an identification model of the acquisition item content and generating an inference model relative to the identification model; and carrying out serial reasoning on the trained detection model and the recognition model to obtain a trained OCR model, wherein the OCR model is used for carrying out text detection and recognition on the content of the acquired items.
In accordance with one or more embodiments of the present invention, taking a patent or trademark notice as an example, building the OCR model includes a process of creating a notice dataset, a process of training a notice text detection model, a process of training a notice text recognition model, and a process of converting into a notice inference model. Specifically, the process of creating the notice dataset includes: using the fitz module of the PyMuPDF library to convert PDFs into pictures, and dividing the pictures into a training set and a verification set at a ratio of 2:1; then marking the positions to be acquired with the PPOCRLabel tool, thereby constructing a notice dataset based on patents and trademarks. The process of training the notice text detection model includes: training the text detection model by fine-tuning the PP-OCRv3 lightweight detection model, that is, training on the notice dataset multiple times starting from the ml_PP-OCRv3_det pretrained model and fine-tuning the parameters of the extracted Student structure so that the final hmean exceeds 0.8. The process of training the notice text recognition model includes: training the text recognition model based on fine-tuning of the PP-OCRv3 lightweight detection model, that is, training on the notice dataset multiple times starting from the ch_ppocr_server_v2.0_rec, en_number_mobile_v2.0_rec, and japan_PP-OCRv3_rec models and fine-tuning the parameters of the extracted knowledge distillation structure so that the final hmean exceeds 0.8.
The process of converting into the notice inference model includes: converting the trained models into inference models, and using the detection and recognition model concatenation tool provided by PaddleOCR to connect any trained detection model and any trained recognition model in series into a two-stage text recognition model (detection model inference followed by recognition model inference). Given images from the notice verification set as input, the OCR model passes through four main stages (text detection, detection box correction, text recognition, and score filtering) and outputs the text positions and recognition results in the file content. Taking patent or trademark notices as an example, the OCR model covers the creation of notice datasets for various countries, training, recognition, and inference, which improves the accuracy of OCR recognition of notice content.
According to one or more embodiments of the invention, the OCR model of the invention may also be used to recognize the content of other types of documents with a fixed format or fixed item information, such as insurance policies, financial bills, government documents at all levels, or medical documents. The above examples merely take notice files of patents or trademarks as examples; the solution for identifying and processing file content applies in a similar way to other files with a fixed format or fixed item information, and is not described again here.
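Two numeric details in this paragraph lend themselves to a short sketch: the 2:1 split of page images and the hmean threshold of 0.8. The code assumes that hmean denotes the harmonic mean of precision and recall, which is how text detection results are commonly scored; the shuffling with a fixed seed and the function names are illustrative choices, not taken from the patent.

```python
import random


def split_dataset(
    images: list[str], train_parts: int = 2, val_parts: int = 1, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Shuffle page images and split them into training and verification sets."""
    shuffled = images[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:cut], shuffled[cut:]


def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed meaning of hmean)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With nine page images, the split yields six for training and three for verification. Because the harmonic mean is pulled toward the lower of the two values, a model with high precision but poor recall would still fall short of the 0.8 target.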
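The four stages named above can be sketched as a pipeline over injected callables. This is not PaddleOCR's API: `detect`, `rectify`, and `recognize` are hypothetical stand-ins for the detection model, the box correction step, and the recognition model, and the default `min_score` of 0.5 is an arbitrary illustrative threshold.

```python
from typing import Callable, Optional

Box = tuple[int, int, int, int]  # x1, y1, x2, y2 in pixel coordinates


def recognize_page(
    image: object,
    detect: Callable[[object], list[Box]],
    recognize: Callable[[object, Box], tuple[str, float]],
    rectify: Optional[Callable[[Box], Box]] = None,
    min_score: float = 0.5,
) -> list[tuple[Box, str, float]]:
    """Two-stage text recognition: detect boxes, then recognize and filter each one."""
    results = []
    for box in detect(image):                    # stage 1: text detection
        if rectify is not None:
            box = rectify(box)                   # stage 2: detection box correction
        text, score = recognize(image, box)      # stage 3: text recognition
        if score >= min_score:                   # stage 4: score filtering
            results.append((box, text, score))
    return results
```

Passing the models in as callables keeps the sketch independent of any framework and mirrors the idea that any trained detection model can be chained with any trained recognition model.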
Fig. 5 shows a flow chart of a method for identifying file content according to an embodiment of the invention.
As shown in fig. 5, the identification method for file contents includes:
S1: defining the attributes of the file and of the acquisition items in the file, obtaining the position of each acquisition item in the file, and configuring the identification rules of the acquisition items;
S2: creating a recognition task, performing a preliminary recognition of the file content, and dividing the recognition task into a character recognition task or an OCR recognition task according to the preliminary recognition result;
S3: acquiring the content of the acquisition items in the file according to the recognition task, and identifying the text in that content according to the rules defined in the configuration module.
Wherein, before S1, further comprises: reading a file uploaded by a user, and analyzing to obtain analyzed file content; and after S3 further comprises: performing format conversion on the text in the identified acquisition item content; and verifying the text in the content of the acquisition item according to the verification rule in the identification rule, performing data assembly on the identified text in the content of the acquisition item according to the attribute of the file acquisition item, and providing an API for external access.
Fig. 6 shows a flowchart of an implementation of a method for identifying file content according to an embodiment of the invention.
Taking the file content of patent or trademark notices as an example, as shown in fig. 6, the PDF file of each notice is first parsed and identified according to the content defined for the notice document, and a parsing task queue for notice PDF files is then generated. Next, it is judged whether PDFBox text parsing passes. If it passes, the PDFBox parsing result is adopted, and the corresponding acquisition item content is identified according to the recognition configuration rules in the configuration module. If PDFBox text parsing does not pass, the notice PDF file is converted page by page into notice picture files using PDFBox; a recognition task for the notice picture files is generated and executed; PaddleOCR detects and recognizes the text content in the pictures; the text recognized from all picture files of a single notice is merged according to the format and the data is cleaned; and finally the corresponding acquisition item content is identified according to the recognition configuration rules in the configuration module.
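The last two steps of the OCR path, merging the text of all pages of one notice and then applying a configured rule, can be sketched as follows. The cleaning policy (collapsing whitespace and dropping empty lines) and the convention that the first capture group of the regular expression holds the value are assumptions; the patent does not spell out what cleaning involves.

```python
import re
from typing import Optional


def merge_and_clean(pages: list[list[str]]) -> str:
    """Merge the recognized lines of all pages of one notice and normalize whitespace."""
    lines = []
    for page in pages:
        for line in page:
            cleaned = re.sub(r"\s+", " ", line).strip()
            if cleaned:  # drop lines that are empty after cleaning
                lines.append(cleaned)
    return "\n".join(lines)


def extract_item(text: str, pattern: str) -> Optional[str]:
    """Apply one regular-expression identification rule; group 1 captures the value."""
    match = re.search(pattern, text)
    return match.group(1) if match else None
```

The same `extract_item` call works on text produced by the direct text parsing path, which is what allows both paths to share one set of recognition configuration rules.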
The solution of the invention combines two technologies of text recognition and OCR model recognition, realizes the text recognition of PDF literature contents, combines the functions of document definition, acquisition item configuration, content formatting and the like, and realizes the flexible configurable processing of acquisition items.
In accordance with one or more embodiments of the present invention, control logic in the devices and systems of the present invention may implement processes as in the above systems of the present invention using encoded instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium (e.g., hard disk drive, flash memory, read-only memory, optical disk, digital versatile disk, cache, random access memory, and/or any other storage device or storage disk) where information during any time period (e.g., extended period of time, permanent, transient instance, temporary cache, and/or information cache) is stored. As used herein, the term "non-transitory computer-readable medium" is expressly defined to include any type of computer-readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
Logic in the system of the present invention may be implemented using control circuitry, (control logic, a master control system, or a control module) that may include one or more processors or may include a non-transitory computer readable medium therein, in accordance with one or more embodiments of the present invention. In particular, the master control system or control module may comprise a microcontroller MCU. Processors used to implement the processing of logic in the system of the present invention may be, for example, but are not limited to, one or more single-core or multi-core processors. The processor(s) may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, etc.). The processor may be coupled to and/or may include a memory/storage device and may be configured to execute instructions stored in the memory/storage device to implement various applications and/or operating systems running on the controller of the present invention.
The following are further examples of the invention:
example 1. An identification device for file content, the device comprising: a configuration module configured to: defining the file and the attribute of the acquisition item in the file, acquiring the position of the acquisition item in the file, and configuring the identification rule of the acquisition item; a task module configured to: creating a recognition task and primarily recognizing the content of the file, and dividing the recognition task of the file content into a character recognition task and an OCR recognition task according to a primary recognition result; an acquisition module configured to: and acquiring the content of the acquisition item in the file according to the identification task, and identifying the text in the content of the acquisition item according to the rule defined in the configuration module.
Example 2 the apparatus of example 1, further comprising: a storage module configured to: and reading the file uploaded by the user, analyzing the file, and uploading the analyzed file content to the configuration module.
Example 3 the apparatus of example 1, further comprising: a reading module configured to: performing format conversion on the text in the acquisition item content identified by the acquisition module; and verifying the text in the content of the acquisition item according to the verification rule in the identification rule, performing data assembly on the identified text in the content of the acquisition item according to the attribute of the file acquisition item, and providing an API for external access.
Example 4. The apparatus of example 1, wherein the acquisition module further comprises a text conversion unit and an OCR model unit, wherein the acquisition module first inputs the acquisition item content to the text conversion unit for text parsing; if the text parsing succeeds, the text conversion unit recognizes the acquisition item content; and if the text parsing fails, the acquisition item content is converted into a picture format and input to the OCR model unit, which uses a trained OCR model to recognize the acquisition item content and obtain the corresponding text.
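By way of a minimal sketch, the text-first recognition with OCR fallback described in example 4 could be arranged as below. The three callables stand in for the text conversion unit, the picture conversion step and the OCR model unit; their names and the rule that parsing "succeeds" when it yields non-empty text are assumptions of this sketch.

```python
from typing import Callable, Optional, Tuple


def recognize_item(
    content: bytes,
    parse_text: Callable[[bytes], Optional[str]],  # hypothetical text conversion unit
    to_image: Callable[[bytes], bytes],            # hypothetical picture conversion
    run_ocr: Callable[[bytes], str],               # hypothetical OCR model unit
) -> Tuple[str, str]:
    """Return (route, text), where route is 'text' or 'ocr'."""
    text = parse_text(content)
    # Assumption: parsing counts as successful when it yields non-blank text.
    if text is not None and text.strip():
        return "text", text.strip()
    # Parsing failed: convert the content to a picture and hand it to OCR.
    return "ocr", run_ocr(to_image(content))
```

Passing the units in as parameters keeps the routing logic independent of any particular PDF or OCR library.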
Example 5. The apparatus of example 4, wherein the training process of the OCR model comprises: creating a data set corresponding to the acquisition items in the file content; training on the acquisition item content using a lightweight OCR detection model to obtain a text detection model for the acquisition item content, and generating an inference model corresponding to the detection model; training on the acquisition item content a plurality of times using an OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model; and running the trained detection model and recognition model in series to obtain the trained OCR model, which performs text detection and recognition on the acquisition item content.
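The serial arrangement of the detection model and the recognition model may be pictured with the following sketch. The `detect`, `crop` and `recognize` callables are hypothetical stand-ins for the two inference models and an image-cropping helper, and the top-to-bottom, left-to-right reading order is an assumption of this sketch.

```python
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1)


def serial_inference(
    image: Any,
    detect: Callable[[Any], List[Box]],  # hypothetical detection inference model
    crop: Callable[[Any, Box], Any],     # hypothetical cropping helper
    recognize: Callable[[Any], str],     # hypothetical recognition inference model
) -> List[Tuple[Box, str]]:
    """Run detection first, then recognition on each detected text region."""
    results = []
    # Sort by top edge, then left edge, so text comes out in reading order.
    for box in sorted(detect(image), key=lambda b: (b[1], b[0])):
        results.append((box, recognize(crop(image, box))))
    return results
```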
Example 6. The apparatus of example 1, wherein the document comprises a patent or trademark document.
Example 7. The apparatus of example 1, wherein the file attributes comprise attributes of a document in the file content, the document attributes comprising: name, naming matching rule, country or region, and official information.
Example 8. The apparatus of example 1, wherein the attributes of the acquisition item comprise one or more of: acquisition item name, type, general and special check rules, and returned field name.
Example 9. The apparatus of example 1, wherein the rule comprises one or more of: a special identifier of the acquisition item, a position in the document, a regular-expression recognition rule, and a check rule.
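For illustration only, the document attributes, acquisition item attributes and rules listed in examples 7 to 9 could be held in a configuration structure such as the one below. All class and field names, and the choice of a regular expression for the naming matching rule, are assumptions of this sketch rather than part of the described device.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RecognitionRule:
    special_identifier: Optional[str] = None  # marker text that introduces the item
    # Assumed position layout: (page, x0, y0, x1, y1)
    position: Optional[Tuple[int, float, float, float, float]] = None
    regex: Optional[str] = None               # regular-expression recognition rule
    check: Optional[str] = None               # check (validation) rule


@dataclass
class AcquisitionItem:
    name: str
    type: str
    return_field: str
    rule: RecognitionRule = field(default_factory=RecognitionRule)


@dataclass
class DocumentDefinition:
    name: str
    naming_rule: str      # assumed to be a regex matched against the file name
    country_region: str
    official_info: str
    items: List[AcquisitionItem] = field(default_factory=list)

    def matches(self, filename: str) -> bool:
        """Apply the naming matching rule to an uploaded file name."""
        return re.search(self.naming_rule, filename) is not None
```

Keeping the rule as data rather than code is one way to obtain the configurable handling of acquisition items that the description aims for.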
Example 10. The apparatus of example 1, wherein the text conversion unit is PDFBox and the OCR model is PaddleOCR.
Example 11. The apparatus of example 1, wherein the document is an insurance policy, a financial bill, a public government document at any level, or a medical document; and wherein the file content is in PDF or picture format.
Example 12. The apparatus of example 1, wherein the configuration module may actively pass a text content recognition mark to the task module. If the configuration module passes the mark, the task module creates a task that selects, based on the mark, either text recognition by the text conversion unit (e.g., PDFBox) or recognition by the OCR model unit. If the configuration module does not pass a mark, text recognition is first attempted with the text conversion unit, and a preliminary judgment is made, based on the length of the recognized text and whether it contains the special marks, as to whether the text conversion unit can be used for text recognition; if so, a text recognition task is created; if not, the OCR recognition unit is selected and an OCR recognition task is created. The text content recognition mark comprises a mark indicating that OCR recognition is to be used or that text conversion is to be performed directly.
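One possible reading of this task-creation logic is sketched below. The `min_length` threshold of 20 characters and the requirement that every configured special mark be present are assumptions chosen for the sketch; the description states only that text length and special marks are considered.

```python
from enum import Enum
from typing import Optional, Sequence


class TaskKind(Enum):
    TEXT = "text"  # recognition by the text conversion unit
    OCR = "ocr"    # recognition by the OCR model unit


def create_task(
    mark: Optional[TaskKind],
    extracted_text: str,
    min_length: int = 20,              # assumed threshold, not from the description
    special_marks: Sequence[str] = (),
) -> TaskKind:
    """Choose the task kind from an explicit mark or from a preliminary text pass."""
    if mark is not None:
        # The configuration module passed a mark: follow it directly.
        return mark
    text = extracted_text.strip()
    # Assumption: every configured special mark must appear in the text.
    has_marks = all(m in text for m in special_marks)
    if len(text) >= min_length and has_marks:
        return TaskKind.TEXT
    return TaskKind.OCR
```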
Example 13. The apparatus of example 4, wherein the acquisition module includes a rule recognition engine configured to receive the recognition task, parse and identify the acquisition item content according to the attributes defined in the configuration module, generate a task queue, and send the tasks in the queue to the text conversion unit on a first-in, first-out basis.
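A first-in, first-out task queue of the kind the rule recognition engine generates can be sketched with a double-ended queue. The class name and the `handler` callable, which stands in for the text conversion unit, are hypothetical.

```python
from collections import deque
from typing import Any, Callable, List


class TaskQueue:
    """Minimal FIFO queue: tasks are handled in the order they were submitted."""

    def __init__(self) -> None:
        self._tasks: deque = deque()

    def __len__(self) -> int:
        return len(self._tasks)

    def submit(self, task: Any) -> None:
        self._tasks.append(task)  # enqueue at the tail

    def drain(self, handler: Callable[[Any], Any]) -> List[Any]:
        """Send every queued task to the handler, oldest first."""
        results = []
        while self._tasks:
            results.append(handler(self._tasks.popleft()))  # dequeue from the head
        return results
```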
14. The apparatus of example 5, wherein creating a data set corresponding to the acquisition items in the file content during training of the OCR model comprises: converting the file content into pictures using the fitz and pymupdf libraries, and dividing the pictures into a training set and a validation set at a ratio of 2:1; and using the PPOCRLabel tool to label the positions of the acquisition items in the required file content and construct a data set based on the acquisition items of the file content.
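The picture conversion and the 2:1 split might look like the following sketch. PyMuPDF is imported under its `fitz` name; the 150 dpi rendering resolution, the output file naming, and the seeded shuffle before splitting are assumptions of this sketch, and labelling with PPOCRLabel is a manual step that is not shown.

```python
import random
from pathlib import Path
from typing import List, Sequence, Tuple


def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 150) -> List[str]:
    """Render each PDF page to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the split helper works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            target = out / f"{Path(pdf_path).stem}_{index:04d}.png"
            page.get_pixmap(dpi=dpi).save(str(target))
            paths.append(str(target))
    return paths


def split_2_to_1(paths: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly, then keep two thirds for training and one third for validation."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]
```

A fixed seed keeps the split reproducible between training runs, which helps when comparing fine-tuned models.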
Example 14. The apparatus of example 5, wherein, in training on the acquisition item content using the lightweight OCR detection model to obtain a text detection model for the acquisition item content and generating an inference model corresponding to the detection model, the text detection model is trained by fine-tuning the PP-OCRv3 lightweight detection model.
Example 15. The apparatus of example 5, wherein training on the acquisition item content a plurality of times using the OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model, comprises: training on the data set a plurality of times based on the ml_PP-OCRv3_det training model, and fine-tuning the parameters of the Student structure extracted from the ml_PP-OCRv3_det training model until the final hmean exceeds 0.8; training a character recognition model by fine-tuning the PP-OCRv3 lightweight detection model, and training on the data set a plurality of times based on the ch_ppr_server_v2.0_rec, en_number_mobile_v2.0_rec and japan_PP-OCRv3_rec models while fine-tuning the parameters of the extracted distillation structure until the final hmean exceeds 0.8; and finally converting the trained recognition model into an inference model.
Example 16. A method for identifying file content, the method comprising: S1: defining attributes of the file and of the acquisition items in the file, acquiring the position of each acquisition item in the file, and configuring the identification rules of the acquisition items; S2: creating a recognition task, preliminarily recognizing the file content, and classifying the recognition task for the file content as a character recognition task or an OCR recognition task according to the preliminary recognition result; S3: acquiring the content of the acquisition items in the file according to the recognition task, and recognizing the text in the acquisition item content according to the rules defined in the configuration module.
Example 17. The method of example 16, further comprising, prior to S1: reading a file uploaded by a user and parsing it to obtain parsed file content; and, after S3, further comprising: performing format conversion on the text in the recognized acquisition item content; verifying the text in the acquisition item content according to the check rules in the identification rules; and assembling the recognized text in the acquisition item content into data according to the attributes of the acquisition items of the file, and providing an API for external access.
Example 18. The method of example 16, wherein, in S3, text parsing is first performed on the acquisition item content; if the text parsing succeeds, the acquisition item content is recognized; and if the text parsing fails, the acquisition item content is converted into a picture format and subjected to OCR recognition, wherein the OCR recognition uses a trained OCR model to recognize the acquisition item content.
Example 19. The method of example 16, wherein the training process of the OCR model comprises: creating a data set corresponding to the acquisition items in the file content; training on the acquisition item content using a lightweight OCR detection model to obtain a text detection model for the acquisition item content, and generating an inference model corresponding to the detection model; training on the acquisition item content a plurality of times using an OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model; and running the trained detection model and recognition model in series to obtain the trained OCR model, which performs text detection and recognition on the acquisition item content.
Example 20. The method of example 16, wherein the document comprises a patent or trademark document.
Example 21. The method of example 16, wherein the file attributes comprise attributes of a document in the file content, the document attributes comprising: name, naming matching rule, country or region, and official information.
Example 22. The method of example 16, wherein the attributes of the acquisition item comprise one or more of: acquisition item name, type, general and special check rules, and returned field name.
Example 23. The method of example 16, wherein the rule comprises one or more of: a special identifier of the acquisition item, a position in the document, a regular-expression recognition rule, and a check rule.
Example 24. The method of example 16, wherein the text conversion unit is PDFBox and the OCR model is PaddleOCR.
Example 25. The method of example 16, wherein the document is an insurance policy, a financial bill, a public government document at any level, or a medical document.
Example 26. The method of example 16, wherein, during configuration, a text content recognition mark may be actively passed to the task creation process. If the mark is passed during configuration, the task creation process creates a task that selects, based on the mark, either text recognition by the text conversion unit (e.g., PDFBox) or recognition by the OCR model unit. If no mark is passed during configuration, text recognition is first attempted, and a preliminary judgment is made, based on the length of the recognized text and whether it contains the special marks, as to whether text recognition can be used; if so, a text recognition task is created; if not, OCR recognition is selected and an OCR recognition task is created; wherein the file content is in PDF or picture format.
Example 27. The method of example 16, wherein S3 further comprises: receiving the recognition task, parsing and identifying the acquisition item content according to the defined attributes, generating a task queue, and sending the tasks in the queue to the text conversion unit on a first-in, first-out basis.
25. The method of example 20, wherein creating a data set corresponding to the acquisition items in the file content during training of the OCR model comprises: converting the file content into pictures using the fitz and pymupdf libraries, and dividing the pictures into a training set and a validation set at a ratio of 2:1; and using the PPOCRLabel tool to label the positions of the acquisition items in the required file content and construct a data set based on the acquisition items of the file content.
Example 28. The method of example 19, wherein, in training on the acquisition item content using the lightweight OCR detection model to obtain a text detection model for the acquisition item content and generating an inference model corresponding to the detection model, the text detection model is trained by fine-tuning the PP-OCRv3 lightweight detection model.
Example 29. The method of example 19, wherein training on the acquisition item content a plurality of times using the OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model, comprises: training on the data set a plurality of times based on the ml_PP-OCRv3_det training model, and fine-tuning the parameters of the Student structure extracted from the ml_PP-OCRv3_det training model until the final hmean exceeds 0.8; training a character recognition model by fine-tuning the PP-OCRv3 lightweight detection model, and training on the data set a plurality of times based on the ch_ppr_server_v2.0_rec, en_number_mobile_v2.0_rec and japan_PP-OCRv3_rec models while fine-tuning the parameters of the extracted distillation structure until the final hmean exceeds 0.8; and finally converting the trained recognition model into an inference model.
Example 30. One or more non-transitory storage media having instructions stored thereon that, when executed by a processor, cause the processor to implement the method of any of examples 16-29.
Example 31. A word recognition processing system, the system comprising the apparatus of any of examples 1-15, or using the method of any of examples 16-29, or comprising the non-transitory storage medium of example 30.
The figures and detailed description of the invention referred to above as examples of the invention are intended to illustrate the invention, but not to limit the meaning or scope of the invention described in the claims. Accordingly, modifications may be readily made by one skilled in the art from the foregoing description. In addition, one skilled in the art may delete some of the constituent elements described herein without deteriorating the performance, or may add other constituent elements to improve the performance. Furthermore, one skilled in the art may vary the order of the steps of the methods described herein depending on the environment of the process or equipment. Thus, the scope of the invention should be determined not by the embodiments described above, but by the claims and their equivalents.
While the invention has been described in connection with what is presently considered to be practical, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. An identification device for file content, the device comprising:
a configuration module configured to: define attributes of the file and of the acquisition items in the file, acquire the position of each acquisition item in the file, and configure the identification rules of the acquisition items;
a task module configured to: create a recognition task, preliminarily recognize the content of the file, and classify the recognition task for the file content as a character recognition task or an OCR recognition task according to the preliminary recognition result; and
an acquisition module configured to: acquire the content of the acquisition items in the file according to the recognition task, and recognize the text in the acquisition item content according to the rules defined in the configuration module.
2. The apparatus of claim 1, wherein the apparatus further comprises:
a storage module configured to: read the file uploaded by the user, parse the file, and upload the parsed file content to the configuration module.
3. The apparatus of claim 1, wherein the apparatus further comprises: a reading module configured to:
perform format conversion on the text in the acquisition item content recognized by the acquisition module;
verify the text in the acquisition item content according to the check rules in the identification rules; and
assemble the recognized text in the acquisition item content into data according to the attributes of the acquisition items of the file, and provide an API for external access.
4. The apparatus according to claim 1, wherein the acquisition module further comprises a text conversion unit and an OCR model unit, wherein
the acquisition module first inputs the acquisition item content to the text conversion unit for text parsing; if the text parsing succeeds, the text conversion unit recognizes the acquisition item content; and if the text parsing fails, the acquisition item content is converted into a picture format and input to the OCR model unit, which uses a trained OCR model to recognize the acquisition item content and obtain the corresponding text.
5. The apparatus of claim 4, wherein the training process of the OCR model comprises:
creating a data set corresponding to the acquisition items in the file content;
training on the acquisition item content using a lightweight OCR detection model to obtain a text detection model for the acquisition item content, and generating an inference model corresponding to the detection model;
training on the acquisition item content a plurality of times using an OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model; and
running the trained detection model and recognition model in series to obtain the trained OCR model, which performs text detection and recognition on the acquisition item content.
6. The apparatus of claim 1, wherein the document comprises a patent or trademark document.
7. A method for identifying content of a file, the method comprising:
S1: defining attributes of the file and of the acquisition items in the file, acquiring the position of each acquisition item in the file, and configuring the identification rules of the acquisition items;
S2: creating a recognition task, preliminarily recognizing the file content, and classifying the recognition task for the file content as a character recognition task or an OCR recognition task according to the preliminary recognition result;
S3: acquiring the content of the acquisition items in the file according to the recognition task, and recognizing the text in the acquisition item content according to the rules defined in the configuration module.
8. The method of claim 7, further comprising, prior to S1: reading a file uploaded by a user and parsing it to obtain parsed file content; and
wherein step S3 further includes: performing format conversion on the text in the recognized acquisition item content; verifying the text in the acquisition item content according to the check rules in the identification rules; and assembling the recognized text in the acquisition item content into data according to the attributes of the acquisition items of the file, and providing an API for external access.
9. The method of claim 7, further comprising, in S3:
performing text parsing on the acquisition item content; if the text parsing succeeds, recognizing the acquisition item content; and if the text parsing fails, converting the acquisition item content into a picture format and performing OCR recognition, wherein the OCR recognition uses a trained OCR model to recognize the acquisition item content.
10. The method according to claim 9, wherein the training process of the OCR model comprises:
creating a data set corresponding to the acquisition items in the file content;
training on the acquisition item content using a lightweight OCR detection model to obtain a text detection model for the acquisition item content, and generating an inference model corresponding to the detection model;
training on the acquisition item content a plurality of times using an OCR training model to obtain a recognition model for the acquisition item content, and generating an inference model corresponding to the recognition model; and
running the trained detection model and recognition model in series to obtain the trained OCR model, which performs text detection and recognition on the acquisition item content.
CN202311800044.8A 2023-12-25 2023-12-25 Method and device for identifying file content Pending CN117688350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311800044.8A CN117688350A (en) 2023-12-25 2023-12-25 Method and device for identifying file content

Publications (1)

Publication Number Publication Date
CN117688350A true CN117688350A (en) 2024-03-12

Family

ID=90131845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311800044.8A Pending CN117688350A (en) 2023-12-25 2023-12-25 Method and device for identifying file content

Country Status (1)

Country Link
CN (1) CN117688350A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination