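WO2021121158A1 - Official document file processing method, apparatus, computer device, and storage medium - Google Patents
Official document file processing method, apparatus, computer device, and storage medium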
- Publication number
- WO2021121158A1 (PCT/CN2020/135718)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- format
- reviewed
- content
- document
- detection
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/418—Document matching, e.g. of document images
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- This application relates to the field of data analysis of big data, and in particular to a method, device, computer equipment, and storage medium for processing official documents.
- a method for processing official documents, including:
- identifying, through a preset BERT model, the content of all file components in the official document to be reviewed of the standard file type;
- obtaining a format detection result, a content detection result, and a layout detection result after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework;
- the text format detection includes calling the format detection rule corresponding to each file component content, extracting the text format keywords from the file component content, and obtaining the format detection result from the text format keywords and the format bar in the corresponding format detection rule;
- the text content detection includes performing content detection on the file component content to obtain the content detection result;
- the frame layout detection includes dividing the coordinate information of the official document to be reviewed of the standard file type, and performing frame layout detection on the document according to the divided coordinate information to obtain the layout detection result;
- generating detected error content from the format, content, and layout detection results, calling the standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at the preset location in the official document to be reviewed, and sending the successfully marked document to the preset receiving location according to the storage path specified by the user.
- An official document processing device including:
- the identification module is used to receive the review request, sent by the user, that contains the official document to be reviewed; to parse the format of the official document to be reviewed and obtain its file type; to obtain the official document to be reviewed of the standard file type; and to identify all the file component content in it through the preset BERT model;
- the acquisition module is used to obtain a format detection result, a content detection result, and a layout detection result after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework;
- the text format detection includes calling the format detection rule corresponding to each file component content, extracting the text format keywords from the file component content, and obtaining the format detection result from the text format keywords and the format bar in the corresponding format detection rule;
- the text content detection includes performing content detection on the file component content to obtain the content detection result;
- the frame layout detection includes dividing the coordinate information of the official document to be reviewed of the standard file type, and performing frame layout detection on the document according to the divided coordinate information to obtain the layout detection result;
- the sending module is used to generate detected error content from the format detection result, content detection result, and layout detection result, to call the standard writing rule corresponding to the detected error content, to mark the detected error content and the standard writing rule at the preset location in the official document to be reviewed, and to send the successfully marked official document to the preset receiving location according to the storage path specified by the user.
- a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor; when the processor executes the computer-readable instructions, it implements the steps of the official document processing method described above.
- One or more readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the official document processing method described above.
- the above-mentioned official document processing method, device, computer equipment, and storage medium use the preset text processing model built on a distributed framework to review all the standardization requirements of the official document to be reviewed at the same time (text format, text content, and frame layout). the review needs no manual work and can be completed quickly and accurately without missing any review point, which improves both review efficiency and review accuracy.
- the reviews of the individual standardization requirements run separately and do not affect each other, and the detected error content and standard writing rules are marked as annotations at the preset location in the official document to be reviewed, so that the user can modify the document directly according to the annotations.
- FIG. 1 is a schematic diagram of an application environment of a method for processing official documents in an embodiment of the present application
- FIG. 3 is a schematic diagram of the structure of an official document processing device in an embodiment of the present application.
- FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
- the official document file processing method provided in this application can be applied in the application environment as shown in Fig. 1, in which the client communicates with the server through the network.
- the client can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
- the server can be implemented as an independent server or a server cluster composed of multiple servers.
- a method for processing official documents is provided.
- the method is applied to the server in FIG. 1 as an example for description, including the following steps:
- the official documents to be reviewed can be official documents that are to be reviewed and standardized in various agencies and institutions.
- each official document to be reviewed is subject to at least one standardization requirement of a standard official document; the standardization requirements include but are not limited to file format, file content, text format, and layout.
- the user can select the official document to be reviewed on a display device and send the review request from it; the review request can cover all standardization requirements of the official document to be reviewed. identifying the official document to be reviewed of the standard file type is a process of converting structured text data (the document to be reviewed) into meaningful text data for text analysis; when the process ends, the file component content of the various structural components is obtained.
- a variety of analytical techniques, such as linguistic, statistical, and machine learning models, can be used in the recognition process.
- in this embodiment the recognition mainly analyzes the official document to be reviewed in order to mine and identify the content of all file components.
- the file component content mentioned in this embodiment includes the document number, document title, main sending unit, document body, document signature, document attachment, document note, and so on.
- the preset BERT model is a language representation model that can be used to analyze the file component content of the official document to be reviewed.
- the BERT model is trained as follows: the file component content of official documents to be reviewed is first labeled, and the BERT model is then built on it. before training, the existing word vectors in the BERT model can be enhanced with the file component content that has been successfully labeled, so that the distribution of the word vector representations better fits official documents under review; during training, the model can be continuously fine-tuned on the basis of bert-base so that the word vector distribution becomes more reasonable. finally, after all word vectors are trained, the classification result of the file component content of the reviewed official document (one category represents one kind of file component content) is obtained from the output position of the BERT model.
- the classification result is output as the probabilities corresponding to the different file component contents in the official document to be reviewed; after each probability is compared with the preset threshold, the file component content corresponding to the classification result can be determined.
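The threshold comparison on the classification output can be illustrated with a minimal sketch. It assumes the model yields one raw score per component category; the label list follows the components named in this embodiment, while the function names and the 0.5 threshold are hypothetical and not taken from the patent.

```python
import math

# Component categories named in the embodiment; their order is an assumption.
LABELS = ["document number", "document title", "main sending unit",
          "document body", "document signature", "document attachment",
          "document note"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_component(logits, threshold=0.5):
    """Return (label, probability); label is None when below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[best] if probs[best] >= threshold else None
    return label, probs[best]
```

A text block whose best probability stays under the threshold is left unclassified rather than forced into a category.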
- an interface is set up on the server specifically to receive the official document to be reviewed that the user uploads together with the review request.
- the official document uploaded by the user may have one of multiple file types, including but not limited to .docx, .doc, or .pdf.
- the file type conversion module in the server can be used to convert the file type of the official document to be reviewed into the required standard file type.
- the standard file type can be any of .docx, .doc, or .pdf.
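As a rough illustration of the file type step, the sketch below only decides whether a conversion is needed. The accepted set mirrors the three extensions listed here (the text says the list is not exhaustive), and `plan_conversion` with its default standard type is a hypothetical helper, not the patent's conversion module.

```python
from pathlib import Path

# Extensions named in the text; the real list is described as open-ended.
ACCEPTED_TYPES = {".docx", ".doc", ".pdf"}

def plan_conversion(filename, standard_type=".docx"):
    """Return None if the file already has the standard type,
    otherwise a (source_type, target_type) pair describing the conversion."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ACCEPTED_TYPES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if suffix == standard_type:
        return None
    return (suffix, standard_type)
```

The actual conversion would be delegated to whatever tool the server uses; this sketch covers only the decision.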
- this embodiment also uses the preset BERT model to identify and confirm the file component content of the various structures in the structured official document to be reviewed, which facilitates subsequent data processing of one or more of the file component contents.
- S20: After synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework, obtain the format detection result, the content detection result, and the layout detection result;
- the text format detection includes calling the format detection rule corresponding to each file component content, extracting the text format keywords from the file component content, and obtaining the format detection result from the text format keywords and the format bar in the corresponding format detection rule;
- the text content detection includes performing content detection on the file component content to obtain the content detection result;
- the frame layout detection includes dividing the coordinate information of the official document to be reviewed of the standard file type, and performing frame layout detection on the document according to the divided coordinate information to obtain the layout detection result;
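Step S20 runs three independent checks side by side. The sketch below uses a local thread pool as a stand-in for the distributed framework, which the text does not name; the three detector functions are deliberately trivial placeholders and every name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_format(components):
    # Placeholder: flag a document number that lacks a bracketed year.
    number = components.get("document number", "")
    ok = "[" in number and "]" in number
    return [] if ok else ["document number: missing bracketed year"]

def detect_content(components):
    # Placeholder: flag an empty document body.
    body = components.get("document body", "")
    return [] if body.strip() else ["document body: empty"]

def detect_layout(components):
    # Placeholder: a real check would work on coordinates, not on text.
    return []

DETECTORS = {"format": detect_format,
             "content": detect_content,
             "layout": detect_layout}

def run_all_detections(components):
    """Submit the three checks together and gather their results by name."""
    with ThreadPoolExecutor(max_workers=len(DETECTORS)) as pool:
        futures = {name: pool.submit(fn, components)
                   for name, fn in DETECTORS.items()}
        return {name: future.result() for name, future in futures.items()}
```

Because the checks share no state, one failing or slow detector does not change the output of the others.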
- the format detection result is produced by the rule engine.
- the execution of the rule engine is mainly divided into accepting data input, interpreting preset rules, and making rule decisions according to the preset rules.
- this embodiment uses the format detection rules to perform file format detection on the file component content of the official document to be reviewed: while the rule engine runs, keyword search technology applies the corresponding format detection rules to each file component content. the format detection rules for different component contents differ, and each component content corresponds to at least one format detection rule.
- for example, suppose the official document to be reviewed has a document number, and the format bar for the document number is issuing unit code + year + serial number. the keyword search technology in the rule engine extracts the text format keywords corresponding to the document number in the official document to be reviewed and judges whether they are the same as the text format keywords required by the format bar in the format detection rule. if they agree, the file format of that file component content is confirmed to be correct; otherwise it is incorrect (other file component contents are tested in the same way).
- the rule engine is used for file format detection because it runs detection from preset rules without coding, so the format detection rules can be modified easily when the format requirements for official documents change; it also improves detection speed, and the format detection results that it outputs for the various file component contents can be recorded centrally in the rule engine for easy export and use.
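The three phases of the rule engine (accept input, interpret preset rules, decide) can be sketched with rules kept as data, so that a rule change needs no code change. The JSON layout, the pattern, and the function names below are assumptions made for illustration.

```python
import json
import re

# Rules live in data (here a JSON string), so they can be edited without
# touching the engine. The rule content is an illustrative assumption.
RULES_JSON = r"""
{
  "document number": [
    {"name": "issuing unit code + [year] + serial number",
     "pattern": "[A-Za-z]+\\[[0-9]{4}\\][0-9]+"}
  ]
}
"""

def load_rules(text):
    """Interpret the preset rules from their stored form."""
    return json.loads(text)

def decide(rules, component, keyword):
    """Apply every rule registered for the component to the extracted keyword."""
    return [
        {"component": component,
         "rule": rule["name"],
         "passed": re.fullmatch(rule["pattern"], keyword) is not None}
        for rule in rules.get(component, [])
    ]
```

Collecting the decisions in one list mirrors the idea of recording results centrally for later export.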
- the content detection result is produced by the NLP algorithm engine.
- the NLP (natural language processing) algorithm engine is a natural language processing algorithm engine. its basic task is to determine the syntactic structure of a sentence or the dependency relationships between the words in a sentence (here the sentences are those contained in the various file component contents).
- the tasks of the NLP algorithm engine can be summarized as typo recognition, slang recognition, name recognition, and part-of-speech tagging.
- the NLP algorithm engine in this embodiment uses the corresponding typo recognition, slang recognition, name recognition, and part-of-speech tagging models to detect whether the expression, word combination, and punctuation in the file component content of the official document to be reviewed are correct.
- the expression errors include but are not limited to typos, repeated text, and slang or Internet terms. for example, in "Statistics Information Center Center Health and Medical Big Data", the repeated word "center" is a repeated-text error among the above-mentioned expression errors.
- “here” in the phrase “data … here” is slang, which is the slang case among the above-mentioned expression errors.
- a combination error mainly means that certain collocations of words should not appear in official documents to be reviewed for specific scenes; for example, in official documents to be reviewed for the meeting minutes scene, personal names are generally not collocated directly with verbs. punctuation errors include regular punctuation errors and fixed collocation errors.
- the NLP algorithm engine is used to detect the file component content after the text is parsed because it can produce the content detection result more accurately on the basis of human thinking and language habits, which improves detection efficiency.
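The repeated-text case from the example above ("Center Center") is the simplest of these checks to illustrate. The sketch below is a plain word-level comparison, not the NLP engine's model; it assumes whitespace-delimited text, and Chinese text would first need a word segmenter.

```python
import re

def find_repeated_words(text):
    """Return (word, index) for each word that directly repeats the one before it."""
    words = re.findall(r"\w+", text)
    return [(cur, i)
            for i, (prev, cur) in enumerate(zip(words, words[1:]), start=1)
            if prev.lower() == cur.lower()]
```

The index is the position of the second occurrence in the word list, which gives a place to attach an annotation.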
- the layout detection result is produced by the OCR algorithm engine.
- the OCR (Optical Character Recognition) algorithm engine is an optical character recognition algorithm engine, which recognizes optical characters through image processing and pattern recognition technology.
- the OCR algorithm engine mainly converts the official document to be reviewed of the standard file type into an official document to be reviewed of the preset file type, in order to perform frame layout detection on it (where the frame layout includes text format and typesetting). it takes any two sides of a page of the document as coordinate axes, analyzes each text block in the document to obtain the coordinate information of the optical characters in each text block, and uses that coordinate information to determine whether the text format and typesetting of the document are consistent with the text format and typesetting requirements. for example, the document signature in the official document to be reviewed is required to be two blank lines below the closing words and kept to the right; all the page numbers in the official
- the OCR algorithm engine is used for frame layout detection because it has a low misrecognition rate and a high recognition speed for the optical characters in the official document to be reviewed, and the recognized optical characters are then used to determine whether the text format and typesetting of the document are correct.
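Once text blocks carry coordinates, the signature example reduces to arithmetic on bounding boxes. The sketch below assumes a top-left origin with y growing downward, treats "two blank lines" as a vertical gap of two line heights, and uses hypothetical margin and tolerance values; none of these conventions are stated in the patent.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    label: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge

def check_signature_layout(closing, signature, page_width, line_height,
                           right_margin=90.0, tolerance=2.0):
    """Return a list of layout problems for the signature block (empty if none)."""
    problems = []
    gap = signature.y0 - closing.y1
    if abs(gap - 2 * line_height) > tolerance:
        problems.append("signature is not two blank lines below the closing words")
    if abs((page_width - right_margin) - signature.x1) > tolerance:
        problems.append("signature is not aligned to the right margin")
    return problems
```

The tolerance absorbs small OCR coordinate jitter so that a one-pixel offset is not reported as an error.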
- S30: Generate detected error content from the format detection result, content detection result, and layout detection result, call the standard writing rule corresponding to the detected error content, mark the detected error content and the standard writing rule at the preset location in the official document to be reviewed, and send the successfully marked official document to the preset receiving location according to the storage path specified by the user.
- the above-mentioned format detection result, content detection result, and layout detection result are obtained through the preset text processing model, which includes a rule engine that performs format detection (corresponding to the format detection result), an NLP algorithm engine that performs content detection (corresponding to the content detection result), and an OCR algorithm engine that performs frame layout detection (corresponding to the layout detection result); the three engines are deployed in a distributed framework. the detected error content includes all the results that indicate an error among the format detection results, content detection results, and layout detection results.
- each error detection result in the detected error content corresponds to at least one standard writing rule; the preset location is the location in the official document to be reviewed that corresponds to the detected error content and the standard writing rule, that is, the location where the error occurred.
- marking the detected error content and the standard writing rules at the preset locations in the official document to be reviewed lets users clearly see the errors and their causes, and makes it convenient for users to correct the detected errors according to the standard writing rules.
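Step S30 can be pictured as filtering the three result sets down to failures and pairing each failure with a writing rule and a location. The result record shape (`passed`, `location`, `message`) and the rule texts below are hypothetical; the patent does not list the standard writing rules.

```python
# Hypothetical rule texts, one per detection dimension.
STANDARD_RULES = {
    "format": "Follow the format bar registered for this component.",
    "content": "Remove typos, repeated words and slang.",
    "layout": "Follow the required spacing and alignment.",
}

def build_annotations(detections):
    """Keep only failed checks and attach the matching standard writing rule."""
    annotations = []
    for dimension, results in detections.items():
        for result in results:
            if result["passed"]:
                continue
            annotations.append({
                "location": result["location"],
                "error": result["message"],
                "rule": STANDARD_RULES[dimension],
            })
    # Sort by location so the annotations follow the reading order.
    return sorted(annotations, key=lambda a: a["location"])
```

Each annotation carries both the error and the rule, so the reader sees what is wrong and how it should be written at the same place.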
- the standard writing rules are stored in a blockchain; the step of generating the detected error content from the format detection result, content detection result, and layout detection result, calling the standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at the preset location in the official document to be reviewed, and sending the successfully marked official document to the preset receiving location according to the storage path specified by the user also includes:
- generating the detected error content, calling the standard writing rule corresponding to the detected error content, marking the detected error content, the scoring result, and the standard writing rule at the preset location in the official document to be reviewed, and sending the successfully marked official document to the preset receiving location according to the storage path specified by the user.
- the preset scoring model has preset scoring tables with the scores corresponding to the various detection results.
- the preset scoring model looks up, in the scoring table, the score corresponding to the detection result in each dimension and sums the scores of the detection results (format detection result, content detection result, and layout detection result) to obtain the scoring result of the official document to be reviewed (the scoring result includes the total score and the score corresponding to each single detection result).
- the scoring rules corresponding to the scoring table can be set according to the needs. For example, the scoring rule is that if a typo or wrong punctuation appears in the official document to be reviewed, 2 points will be deducted, and 10 points will be deducted.
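A table-driven scorer along these lines might look as follows. Only the 2-point deduction for a typo or wrong punctuation comes from the text; the full score of 100, the other deduction values, and all names are assumptions.

```python
# Deduction per error kind. Only the 2-point typo/punctuation value is from
# the text; the remaining values are illustrative assumptions.
SCORING_TABLE = {"typo": 2, "punctuation": 2, "format": 5, "layout": 5}

# Which detection dimension each error kind belongs to.
DIMENSION_OF = {"typo": "content", "punctuation": "content",
                "format": "format", "layout": "layout"}

def score_document(errors, full_score=100):
    """Return the total score and the deduction per detection dimension."""
    deductions = {"format": 0, "content": 0, "layout": 0}
    for kind in errors:
        deductions[DIMENSION_OF[kind]] += SCORING_TABLE[kind]
    total = max(0, full_score - sum(deductions.values()))
    return {"total": total, "deductions": deductions}
```

Keeping the per-dimension deductions next to the total matches the description of a scoring result that reports both.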
- the above standard writing rules can also be stored in a node of a blockchain.
- the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- a blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods. each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- the decentralized and fully distributed DNS service provided by the blockchain can realize the query and resolution of domain names through the point-to-point data transmission service between the nodes in the network; it can be used to ensure that the operating system and firmware of important infrastructure have not been tampered with, to monitor the status and integrity of software and detect malicious tampering, and to ensure that transmitted data has not been tampered with.
- storing the standard writing rules in the blockchain can ensure the privacy and security of the standard writing rules.
- the file component content includes a document title; the calling of the format detection rule corresponding to each file component content includes:
- the document title in the file component content of the official document to be reviewed can be used to determine the document type of the document, because official document writing requires that one of the 15 official document types be reflected in the document title; content that does not belong to the 15 official document types is judged to be of another document type.
- This embodiment is mainly for determining whether the official document to be reviewed belongs to the preset language, so that the next processing of the official document to be reviewed in the preset language can be further performed.
- the method further includes:
- the official document file to be reviewed in this embodiment is uploaded by the user.
- This embodiment is mainly used to exclude
- one format detection rule includes a format bar of at least one data type and a combination of each of the data types
- the content of a document corresponds to a format detection rule
- the document number corresponds to the format detection rule issuing-unit code + year + serial number (平保发〔201X〕X号)
- the date in the signature block corresponds to the format detection rule numeric year + numeric month + numeric day (201X年XX月XX日). The issuing-unit code, year, serial number, and numerals are all data types, while "issuing-unit code + year + serial number" and "numeric year + numeric month + numeric day" are format bars, that is, combinations of data types. This embodiment therefore first uses the rule engine to call the format detection rule corresponding to each document component content, which determines the format bar of each component; it then uses the keyword search technology of the rule engine to detect whether the text format keywords in each document component content are consistent with the format bar in the corresponding rule, for example whether the numerals in the "numeric year + numeric month + numeric day" format bar of the signature date are Arabic numerals
- the method further includes:
- the official document to be reviewed in this embodiment may contain tables or charts, which are not conventional document component content. In that case the content of the table or chart can be parsed and converted into conventional document component content, so that the document component content of the official document remains complete throughout the review.
- the division of the coordinate information of the official document to be reviewed of the standard document type, and the frame format detection of the official document to be reviewed according to the divided coordinate information includes:
- the OCR algorithm engine is used to input the official document to be reviewed of the preset file type into the file block division model associated with the genre of the official document to be reviewed, to receive the divided text blocks output by the file block division model, and to extract the coordinate information of the divided text blocks; the coordinate information represents the size and position of the divided text blocks;
- frame layout detection is performed on the text format and typesetting of the official document to be reviewed of the standard document type.
- for the official document to be reviewed to be recognized and checked reliably by the OCR algorithm engine,
- the official document to be reviewed of the standard file type can first be converted into an official document to be reviewed of the preset file type (for example, the PDF file type), which keeps the recognition and detection process stable.
- the file block division model divides the official document to be reviewed of the preset file type into multiple text blocks that are easy to recognize. Each text block contains at least one optical character, and each optical character corresponds to at least one piece of coordinate information, from which the size and position of the optical character within the text block can be determined. The coordinate information of the optical characters in a text block can therefore be used to determine whether the text format and typesetting of the block are consistent with the text format and typesetting requirements, which achieves the detection of the text format and typesetting of the official document to be reviewed.
- a preset text processing model constructed through a distributed framework can simultaneously realize the review of multiple standardized requirements for official documents to be reviewed (including the text format and text content of the official document to be reviewed).
- this not only requires no manual effort but also completes the review of an official document quickly and accurately and ensures that no review point is missed, so review efficiency and review accuracy are improved; in addition, the review of each standardization requirement is carried out separately without affecting the others, and the detected error content and the standard writing rules are annotated as comments at preset positions in the official document to be reviewed, so that the user can revise the document directly according to the comments.
- an official document processing device is provided, and the official document processing device corresponds to the official document processing method in the above-mentioned embodiment one-to-one.
- the official document processing device includes an identification module 11, an acquisition module 12 and a sending module 13.
- the detailed description of each functional module is as follows:
- the identification module 11 is used to receive the review request sent by the user and containing the official document to be reviewed, to parse the format of the official document to be reviewed and obtain its file type, then to obtain the official document to be reviewed of the standard file type, and to identify all document component content in the official document to be reviewed of the standard file type through the preset BERT model;
- the acquisition module 12 is used to obtain the format detection result, the content detection result, and the layout detection result after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule;
- the text content detection includes performing content detection on the document component content and then obtaining the content detection result;
- the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
- the sending module 13 is used to generate detected error content based on the format detection result, the content detection result, and the layout detection result, to call the standard writing rules corresponding to the detected error content, to annotate the detected error content and the standard writing rules at preset positions in the official document to be reviewed, and to send the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.
- the standard writing rules are stored in a blockchain, and the sending module includes:
- the first obtaining sub-module is configured to input the format detection result, content detection result, and layout detection result into a preset scoring model for scoring, and obtain the scoring result of the official document to be reviewed output by the preset scoring model;
- the sending sub-module is used to generate detected error content based on the format detection result, the content detection result, and the layout detection result, to call the standard writing rules corresponding to the detected error content, to annotate the detected error content, the scoring result, and the standard writing rules at preset positions in the official document to be reviewed, and to send the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.
- the acquisition module includes:
- the first determining sub-module is configured to determine the document genre of the official document to be reviewed according to the document title of the official document to be reviewed;
- the first calling sub-module is configured to use the rule engine to call the format detection rule corresponding to each document component content when the document genre belongs to a preset genre;
- the prompting sub-module is used to prompt that the official document to be reviewed is not an official document when the document genre does not belong to a preset genre.
- the official document processing device further includes:
- the rejection module is used to prompt the user to re-upload the official document to be reviewed and reject the current review request when the document content does not exist in the official document to be reviewed.
- the acquisition module includes:
- the second calling sub-module is configured to use the rule engine to call the format detection rule corresponding to each document component content; one format detection rule contains at least one data type and a format bar formed by combining the data types;
- the second determining sub-module is used to extract the text format keywords in the document component content through the keyword search technology of the rule engine and to determine whether the text format keywords are consistent with the format bar in the corresponding format detection rule; one document component content corresponds to at least one text format keyword;
- the second obtaining sub-module is used to obtain the format detection result stating that the file format of the document component content is correct when the text format keywords in the document component content are consistent with the format bar in the corresponding format detection rule;
- the third obtaining sub-module is used to obtain the format detection result stating that the file format of the document component content is incorrect when the text format keywords in the document component content are inconsistent with the format bar in the corresponding format detection rule.
- the official document processing device further includes:
- the recording module is used to analyze the table when it is detected that there is a table in the official document file to be reviewed, and record the content of each table in the table after the analysis as the file component content.
- the acquisition module includes:
- the conversion sub-module is used to convert the official document to be reviewed of the standard document type into a preset document type to obtain the official document to be reviewed of the preset document type;
- the extraction sub-module is configured to use the OCR algorithm engine to input the official document to be reviewed of the preset file type into the file block division model associated with the genre of the official document to be reviewed, to receive the divided text blocks output by the file block division model, and to extract the coordinate information of the divided text blocks; the coordinate information represents the size and position of the divided text blocks;
- the detection sub-module is used to perform frame layout detection on the text format and typesetting of the official document to be reviewed of the standard document type according to the coordinate information.
- Each module in the above-mentioned official document processing device can be implemented in whole or in part by software, hardware, and a combination thereof.
- the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
- the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a readable storage medium and an internal memory.
- the readable storage medium stores an operating system, computer readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
- the database of the computer equipment is used to store the data involved in the method of processing official documents.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instructions are executed by the processor to realize a method for processing official documents.
- the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- a computer device including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
- when the processor executes the computer-readable instructions, the steps of the official document processing method in the above-mentioned embodiment are implemented, for example, steps S10 to S30 shown in FIG. 2.
- alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the official document processing apparatus in the above-mentioned embodiment are realized, for example, the functions of modules 11 to 13 shown in FIG. 3. To avoid repetition, they are not described again here.
- one or more readable storage media storing computer readable instructions are provided.
- the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the official document processing method in the above-mentioned embodiment, for example, steps S10 to S30 shown in FIG. 2.
- alternatively, the one or more processors realize the functions of the modules/units of the official document processing apparatus in the foregoing embodiment, for example, the functions of modules 11 to 13 shown in FIG. 3. To avoid repetition, they are not described again here.
- Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
- the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Document Processing Apparatus (AREA)
- Machine Translation (AREA)
Abstract
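This application belongs to the field of big data and relates in particular to an official document processing method and apparatus, a computer device, and a storage medium. The method includes: after parsing the format of an official document to be reviewed, obtaining the official document to be reviewed of a standard file type and identifying all document component content in it; after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model, obtaining a format detection result, a content detection result, and a layout detection result; and generating detected error content from these three results, calling the standard writing rules corresponding to the detected error content, and annotating the detected error content and the standard writing rules in the official document to be reviewed. This application also relates to blockchain technology: the standard writing rules are stored in a blockchain. This application can improve the efficiency of reviewing official documents.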
Description
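This application claims priority to the Chinese patent application filed with the China Patent Office on June 10, 2020, with application number 202010523793.0 and the invention title "Official document processing method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the field of big-data analysis, and in particular to an official document processing method and apparatus, a computer device, and a storage medium.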
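Background
At present, before an official document is issued, its compliance with the applicable standards must first be reviewed. In the prior art, the formatting and other requirements of the document are reviewed exhaustively by hand, which requires reviewers to know thoroughly every review standard for the different constituent elements of the 15 official document genres defined in the Regulations on the Handling of Official Documents by Party and Government Organs (《党政机关公文处理工作条例》). Government departments, however, produce official documents in large volumes, and the inventor realized that reviewing each document manually, word by word and sentence by sentence, is time-consuming and labor-intensive and very prone to missing review points. Those skilled in the art therefore urgently need a method that can review official documents automatically and accurately to solve these problems.
Summary
On this basis, in view of the above technical problems, it is necessary to provide an official document processing method, apparatus, computer device, and storage medium that review official documents automatically, so as to improve review efficiency and review accuracy.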
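An official document processing method, including:
receiving a review request sent by a user and containing an official document to be reviewed; parsing the format of the official document to be reviewed and obtaining its file type; then obtaining the official document to be reviewed of a standard file type, and identifying, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework, obtaining a format detection result, a content detection result, and a layout detection result; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
generating detected error content from the format detection result, the content detection result, and the layout detection result; calling the standard writing rules corresponding to the detected error content; annotating the detected error content and the standard writing rules at preset positions in the official document to be reviewed; and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.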
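An official document processing apparatus, including:
an identification module, configured to receive a review request sent by a user and containing an official document to be reviewed, to parse the format of the official document to be reviewed and obtain its file type, then to obtain the official document to be reviewed of a standard file type, and to identify, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
an acquisition module, configured to obtain a format detection result, a content detection result, and a layout detection result after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
a sending module, configured to generate detected error content from the format detection result, the content detection result, and the layout detection result, to call the standard writing rules corresponding to the detected error content, to annotate the detected error content and the standard writing rules at preset positions in the official document to be reviewed, and to send the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.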
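A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
receiving a review request sent by a user and containing an official document to be reviewed; parsing the format of the official document to be reviewed and obtaining its file type; then obtaining the official document to be reviewed of a standard file type, and identifying, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework, obtaining a format detection result, a content detection result, and a layout detection result; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
generating detected error content from the format detection result, the content detection result, and the layout detection result; calling the standard writing rules corresponding to the detected error content; annotating the detected error content and the standard writing rules at preset positions in the official document to be reviewed; and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.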
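One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a review request sent by a user and containing an official document to be reviewed; parsing the format of the official document to be reviewed and obtaining its file type; then obtaining the official document to be reviewed of a standard file type, and identifying, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework, obtaining a format detection result, a content detection result, and a layout detection result; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
generating detected error content from the format detection result, the content detection result, and the layout detection result; calling the standard writing rules corresponding to the detected error content; annotating the detected error content and the standard writing rules at preset positions in the official document to be reviewed; and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.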
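With the above official document processing method, apparatus, computer device, and storage medium, a preset text processing model built on a distributed framework reviews an official document against multiple standardization requirements at the same time (covering its text format, text content, and frame layout). This not only requires no manual effort but also completes the review of a document quickly and accurately and ensures that no review point is missed, so review efficiency and review accuracy are improved. In addition, the review of each standardization requirement is carried out separately without affecting the others, and the detected error content and the standard writing rules are annotated as comments at preset positions in the official document to be reviewed, so that the user can revise the document directly according to the comments.
The details of one or more embodiments of this application are set forth in the drawings and description below; other features and advantages of this application will become apparent from the description, the drawings, and the claims.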
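Brief description of the drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the official document processing method in an embodiment of this application;
FIG. 2 is a flowchart of the official document processing method in an embodiment of this application;
FIG. 3 is a schematic structural diagram of the official document processing apparatus in an embodiment of this application;
FIG. 4 is a schematic diagram of the computer device in an embodiment of this application.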
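Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of this application.
The official document processing method provided by this application can be applied in the application environment shown in FIG. 1, in which a client communicates with a server over a network. The client may be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device. The server may be implemented as a standalone server or as a cluster of servers.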
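In one embodiment, as shown in FIG. 2, an official document processing method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
S10: receiving a review request sent by a user and containing an official document to be reviewed; parsing the format of the official document to be reviewed and obtaining its file type; then obtaining the official document to be reviewed of a standard file type, and identifying, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
Understandably, the official document to be reviewed may be an official document of any organ or institution whose compliance is to be examined. Each official document to be reviewed is subject to at least one of the standardization requirements that a standard official document must meet, including but not limited to file format, document content, text format, and typesetting. Specifically, the user can select the review requirements for the document on a display device, and the display device then issues the review request; the review requirements may cover all standardization requirements of the document. Identifying the official document to be reviewed of the standard file type is a text-parsing process that converts structured text data (the official document to be reviewed) into meaningful text data; when it finishes, the document component content of each structural component is obtained. Several parsing techniques, such as linguistic, statistical, and machine-learning models, can be used in the identification process. In this embodiment the aim is to analyze, mine, and identify all document component content in the official document to be reviewed; the document component content mentioned in this embodiment includes the document number, the title, the main recipient, the body text, the signature block, the attachments, and the notes. The preset BERT model is a language representation model that can be used to analyze the document component content. It is trained as follows: first, the document component content in the training documents is labeled; next, the BERT model is built, and before training, the existing word vectors of the model can be further trained on the labeled component content so that the distribution of the word-vector representations fits official documents more closely; during training, the model can be fine-tuned repeatedly on the basis of bert-base to make the word-vector distribution more reasonable; finally, once all word vectors have been trained, the classification result for the document component content (one class represents one document component content) is obtained at the output of the BERT model. The classification result takes the form of the probabilities of the different document component contents, and comparing each probability with its preset threshold determines the document component content to which the result corresponds. In this embodiment, the server provides a dedicated interface to receive the official document that the user uploads together with the review request. The uploaded document may have any of several file types, including but not limited to .docx, .doc, and .pdf. To identify documents uniformly and quickly, a file type conversion module in the server can therefore convert the document into the required standard file type, which may be any one of .docx, .doc, or .pdf. This embodiment also uses the preset BERT model to identify and confirm the document component content of each structure in the structured official document to be reviewed, which facilitates subsequent data processing of one or more document component contents.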
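The last step of the classification described above, comparing each class probability with its preset threshold, can be sketched as follows. This is a minimal illustration only: the label names, the threshold values, and the `pick_component` helper are assumptions made for the example, and the probabilities are taken as given rather than produced by an actual BERT model.

```python
from typing import Dict, Optional

# Hypothetical labels and thresholds; the source names the components
# but does not give any threshold values.
THRESHOLDS: Dict[str, float] = {
    "document_number": 0.5,
    "title": 0.5,
    "main_recipient": 0.5,
    "body": 0.5,
    "signature": 0.5,
}


def pick_component(probabilities: Dict[str, float]) -> Optional[str]:
    """Return the most probable label that reaches its threshold, or None."""
    candidates = [
        (p, label)
        for label, p in probabilities.items()
        # Unknown labels get a threshold of 1.0, so they are effectively rejected.
        if p >= THRESHOLDS.get(label, 1.0)
    ]
    if not candidates:
        return None
    return max(candidates)[1]


if __name__ == "__main__":
    print(pick_component({"title": 0.9, "body": 0.05}))
```

A text segment whose probabilities all stay below their thresholds is left unclassified here; how such segments are handled in the actual system is not stated in the source.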
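S20: after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework, obtaining a format detection result, a content detection result, and a layout detection result; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
Understandably, the format detection result is produced by a rule engine, whose execution consists mainly of accepting input data, interpreting preset rules, and making rule decisions according to those rules. Specifically, this embodiment checks the file format of the document component content against format detection rules: while the rule engine runs, keyword search is used to apply to each document component content its corresponding format detection rule. Because the rules for different components differ, each component corresponds to at least one format detection rule. For example, an official document carries a document number, whose format bar consists of issuing-unit code + year + serial number. The keyword search of the rule engine extracts the text format keywords of the document number in the document to be reviewed and determines whether they are consistent with the format detection rule for document numbers (that is, with the text format keywords required by the format bar). This yields a detection result stating whether the file format of the document number is correct or incorrect (consistency means the format of that component is correct, inconsistency means it is incorrect; other components are checked in the same way). This embodiment uses a rule engine for format detection because a rule engine executes checks from preset rules without programming, so the format detection rules can easily be modified to follow changes in the formatting requirements; a rule engine also speeds up detection, and the format detection results it outputs for each document component content can be recorded centrally in the engine for export.
The content detection result is produced by an NLP (natural language processing) algorithm engine, whose basic task is to determine the syntactic structure of a sentence or the dependency relations between the words in a sentence (here, the sentences contained in the document component content). In this embodiment the tasks of the NLP algorithm engine can be summarized as typo recognition, slang recognition, person-name recognition, and part-of-speech tagging. Specifically, the NLP algorithm engine uses the corresponding typo recognition, slang recognition, person-name recognition, and part-of-speech tagging models to detect whether the wording, word combinations, and punctuation of the document content are correct. Wording errors include but are not limited to typos, repeated text, and slang or internet expressions. For example, in "统计信息中心中心健康医疗大数据" ("Statistical Information Center Center health and medical big data"), the word "中心" ("Center") appears twice in a row, which is a repeated-text error; in "打这儿开始收集健康医疗数据" ("starting from here, collect health and medical data"), "打这儿" is slang, which is a slang error. Combination errors are mainly pairings of words that should not appear in documents for a particular scenario; for example, in documents for the meeting-minutes scenario, "人民" ("the people") is generally not combined directly with a verb. Punctuation errors include general punctuation errors and errors in fixed collocations; for example, the subheading number "一" must be followed by "、", and the quotation marks and plus sign in "科技+金融" ("technology + finance") must be used correctly. Using the NLP algorithm engine to check the content of the parsed document components allows content detection results to be obtained fairly accurately, in line with human habits of thought and language, and improves detection efficiency.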
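The repeated-text example above ("中心" occurring twice in a row) can be illustrated with a small heuristic. This is only a sketch under stated assumptions: it uses a regular expression with a backreference rather than the trained recognition models the source describes, and the `find_repeats` name and the 2-to-4-character window are choices made for the example.

```python
import re
from typing import List

# A run of 2 to 4 CJK characters immediately followed by the same run.
# Assumption: single-character reduplication (e.g. "刚刚") is ignored on purpose.
REPEAT = re.compile(r"([\u4e00-\u9fff]{2,4})\1")


def find_repeats(sentence: str) -> List[str]:
    """Return each character run that is immediately repeated in the sentence."""
    return [match.group(1) for match in REPEAT.finditer(sentence)]


if __name__ == "__main__":
    print(find_repeats("统计信息中心中心健康医疗大数据"))
```

A heuristic like this only produces candidates: legitimate reduplications of two or more characters would also be flagged, which is one reason a model-based check is the more plausible choice in practice.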
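The layout detection result is produced by an OCR (optical character recognition) algorithm engine, which recognizes optical characters by means of image processing and pattern recognition. Specifically, the OCR algorithm engine in this embodiment converts the official document to be reviewed from the standard file type into a preset file type and performs frame layout detection on the converted document (the frame layout comprises text format and typesetting). Taking any two edges of a page of the document as coordinate axes, it parses each text block in the document to obtain the coordinate information of the optical characters in that block, and uses this coordinate information to determine whether the text format and typesetting of the document are consistent with the requirements, and hence whether they are correct. For example, the signature block must be separated from the closing text by two blank lines and be right-aligned; all page numbers must be set differently for odd and even pages, in SimSun (宋体) font at size Small Four (小四), with odd page numbers on the right, indented one character from the right, even page numbers on the left, indented one character from the left, and a horizontal line "—" on each side of every page number. If the text format and/or typesetting identified from the coordinate information of the optical characters is inconsistent with these requirements, the text format and/or typesetting of the document is determined to contain errors; otherwise it is correct. This embodiment uses the OCR algorithm engine for frame layout detection of the document of the standard file type because the engine has a low misrecognition rate and a high recognition speed for the optical characters in the document, and these optical characters are then used to determine whether the text format and typesetting are correct.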
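One of the layout rules above, odd page numbers on the right and even page numbers on the left, can be checked from coordinates alone. The sketch below assumes that the page-number text has already been located and that its bounding box is known; the `Box` type, the `page_number_side_ok` helper, and the page width of 595 points are illustrative assumptions, not details from the source.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a piece of text; x grows to the right, y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def page_number_side_ok(page_no: int, box: Box, page_width: float) -> bool:
    """Odd page numbers must sit in the right half, even ones in the left half."""
    center = (box.x0 + box.x1) / 2
    on_right = center > page_width / 2
    return on_right if page_no % 2 == 1 else not on_right


if __name__ == "__main__":
    right_box = Box(500, 800, 540, 815)
    print(page_number_side_ok(1, right_box, 595.0))
    print(page_number_side_ok(2, right_box, 595.0))
```

The one-character indent and the font requirements would need character widths and font metadata in addition to the box position, so they are left out of this sketch.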
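S30: generating detected error content from the format detection result, the content detection result, and the layout detection result; calling the standard writing rules corresponding to the detected error content; annotating the detected error content and the standard writing rules at preset positions in the official document to be reviewed; and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.
Understandably, the format detection result, content detection result, and layout detection result are obtained through the preset text processing model. The model comprises a rule engine that performs format detection (producing the format detection result), an NLP algorithm engine that performs document content detection (producing the content detection result), and an OCR algorithm engine that performs frame layout detection (producing the layout detection result); the three engines are each deployed in the distributed framework. The detected error content contains every erroneous result among the format, content, and layout detection results, and each erroneous result corresponds to at least one standard writing rule. A preset position is the position in the official document to be reviewed that corresponds to the detected error content and the standard writing rule, that is, the position where the error occurs. This embodiment annotates the detected error content and the standard writing rules at the preset positions so that the user can see clearly what is wrong and why, and can then correct the detected error content according to the standard writing rules.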
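The idea of running the three engines side by side and then merging their erroneous results can be sketched as follows. A local thread pool stands in for the distributed framework, which the source does not specify, and the three check functions are placeholders supplied by the caller; `run_checks` and `collect_errors` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A check takes the document text and returns a list of error messages.
Check = Callable[[str], List[str]]


def run_checks(document: str, checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check concurrently and return the errors found by each one."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn, document) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def collect_errors(results: Dict[str, List[str]]) -> List[str]:
    """Flatten the per-check results into one list of detected errors."""
    return [f"{name}: {msg}" for name, msgs in results.items() for msg in msgs]


if __name__ == "__main__":
    demo_checks: Dict[str, Check] = {
        "format": lambda doc: [] if "〔" in doc else ["document number malformed"],
        "content": lambda doc: [],
        "layout": lambda doc: ["signature block not right-aligned"],
    }
    print(collect_errors(run_checks("sample text", demo_checks)))
```

Because each check receives the same input and returns its own list, the checks stay independent of one another, which matches the source's statement that the reviews do not affect each other.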
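Further, the standard writing rules are stored in a blockchain, and the step of generating detected error content from the format detection result, the content detection result, and the layout detection result, calling the standard writing rules corresponding to the detected error content, annotating the detected error content and the standard writing rules at preset positions in the official document to be reviewed, and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user further includes:
inputting the format detection result, the content detection result, and the layout detection result into a preset scoring model for scoring, and obtaining the scoring result of the official document to be reviewed output by the preset scoring model;
generating detected error content from the format detection result, the content detection result, and the layout detection result; calling the standard writing rules corresponding to the detected error content; annotating the detected error content, the scoring result, and the standard writing rules at preset positions in the official document to be reviewed; and sending the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.
Understandably, the preset scoring model contains a preset scoring table of the scores that correspond to the various detection results. The model looks up, in each dimension of the table, the score corresponding to a detection result and sums the scores of the detection results (format, content, and layout) to obtain the scoring result of the official document to be reviewed, which includes both the overall score and the score for each individual detection result. The scoring rules behind the table can be set as required; for example, each typo or wrong punctuation mark in the document deducts 2 points, up to a maximum deduction of 10 points.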
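The example scoring rule (2 points deducted per error, capped at 10) can be written down directly. In the sketch below the full score of 100, the application of the cap to each dimension separately, and the `score_document` name are assumptions; the source gives only the per-error deduction and the cap.

```python
from typing import Dict

PER_ERROR = 2      # points deducted per detected error (from the example rule)
CAP = 10           # maximum deduction (assumed here to apply per dimension)
FULL_SCORE = 100   # assumed full score; not stated in the source


def score_document(error_counts: Dict[str, int]) -> Dict[str, object]:
    """Return the per-dimension deductions and the resulting total score."""
    deductions = {dim: min(n * PER_ERROR, CAP) for dim, n in error_counts.items()}
    return {"deductions": deductions, "total": FULL_SCORE - sum(deductions.values())}


if __name__ == "__main__":
    print(score_document({"format": 1, "content": 7, "layout": 0}))
```

Returning both the per-dimension deductions and the total mirrors the source's description of a scoring result that contains the overall score as well as the score for each individual detection result.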
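It should also be emphasized that, to further ensure the privacy and security of the standard writing rules, the rules can also be stored in a node of a blockchain. The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, in which each block contains a batch of network transaction information that is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer. The decentralized, fully distributed DNS service provided by the blockchain can query and resolve domain names through the point-to-point data transmission service between the nodes of the network. It can be used to ensure that the operating system and firmware of an important piece of infrastructure have not been tampered with, to monitor the state and integrity of software, to detect malicious tampering, and to ensure that transmitted data has not been tampered with. Storing the standard writing rules in the blockchain can therefore ensure their privacy and security.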
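The property that makes a chain of blocks tamper-evident, each block committing to the hash of the previous one, can be shown with a toy example that stores one writing rule per block. This is a simplified sketch: it has no network, no consensus mechanism, and no encryption, and the block layout and function names are assumptions rather than the storage scheme used in the application.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder previous-hash for the first block


def block_hash(index: int, prev_hash: str, payload: str) -> str:
    """SHA-256 over the block's index, previous hash, and payload."""
    data = json.dumps(
        {"index": index, "prev": prev_hash, "payload": payload},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build_chain(rules: List[str]) -> List[Dict[str, object]]:
    """Store each rule in its own block, linked to the block before it."""
    chain: List[Dict[str, object]] = []
    prev = GENESIS
    for index, rule in enumerate(rules):
        digest = block_hash(index, prev, rule)
        chain.append({"index": index, "prev": prev, "payload": rule, "hash": digest})
        prev = digest
    return chain


def verify(chain: List[Dict[str, object]]) -> bool:
    """Recompute every hash and check every link; False if anything was altered."""
    prev = GENESIS
    for block in chain:
        expected = block_hash(block["index"], block["prev"], block["payload"])
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True


if __name__ == "__main__":
    chain = build_chain(["rule A", "rule B"])
    print(verify(chain))
    chain[0]["payload"] = "rule A (edited)"
    print(verify(chain))
```

The sketch demonstrates integrity checking only; the privacy of the stored rules would depend on access control or encryption, which are outside its scope.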
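Further, the document component content includes a document title, and calling the format detection rule corresponding to each document component content includes:
determining the document genre of the official document to be reviewed according to its document title;
when the document genre belongs to a preset genre, using the rule engine to call the format detection rule corresponding to each document component content;
when the document genre does not belong to a preset genre, prompting that the official document to be reviewed is not an official document.
Understandably, when the document under review is an official document, the document title among its document component content can be used to determine the document genre, because official document writing requires the title to state one of the 15 official document genres; content that does not belong to these 15 genres is judged to be of another genre. This embodiment mainly determines whether the official document to be reviewed belongs to a preset genre, so that official documents of a preset genre can go on to the next processing step.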
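Determining the genre from the title can be approximated by checking how the title ends. The sketch below lists the 15 genres named in the Regulations on the Handling of Official Documents by Party and Government Organs cited in the Background; matching on the title's suffix and the `detect_genre` helper are assumptions made for illustration, since the source does not say how the title is analyzed.

```python
from typing import Dict, Optional

# Title suffix -> genre. "命令" and "令" are two spellings of the same genre.
GENRE_SUFFIXES: Dict[str, str] = {
    "决议": "决议", "决定": "决定", "命令": "命令(令)", "令": "命令(令)",
    "公报": "公报", "公告": "公告", "通告": "通告", "意见": "意见",
    "通知": "通知", "通报": "通报", "报告": "报告", "请示": "请示",
    "批复": "批复", "议案": "议案", "函": "函", "纪要": "纪要",
}


def detect_genre(title: str) -> Optional[str]:
    """Return the genre whose name ends the title, or None for other documents."""
    title = title.strip()
    # Try longer suffixes first so that "命令" wins over "令".
    for suffix in sorted(GENRE_SUFFIXES, key=len, reverse=True):
        if title.endswith(suffix):
            return GENRE_SUFFIXES[suffix]
    return None


if __name__ == "__main__":
    print(detect_genre("关于开展安全检查的通知"))
    print(detect_genre("年度工作总结"))
```

A title that ends with a qualifier after the genre name, such as a parenthetical note, would not be matched by this simple suffix test.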
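Further, after parsing the format of the official document to be reviewed, the method further includes:
when the official document to be reviewed contains no file content, prompting the user to upload the official document to be reviewed again and rejecting the current review request.
Understandably, the official document to be reviewed in this embodiment is uploaded by the user. When the uploaded document contains no file content, there is nothing to review; this embodiment mainly serves to exclude such documents and so avoid adding to the server's workload. Whether file content exists can be determined by copying from an arbitrary region of the document and checking whether anything is pasted, or by recognition with a text and digit recognition model.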
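Further, calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule includes:
using the rule engine to call the format detection rule corresponding to each document component content; one format detection rule contains at least one data type and a format bar formed by combining the data types;
extracting the text format keywords in the document component content through the keyword search technology of the rule engine, and determining whether the text format keywords are consistent with the format bar in the corresponding format detection rule; one document component content corresponds to at least one text format keyword;
when the text format keywords in the document component content are consistent with the format bar in the corresponding format detection rule, obtaining the format detection result stating that the file format of the document component content is correct;
when the text format keywords in the document component content are inconsistent with the format bar in the corresponding format detection rule, obtaining the format detection result stating that the file format of the document component content is incorrect.
Specifically, one document component content corresponds to one format detection rule. The document number corresponds to the rule issuing-unit code + year + serial number (平保发〔201X〕X号), and the date in the signature block corresponds to the rule numeric year + numeric month + numeric day (201X年XX月XX日). The issuing-unit code, year, serial number, and numerals are all data types, while "issuing-unit code + year + serial number" and "numeric year + numeric month + numeric day" are format bars, that is, combinations of data types. This embodiment therefore first uses the rule engine to call the format detection rule corresponding to each document component content, which determines the format bar of each component. It then uses the keyword search technology of the rule engine to detect whether the text format keywords in each document component content are consistent with the format bar in the corresponding rule, for example whether the numerals in the "numeric year + numeric month + numeric day" format bar of the signature date are Arabic numerals. Finally, the outcome of this comparison determines the format detection result, which achieves the detection of whether the file format of the official document to be reviewed is correct or incorrect.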
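The two format bars given as examples, the document number and the signature date, can be expressed as regular expressions. This is a minimal sketch assuming that a format bar can be represented as a pattern; the `FORMAT_BARS` table and the `check_format` helper are hypothetical and stand in for the rule engine, whose actual rule syntax the source does not describe.

```python
import re
from typing import Dict, Pattern

# One pattern per document component. [0-9] is used instead of \d so that
# only ASCII Arabic numerals are accepted.
FORMAT_BARS: Dict[str, Pattern[str]] = {
    # issuing-unit code + 〔year〕 + serial number + 号, e.g. 平保发〔2019〕3号
    "document_number": re.compile(r"[\u4e00-\u9fff]+〔[0-9]{4}〕[0-9]+号"),
    # numeric year + numeric month + numeric day, e.g. 2019年6月10日
    "signature_date": re.compile(r"[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日"),
}


def check_format(component: str, text: str) -> bool:
    """Return True when the whole text matches the component's format bar."""
    return FORMAT_BARS[component].fullmatch(text.strip()) is not None


if __name__ == "__main__":
    print(check_format("document_number", "平保发〔2019〕3号"))
    print(check_format("signature_date", "二〇一九年六月十日"))
```

Keeping the patterns in a table rather than in code paths reflects the point made in the description of the rule engine: a rule can be changed without touching the checking logic.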
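Further, after obtaining the official document to be reviewed of the standard file type, the method further includes:
when a table is detected in the official document to be reviewed, parsing the table and recording each piece of table content in the parsed table as document component content.
Understandably, the official document to be reviewed in this embodiment may contain tables or charts, which are not conventional document component content. In that case the content of the table or chart can be parsed and converted into conventional document component content, so that the document component content of the official document remains complete throughout the review.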
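Recording each cell of a parsed table as its own piece of document component content might look like the sketch below. It assumes the table has already been extracted into rows of strings; the record layout, the `table_to_components` name, and the choice to skip empty cells are all assumptions made for the example.

```python
from typing import Dict, List


def table_to_components(table: List[List[str]], table_id: str = "table1") -> List[Dict[str, str]]:
    """Turn every non-empty cell into a component record that remembers its origin."""
    components: List[Dict[str, str]] = []
    for row_index, row in enumerate(table):
        for col_index, cell in enumerate(row):
            text = cell.strip()
            if text:  # empty cells carry no content to review
                components.append(
                    {"source": f"{table_id}[{row_index}][{col_index}]", "content": text}
                )
    return components


if __name__ == "__main__":
    for record in table_to_components([["项目", "金额"], ["差旅", " 100 "]]):
        print(record)
```

Keeping the cell's position in each record would let a later annotation step point back to the exact cell in which an error was found.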
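Further, dividing coordinate information for the official document to be reviewed of the standard file type and performing frame layout detection on the official document to be reviewed according to the divided coordinate information includes:
converting the official document to be reviewed of the standard file type into a preset file type to obtain the official document to be reviewed of the preset file type;
using the OCR algorithm engine to input the official document to be reviewed of the preset file type into the file block division model associated with the genre of the official document to be reviewed, receiving the divided text blocks output by the file block division model, and extracting the coordinate information of the divided text blocks; the coordinate information represents the size and position of the divided text blocks;
performing frame layout detection on the text format and typesetting of the official document to be reviewed of the standard file type according to the coordinate information.
Understandably, for the official document to be reviewed to be recognized and checked reliably by the OCR algorithm engine, the document of the standard file type can first be converted into a document of the preset file type (for example, the PDF file type), which keeps the recognition and detection process stable. The file block division model divides the official document to be reviewed of the preset file type into multiple text blocks that are easy to recognize. Each text block contains at least one optical character, and each optical character corresponds to at least one piece of coordinate information, from which the size and position of the optical character within the text block are determined. The coordinate information of the optical characters in a text block can therefore be used to determine whether the text format and typesetting of the block are consistent with the text format and typesetting requirements, which achieves the detection of the text format and typesetting of the official document to be reviewed.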
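How character coordinates determine the size and position of a text block, and how that supports a typesetting check, can be sketched as follows. The example derives a block's bounding box from its character boxes and then tests whether the block is right-aligned, as is required of the signature block; the tuple layout, the 5-unit tolerance, and the function names are assumptions.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) with x growing to the right and y growing downward.
Box = Tuple[float, float, float, float]


def block_box(char_boxes: List[Box]) -> Box:
    """Smallest box enclosing all character boxes of one text block."""
    xs0, ys0, xs1, ys1 = zip(*char_boxes)  # raises ValueError on an empty list
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def is_right_aligned(block: Box, right_margin_x: float, tolerance: float = 5.0) -> bool:
    """A block is right-aligned when its right edge sits on the right margin."""
    return abs(block[2] - right_margin_x) <= tolerance


if __name__ == "__main__":
    signature = block_box([(400, 700, 420, 720), (420, 700, 440, 720), (440, 701, 500, 719)])
    print(signature, is_right_aligned(signature, right_margin_x=500))
```

Other typesetting requirements, such as the number of blank lines between two blocks, could be derived in a similar way from the vertical distance between block boxes together with the line pitch.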
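In summary, the above provides an official document processing method in which a preset text processing model built on a distributed framework reviews an official document against multiple standardization requirements at the same time (covering its text format, text content, and frame layout). This not only requires no manual effort but also completes the review of a document quickly and accurately and ensures that no review point is missed, so review efficiency and review accuracy are improved. In addition, the review of each standardization requirement is carried out separately without affecting the others, and the detected error content and the standard writing rules are annotated as comments at preset positions in the official document to be reviewed, so that the user can revise the document directly according to the comments.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic and should not limit the implementation of the embodiments of this application in any way.
In one embodiment, an official document processing apparatus is provided, which corresponds one-to-one to the official document processing method in the above embodiment. As shown in FIG. 3, the apparatus includes an identification module 11, an acquisition module 12, and a sending module 13. The functional modules are described in detail as follows: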
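The identification module 11 is configured to receive a review request sent by a user and containing an official document to be reviewed, to parse the format of the official document to be reviewed and obtain its file type, then to obtain the official document to be reviewed of a standard file type, and to identify, through a preset BERT model, all document component content in the official document to be reviewed of the standard file type;
The acquisition module 12 is configured to obtain a format detection result, a content detection result, and a layout detection result after synchronously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; the text format detection includes calling the format detection rule corresponding to each document component content, extracting the text format keywords in the document component content, and obtaining the format detection result according to the text format keywords and the format bar in the corresponding format detection rule; the text content detection includes performing content detection on the document component content and then obtaining the content detection result; the frame layout detection includes dividing coordinate information for the official document to be reviewed of the standard file type, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result;
The sending module 13 is configured to generate detected error content from the format detection result, the content detection result, and the layout detection result, to call the standard writing rules corresponding to the detected error content, to annotate the detected error content and the standard writing rules at preset positions in the official document to be reviewed, and to send the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.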
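Further, the standard writing rules are stored in a blockchain, and the sending module includes:
a first obtaining sub-module, configured to input the format detection result, the content detection result, and the layout detection result into a preset scoring model for scoring, and to obtain the scoring result of the official document to be reviewed output by the preset scoring model;
a sending sub-module, configured to generate detected error content from the format detection result, the content detection result, and the layout detection result, to call the standard writing rules corresponding to the detected error content, to annotate the detected error content, the scoring result, and the standard writing rules at preset positions in the official document to be reviewed, and to send the successfully annotated official document to be reviewed to a preset receiving location according to the storage path specified by the user.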
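Further, the acquisition module includes:
a first determining sub-module, configured to determine the document genre of the official document to be reviewed according to its document title;
a first calling sub-module, configured to use the rule engine to call the format detection rule corresponding to each document component content when the document genre belongs to a preset genre;
a prompting sub-module, configured to prompt that the official document to be reviewed is not an official document when the document genre does not belong to a preset genre.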
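Further, the official document processing apparatus further includes:
a rejection module, configured to prompt the user to upload the official document to be reviewed again and to reject the current review request when the official document to be reviewed contains no file content.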
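Further, the acquisition module includes:
a second calling sub-module, configured to use the rule engine to call the format detection rule corresponding to each document component content; one format detection rule contains at least one data type and a format bar formed by combining the data types;
a second determining sub-module, configured to extract the text format keywords in the document component content through the keyword search technology of the rule engine and to determine whether the text format keywords are consistent with the format bar in the corresponding format detection rule; one document component content corresponds to at least one text format keyword;
a second obtaining sub-module, configured to obtain the format detection result stating that the file format of the document component content is correct when the text format keywords in the document component content are consistent with the format bar in the corresponding format detection rule;
a third obtaining sub-module, configured to obtain the format detection result stating that the file format of the document component content is incorrect when the text format keywords in the document component content are inconsistent with the format bar in the corresponding format detection rule.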
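Further, the official document processing apparatus further includes:
a recording module, configured to parse a table when one is detected in the official document to be reviewed, and to record each piece of table content in the parsed table as document component content.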
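Further, the acquisition module includes:
a conversion sub-module, configured to convert the official document to be reviewed of the standard file type into a preset file type to obtain the official document to be reviewed of the preset file type;
an extraction sub-module, configured to use the OCR algorithm engine to input the official document to be reviewed of the preset file type into the file block division model associated with the genre of the official document to be reviewed, to receive the divided text blocks output by the file block division model, and to extract the coordinate information of the divided text blocks; the coordinate information represents the size and position of the divided text blocks;
a detection sub-module, configured to perform frame layout detection on the text format and typesetting of the official document to be reviewed of the standard file type according to the coordinate information.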
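For the specific limitations of the official document processing apparatus, refer to the limitations of the official document processing method above; they are not repeated here. Each module of the apparatus can be implemented wholly or partly in software, hardware, or a combination of the two. The modules can be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.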
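In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database stores the data involved in the official document processing method. The network interface communicates with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement an official document processing method. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the steps of the official document processing method in the above embodiment, for example steps S10 to S30 shown in FIG. 2. Alternatively, when executing the computer-readable instructions, the processor implements the functions of the modules/units of the official document processing apparatus in the above embodiment, for example the functions of modules 11 to 13 shown in FIG. 3. To avoid repetition, they are not described again here.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided; the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage medium stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the official document processing method in the above embodiment, for example steps S10 to S30 shown in FIG. 2. Alternatively, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to implement the functions of the modules/units of the official document processing apparatus in the above embodiment, for example the functions of modules 11 to 13 shown in FIG. 3. To avoid repetition, they are not described again here.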
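A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be carried out by computer-readable instructions that direct the relevant hardware. The computer-readable instructions can be stored in a non-volatile or volatile readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, in which each block contains a batch of network transaction information that is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.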
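Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is only an example. In practice, the functions can be assigned to different functional units or modules as required; that is, the internal structure of the apparatus can be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in those embodiments can still be modified, or some of their technical features replaced with equivalents; such modifications or replacements do not take the essence of the corresponding technical solutions outside the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the scope of protection of this application.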
Claims (20)
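- An official document processing method, comprising: receiving a review request sent by a user and containing an official document to be reviewed; after performing format parsing on the official document to be reviewed and obtaining the file type of the official document to be reviewed, obtaining the official document to be reviewed in a standard file type, and identifying, through a preset BERT model, all document component contents in the official document to be reviewed in the standard file type; obtaining a format detection result, a content detection result, and a layout detection result after simultaneously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; wherein the text format detection comprises invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule; the text content detection comprises performing content detection on the document component contents and then obtaining the content detection result; the frame layout detection comprises dividing the official document to be reviewed in the standard file type into coordinate information, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result; and generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user.
- The official document processing method according to claim 1, wherein the generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user comprises: inputting the format detection result, the content detection result, and the layout detection result into a preset scoring model for scoring, and obtaining a scoring result for the official document to be reviewed output by the preset scoring model; and generating the detected error content from the format detection result, the content detection result, and the layout detection result, invoking the standard writing rule corresponding to the detected error content, marking the detected error content, the scoring result, and the standard writing rule at the preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to the preset receiving location according to the storage path specified by the user.
- The official document processing method according to claim 1, wherein the invoking a format detection rule corresponding to each document component content comprises: determining the document genre of the official document to be reviewed according to the document title of the official document to be reviewed; when the document genre belongs to a preset genre, using the rule engine to invoke the format detection rule corresponding to each document component content; and when the document genre does not belong to a preset genre, prompting that the official document to be reviewed is not an official document.
- The official document processing method according to claim 1, further comprising, after the format parsing is performed on the official document to be reviewed: when no file content exists in the official document to be reviewed, prompting the user to re-upload the official document to be reviewed and rejecting the current review request.
- The official document processing method according to claim 1, wherein the invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule comprises: using a rule engine to invoke the format detection rule corresponding to each document component content, wherein one format detection rule contains format entries of at least one data type and of combinations of the data types; extracting the text format keywords from the document component content through the keyword search technique of the rule engine, and determining whether each text format keyword is consistent with the format entry in its corresponding format detection rule, wherein one document component content corresponds to at least one text format keyword; when the text format keywords in the document component content are consistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is correct; and when the text format keywords in the document component content are inconsistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is incorrect.
- The official document processing method according to claim 1, further comprising, after the obtaining the official document to be reviewed in the standard file type: when a table is detected in the official document to be reviewed, parsing the table and recording each table content of the parsed table as a document component content.
- The official document processing method according to claim 1, wherein the dividing the official document to be reviewed in the standard file type into coordinate information and performing frame layout detection on the official document to be reviewed according to the divided coordinate information comprises: converting the official document to be reviewed in the standard file type into a preset file type to obtain the official document to be reviewed in the preset file type; using an OCR algorithm engine to input the official document to be reviewed in the preset file type into a document block division model associated with the text genre of the official document to be reviewed, receiving the divided text blocks output by the document block division model, and extracting coordinate information of the divided text blocks, wherein the coordinate information represents the size and position of the divided text blocks; and performing frame layout detection on the text format and typesetting of the official document to be reviewed in the standard file type according to the coordinate information.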
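- An official document processing apparatus, comprising: an identification module, configured to receive a review request sent by a user and containing an official document to be reviewed, and, after performing format parsing on the official document to be reviewed and obtaining the file type of the official document to be reviewed, to obtain the official document to be reviewed in a standard file type and identify, through a preset BERT model, all document component contents in the official document to be reviewed in the standard file type; an acquisition module, configured to obtain a format detection result, a content detection result, and a layout detection result after simultaneously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; wherein the text format detection comprises invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule; the text content detection comprises performing content detection on the document component contents and then obtaining the content detection result; the frame layout detection comprises dividing the official document to be reviewed in the standard file type into coordinate information, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result; and a sending module, configured to generate detected error content from the format detection result, the content detection result, and the layout detection result, invoke a standard writing rule corresponding to the detected error content, mark the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and send the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user.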
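- A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions: receiving a review request sent by a user and containing an official document to be reviewed; after performing format parsing on the official document to be reviewed and obtaining the file type of the official document to be reviewed, obtaining the official document to be reviewed in a standard file type, and identifying, through a preset BERT model, all document component contents in the official document to be reviewed in the standard file type; obtaining a format detection result, a content detection result, and a layout detection result after simultaneously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; wherein the text format detection comprises invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule; the text content detection comprises performing content detection on the document component contents and then obtaining the content detection result; the frame layout detection comprises dividing the official document to be reviewed in the standard file type into coordinate information, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result; and generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user.
- The computer device according to claim 9, wherein the generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user comprises: inputting the format detection result, the content detection result, and the layout detection result into a preset scoring model for scoring, and obtaining a scoring result for the official document to be reviewed output by the preset scoring model; and generating the detected error content from the format detection result, the content detection result, and the layout detection result, invoking the standard writing rule corresponding to the detected error content, marking the detected error content, the scoring result, and the standard writing rule at the preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to the preset receiving location according to the storage path specified by the user.
- The computer device according to claim 9, wherein the invoking a format detection rule corresponding to each document component content comprises: determining the document genre of the official document to be reviewed according to the document title of the official document to be reviewed; when the document genre belongs to a preset genre, using the rule engine to invoke the format detection rule corresponding to each document component content; and when the document genre does not belong to a preset genre, prompting that the official document to be reviewed is not an official document.
- The computer device according to claim 9, wherein the invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule comprises: using a rule engine to invoke the format detection rule corresponding to each document component content, wherein one format detection rule contains format entries of at least one data type and of combinations of the data types; extracting the text format keywords from the document component content through the keyword search technique of the rule engine, and determining whether each text format keyword is consistent with the format entry in its corresponding format detection rule, wherein one document component content corresponds to at least one text format keyword; when the text format keywords in the document component content are consistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is correct; and when the text format keywords in the document component content are inconsistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is incorrect.
- The computer device according to claim 9, wherein, after the obtaining the official document to be reviewed in the standard file type, the processor further implements the following step when executing the computer-readable instructions: when a table is detected in the official document to be reviewed, parsing the table and recording each table content of the parsed table as a document component content.
- The computer device according to claim 9, wherein the dividing the official document to be reviewed in the standard file type into coordinate information and performing frame layout detection on the official document to be reviewed according to the divided coordinate information comprises: converting the official document to be reviewed in the standard file type into a preset file type to obtain the official document to be reviewed in the preset file type; using an OCR algorithm engine to input the official document to be reviewed in the preset file type into a document block division model associated with the text genre of the official document to be reviewed, receiving the divided text blocks output by the document block division model, and extracting coordinate information of the divided text blocks, wherein the coordinate information represents the size and position of the divided text blocks; and performing frame layout detection on the text format and typesetting of the official document to be reviewed in the standard file type according to the coordinate information.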
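- One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps: receiving a review request sent by a user and containing an official document to be reviewed; after performing format parsing on the official document to be reviewed and obtaining the file type of the official document to be reviewed, obtaining the official document to be reviewed in a standard file type, and identifying, through a preset BERT model, all document component contents in the official document to be reviewed in the standard file type; obtaining a format detection result, a content detection result, and a layout detection result after simultaneously performing text format detection, text content detection, and frame layout detection through a preset text processing model built on a distributed framework; wherein the text format detection comprises invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule; the text content detection comprises performing content detection on the document component contents and then obtaining the content detection result; the frame layout detection comprises dividing the official document to be reviewed in the standard file type into coordinate information, performing frame layout detection on the official document to be reviewed according to the divided coordinate information, and obtaining the layout detection result; and generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user.
- The readable storage media according to claim 15, wherein the generating detected error content from the format detection result, the content detection result, and the layout detection result, invoking a standard writing rule corresponding to the detected error content, marking the detected error content and the standard writing rule at a preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to a preset receiving location according to a storage path specified by the user comprises: inputting the format detection result, the content detection result, and the layout detection result into a preset scoring model for scoring, and obtaining a scoring result for the official document to be reviewed output by the preset scoring model; and generating the detected error content from the format detection result, the content detection result, and the layout detection result, invoking the standard writing rule corresponding to the detected error content, marking the detected error content, the scoring result, and the standard writing rule at the preset position in the official document to be reviewed, and sending the successfully marked official document to be reviewed to the preset receiving location according to the storage path specified by the user.
- The readable storage media according to claim 15, wherein the invoking a format detection rule corresponding to each document component content comprises: determining the document genre of the official document to be reviewed according to the document title of the official document to be reviewed; when the document genre belongs to a preset genre, using the rule engine to invoke the format detection rule corresponding to each document component content; and when the document genre does not belong to a preset genre, prompting that the official document to be reviewed is not an official document.
- The readable storage media according to claim 15, wherein the invoking a format detection rule corresponding to each document component content, extracting text format keywords from the document component content, and obtaining the format detection result according to the text format keywords and the format entries in the corresponding format detection rule comprises: using a rule engine to invoke the format detection rule corresponding to each document component content, wherein one format detection rule contains format entries of at least one data type and of combinations of the data types; extracting the text format keywords from the document component content through the keyword search technique of the rule engine, and determining whether each text format keyword is consistent with the format entry in its corresponding format detection rule, wherein one document component content corresponds to at least one text format keyword; when the text format keywords in the document component content are consistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is correct; and when the text format keywords in the document component content are inconsistent with the format entries in the corresponding format detection rule, obtaining a format detection result indicating that the file format of the document component content is incorrect.
- The readable storage media according to claim 15, wherein, after the obtaining the official document to be reviewed in the standard file type, the computer-readable instructions, when executed by one or more processors, further cause the one or more processors to perform the following step: when a table is detected in the official document to be reviewed, parsing the table and recording each table content of the parsed table as a document component content.
- The readable storage media according to claim 15, wherein the dividing the official document to be reviewed in the standard file type into coordinate information and performing frame layout detection on the official document to be reviewed according to the divided coordinate information comprises: converting the official document to be reviewed in the standard file type into a preset file type to obtain the official document to be reviewed in the preset file type; using an OCR algorithm engine to input the official document to be reviewed in the preset file type into a document block division model associated with the text genre of the official document to be reviewed, receiving the divided text blocks output by the document block division model, and extracting coordinate information of the divided text blocks, wherein the coordinate information represents the size and position of the divided text blocks; and performing frame layout detection on the text format and typesetting of the official document to be reviewed in the standard file type according to the coordinate information.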
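The flow recited in the independent claims (three detections run side by side, results merged into detected errors, each error paired with a writing rule) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's implementation: a thread pool stands in for the distributed framework, the component dictionaries stand in for the output of the BERT recognition and OCR block division steps, and every name and value here (`FORMAT_RULES`, `WRITING_RULES`, `check_format`, `check_content`, `check_layout`, `review`, the fonts, sizes, and margins) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical format detection rules: component name -> expected format entries.
FORMAT_RULES = {
    "title": {"font": "FontA", "size": 22},
    "body": {"font": "FontB", "size": 16},
}

# Hypothetical standard writing rules, keyed by the kind of error detected.
WRITING_RULES = {
    "format": "Use the font and size prescribed for this component.",
    "content": "A component must not be empty.",
    "layout": "Text blocks must stay inside the page margins.",
}


def check_format(components):
    """Compare each component's observed format with the entries of its rule."""
    errors = []
    for comp in components:
        rule = FORMAT_RULES.get(comp["name"])
        if rule is None:
            continue  # no rule registered for this component
        observed = {key: comp["format"].get(key) for key in rule}
        if observed != rule:
            errors.append(("format", comp["name"]))
    return errors


def check_content(components):
    """Placeholder content check: flag components whose text is blank."""
    return [("content", c["name"]) for c in components if not c["text"].strip()]


def check_layout(components, page_width=210.0, margin=25.0):
    """Use block coordinates (x, y, width, height) to flag margin violations."""
    errors = []
    for comp in components:
        x, _y, width, _height = comp["box"]
        if x < margin or x + width > page_width - margin:
            errors.append(("layout", comp["name"]))
    return errors


def review(components):
    """Run the three detections concurrently and attach a writing rule to each error."""
    checks = (check_format, check_content, check_layout)
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        futures = [pool.submit(check, components) for check in checks]
        errors = [error for future in futures for error in future.result()]
    return [
        {"component": name, "kind": kind, "rule": WRITING_RULES[kind]}
        for kind, name in errors
    ]
```

The annotations returned by `review` correspond to the "detected error content" and "standard writing rule" pairs that the claims mark in the document; writing them into the file and delivering it to the user-specified path are left out of the sketch.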
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/620,817 US11914968B2 (en) | 2020-06-10 | 2020-12-11 | Official document processing method, device, computer equipment and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010523793.0 | 2020-06-10 | ||
CN202010523793.0A CN111680634B (zh) | 2020-06-10 | 2020-06-10 | Official document processing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021121158A1 (zh) | 2021-06-24 |
Family
ID=72435411
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/135718 WO2021121158A1 (zh) | 2020-06-10 | 2020-12-11 | Official document processing method, device, computer equipment and storage medium |
Country Status (3)
Country | Link |
---|---|
US (1) | US11914968B2 (zh) |
CN (1) | CN111680634B (zh) |
WO (1) | WO2021121158A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887361A (zh) * | 2021-09-23 | 2022-01-04 | 苏州浪潮智能科技有限公司 | Document proofreading method, system, storage medium and device |
CN114782029A (zh) * | 2022-06-20 | 2022-07-22 | 北京圣博润高新技术股份有限公司 | Document review method, system, computer device and storage medium |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
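CN111428367B (zh) * | 2020-03-25 | 2023-08-15 | 无锡先导智能装备股份有限公司 | Workpiece installation position detection method, apparatus, computer device and storage medium |
CN111680634B (zh) * | 2020-06-10 | 2023-08-01 | 平安科技(深圳)有限公司 | Official document processing method, device, computer equipment and storage medium |
CN112363981A (zh) * | 2020-11-13 | 2021-02-12 | 长城计算机软件与系统有限公司 | Automatic error correction method and system for LDIF files |
CN113435854A (zh) * | 2021-07-05 | 2021-09-24 | 北京致远互联软件股份有限公司 | Method and device for intelligent signing for receipt of official documents |
CN113704498A (zh) * | 2021-09-01 | 2021-11-26 | 云知声(上海)智能科技有限公司 | Intelligent review method and system for documents |
CN114169294A (zh) * | 2021-11-30 | 2022-03-11 | 中国电子科技集团公司第十五研究所 | Method and system for automatically generating office documents based on adversarial networks |
CN117151073B (zh) * | 2023-08-12 | 2024-10-18 | 上海东方怡动信息技术有限公司 | Document issuance review method, apparatus and storage medium |
CN117829116A (zh) * | 2023-12-27 | 2024-04-05 | 青矩技术股份有限公司 | Document adjustment method, apparatus, device and storage medium |
CN118468811A (zh) * | 2024-07-15 | 2024-08-09 | 江苏中威科技软件系统有限公司 | Method for standardizing format files through machine learning |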
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763343A (zh) * | 2008-12-23 | 2010-06-30 | 上海晨鸟信息科技有限公司 | Principle and method of a document editor supporting format comparison and plagiarism checking |
CN106294568A (zh) * | 2016-07-27 | 2017-01-04 | 北京明朝万达科技股份有限公司 | Method and system for generating Chinese text classification rules based on a BP network |
CN108984518A (zh) * | 2018-06-11 | 2018-12-11 | 人民法院信息技术服务中心 | Text classification method for judgment documents |
US20190236102A1 (en) * | 2018-01-29 | 2019-08-01 | Planet Data Solutions | System and method for differential document analysis and storage |
CN111680634A (zh) * | 2020-06-10 | 2020-09-18 | 平安科技(深圳)有限公司 | Official document processing method, device, computer equipment and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2571378C2 (ru) * | 2013-12-18 | 2015-12-20 | Общество с ограниченной ответственностью "Аби Девелопмент" | Device and method for finding differences in documents |
CN108664473A (zh) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Method for identifying key information in text, electronic device and readable storage medium |
CN110633461B (zh) * | 2019-09-10 | 2024-01-16 | 北京百度网讯科技有限公司 | Document detection and processing method, apparatus, electronic device and storage medium |
WO2021086837A1 (en) * | 2019-10-29 | 2021-05-06 | Woolly Labs, Inc. Dba Vouched | System and methods for authentication of documents |
CN111090986A (zh) * | 2019-11-29 | 2020-05-01 | 福建亿榕信息技术有限公司 | Method for correcting errors in official documents |
2020
- 2020-06-10 CN CN202010523793.0A patent/CN111680634B/zh active Active
- 2020-12-11 WO PCT/CN2020/135718 patent/WO2021121158A1/zh active Application Filing
- 2020-12-11 US US17/620,817 patent/US11914968B2/en active Active
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
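CN113887361A (zh) * | 2021-09-23 | 2022-01-04 | 苏州浪潮智能科技有限公司 | Document proofreading method, system, storage medium and device |
CN113887361B (zh) * | 2021-09-23 | 2024-01-09 | 苏州浪潮智能科技有限公司 | Document proofreading method, system, storage medium and device |
CN114782029A (zh) * | 2022-06-20 | 2022-07-22 | 北京圣博润高新技术股份有限公司 | Document review method, system, computer device and storage medium |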
Also Published As
Publication number | Publication date |
---|---|
US11914968B2 (en) | 2024-02-27 |
US20220414345A1 (en) | 2022-12-29 |
CN111680634A (zh) | 2020-09-18 |
CN111680634B (zh) | 2023-08-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021121158A1 (zh) | 公文文件处理方法、装置、计算机设备及存储介质 | |
WO2022105122A1 (zh) | 基于人工智能的答案生成方法、装置、计算机设备及介质 | |
US11972201B2 (en) | Facilitating auto-completion of electronic forms with hierarchical entity data models | |
WO2020147238A1 (zh) | 关键词的确定方法、自动评分方法、装置、设备及介质 | |
US20180005117A1 (en) | Corpus Quality Analysis | |
JP2022547750A (ja) | クロスドキュメントインテリジェントオーサリングおよび処理アシスタント | |
US10503830B2 (en) | Natural language processing with adaptable rules based on user inputs | |
CN109460552B (zh) | 基于规则和语料库的汉语语病自动检测方法及设备 | |
US9224103B1 (en) | Automatic annotation for training and evaluation of semantic analysis engines | |
US11526692B2 (en) | Systems and methods for domain agnostic document extraction with zero-shot task transfer | |
US20090248400A1 (en) | Rule Based Apparatus for Modifying Word Annotations | |
CN111460131A (zh) | 公文摘要提取方法、装置、设备及计算机可读存储介质 | |
US20130031098A1 (en) | Mismatch detection system, method, and program | |
CN111259262A (zh) | 一种信息检索方法、装置、设备及介质 | |
CN115049508A (zh) | 页面生成方法、装置、电子设备及存储介质 | |
CN117707922A (zh) | 测试用例的生成方法、装置、终端设备和可读存储介质 | |
US20170154029A1 (en) | System, method, and apparatus to normalize grammar of textual data | |
CN113705198B (zh) | 场景图生成方法、装置、电子设备及存储介质 | |
US11880798B2 (en) | Determining section conformity and providing recommendations | |
JP2012212329A (ja) | テキストデータの冗長性を解析する情報解析装置 | |
Duran et al. | Some issues on the normalization of a corpus of products reviews in Portuguese | |
CN112529743A (zh) | 合同要素抽取方法、装置、电子设备及介质 | |
CN113050933B (zh) | 脑图数据处理方法、装置、设备及存储介质 | |
CN112989820B (zh) | 法律文书定位方法、装置、设备及存储介质 | |
Miyao et al. | Evaluating textual entailment recognition for university entrance examinations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20901373; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20901373; Country of ref document: EP; Kind code of ref document: A1 |