CN116522885A - Standardized file processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN116522885A CN202310380755.8A
- Authority
- CN
- China
- Prior art keywords
- file
- standardized
- standardized file
- standard
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/186—Templates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to a standardized file processing method and device, electronic equipment, and a storage medium, applied to the technical field of artificial intelligence. The method comprises the following steps: acquiring a generation request, wherein the generation request comprises request content; determining a target file template corresponding to the request content; inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using standardized files as training samples; and generating a target standardized file according to the output result and the target file template. The method solves the problem that the prior art can only correct the format of a standardized file and cannot write its text content.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing a standardized file, an electronic device, and a storage medium.
Background
A standardized file is "a file formulated through standardization activities". Standardized files are one type of normative document; they differ from other normative documents mainly in that they are produced through standardization activities.
Drafting a standardized file depends on experience: experienced technicians can provide better written content. At the same time, standardized files are themselves subject to norms, and only a standardized file compiled under the guidance of a standardized framework and standardized content can be called a good standard.
For writing standardized files, the related art uses standardized file writing software based on Word plug-ins. Such software addresses only the format of the standardized file, that is, basic font size, paragraphs, typesetting, and the like, and therefore cannot write the text content of the standardized file.
Thus, there is a need for a method that can generate standardized files.
Disclosure of Invention
The application provides a processing method, a processing device, electronic equipment and a storage medium for a standardized file, which are used for solving the problem that in the prior art, only the format of the standardized file can be corrected, but the text content of the standardized file cannot be written.
In a first aspect, an embodiment of the present application provides a method for processing a standardized file, including:
acquiring a generation request, wherein the generation request comprises request content;
determining a target file template corresponding to the request content;
inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using a standardized file as a training sample;
and generating the target standardized file of the generation request according to the output result and the target file template.
Optionally, the determining the target file template corresponding to the request content includes:
extracting a target standard identifier in the request content;
and determining the target file template corresponding to the target standard identifier from the corresponding relation between the pre-established file template and the standard identifier.
Optionally, the training process of the standardized file generation model includes:
acquiring the training sample, wherein the training sample comprises standardized files of at least one standard type and the standard identifier corresponding to each standardized file, and the standard identifiers are used for indicating file characteristics of the standardized files;
and sequentially inputting the standard identifiers into an initial network model, generating written file contents through the initial network model, calculating a loss value based on the written file contents and the contents in the standardized file, and training based on the loss value to obtain the standardized file generation model.
Optionally, the acquiring the training sample includes:
the training sample is obtained by using Python to crawl a standardized file set, and initial weights of the standardized files in the standardized file set are configured according to their standard types;
after the written file content is generated through the initial network model, the method further comprises the following steps:
acquiring a preset number of written file sets recently generated by the initial network model;
calculating the proportion of written files of each standard type in the written file set;
and adjusting the initial weight based on the proportion to obtain a target weight, and training according to the standardized files with the target weight to obtain the standardized file generation model.
Optionally, the method further comprises:
obtaining a standardized file set, wherein the standardized file set comprises at least one standardized file of a standard type;
extracting a standard identification set of each standardized file;
constructing a knowledge graph among the standardized files based on the standard identification set;
and constructing a standard file library based on the standardized file set and the knowledge graph.
Optionally, the method further comprises:
acquiring a query request, wherein the query request comprises query content;
determining a standardized file subset matched with the query content from the standard file library, wherein the standardized file subset comprises at least one candidate standardized file;
and displaying navigation information of the candidate standardized files, wherein the navigation information comprises the number of candidate standardized files, a keyword knowledge graph of specific content in the standardized file subset, and navigation content for each standardized file, including its scope information, normative reference file information, and text chapter title information.
Optionally, the method further comprises:
acquiring a file to be verified;
determining the similarity between the file to be verified and the standardized file;
if the obtained similarity is greater than a first preset value, determining that the verification result is that writing is not recommended;
if the obtained similarity is less than a second preset value, determining that the verification result is that writing is recommended;
and if the obtained similarity is greater than the second preset value and less than the first preset value, determining that the verification result is that modification is recommended.
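The three-way verification above can be sketched as follows. This is a minimal illustration only: the similarity metric (Python's `difflib.SequenceMatcher`), the threshold values, and the function names are assumptions, since the text does not specify how similarity is computed or what the preset values are. Similarities exactly equal to a preset value are not covered by the text; this sketch treats them as "modification recommended".

```python
from difflib import SequenceMatcher

# Hypothetical preset values; the text only implies first > second.
FIRST_PRESET = 0.8
SECOND_PRESET = 0.3


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def verify(candidate: str, existing_files: list) -> str:
    """Compare a file to be verified against existing standardized files."""
    # Use the highest similarity to any existing file (an assumption).
    best = max((similarity(candidate, doc) for doc in existing_files), default=0.0)
    if best > FIRST_PRESET:
        return "writing not recommended"
    if best < SECOND_PRESET:
        return "writing recommended"
    # Everything in between, including the boundary values.
    return "modification recommended"
```

A near-duplicate of an existing standard is rejected, an unrelated draft is accepted, and a partially overlapping draft is flagged for modification.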
In a second aspect, an embodiment of the present application provides a processing apparatus for a standardized file, including:
the acquisition module is used for acquiring a generation request, wherein the generation request comprises request content;
the determining module is used for determining a target file template corresponding to the request content;
the input module is used for inputting the request content into a standardized file generation model to obtain an output result, and the standardized file generation model is trained by using a standardized file as a training sample;
and the generating module is used for generating the target standardized file requested by the generating request according to the output result and the target file template.
In a third aspect, an embodiment of the present application provides an electronic device, including: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to execute the program stored in the memory, and implement the method for processing a standardized file according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements the method for processing a standardized file according to the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: according to the method provided by the embodiment of the application, the generation request is obtained, and the generation request comprises request content; determining a target file template corresponding to the request content; inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using a standardized file as a training sample; and generating a target standardized file according to the output result and the target file template. When a user needs to generate a new standardized file, a request is generated, a target file template can be determined by determining the request content in the request, then an output result generated by the standardized file generation model is used as file content, and a target standardized file is obtained based on the target file template and the output result, so that the writing of the standardized file content is realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is an application scenario diagram of a method for processing a standardized file according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for processing a standardized file according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for processing a standardized file according to another embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for processing a standardized file according to another embodiment of the present application;
FIG. 5 is a flowchart of a method for processing a standardized file according to another embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for processing a standardized file according to another embodiment of the present application;
FIG. 7 is a block diagram of a standardized file processing apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
An embodiment of the application provides a method for processing a standardized file. Optionally, in the embodiment of the present application, the processing method of the standardized file may be applied to a hardware environment formed by the terminal 101 and the server 102 as shown in FIG. 1. As shown in FIG. 1, the server 102 is connected to the terminal 101 through a network and may be used to provide services (such as application services) to the terminal or to clients installed on the terminal. A database may be provided on the server or independently of the server to provide data storage services to the server 102. The terminal 101 may be, but is not limited to, a PC, a mobile phone, a tablet computer, or the like.
The processing method of the standardized file in the embodiment of the present application may be executed by the server 102, by the terminal 101, or jointly by the server 102 and the terminal 101. When the terminal 101 executes the method, the method may also be executed by a client installed on the terminal.
Taking a terminal executing the method for processing the standardized file in the embodiment of the present application as an example, fig. 2 is a schematic flow chart of an alternative method for processing the standardized file according to the embodiment of the present application, as shown in fig. 2, the flow of the method may include the following steps:
Step 201, obtaining a generation request, wherein the generation request comprises request content.
In some embodiments, the generation request may be generated after the user enters the requested content within a displayed request box of the terminal.
The request content may include, but is not limited to, keywords of the standardized file requested to be generated, or text containing such keywords. When the request content is text, keywords may also be extracted from it using a keyword matching algorithm.
For example, the keywords, or the text including the keywords, may be the names of required chapters of the standardized file or generic chapter names. The names of required chapters may include, but are not limited to: introduction, scope, and normative references. Generic chapter names may include, but are not limited to: general principles, general requirements, non-functional requirements, technical requirements, etc.
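Keyword extraction from free-text request content can be sketched as below. The chapter names are English renderings of those listed in the text; the case-insensitive substring matching rule and the function name `extract_keywords` are assumptions, since the text does not name a specific keyword matching algorithm.

```python
# Chapter names rendered from the text; the matching rule is an assumption.
REQUIRED_CHAPTERS = ["introduction", "scope", "normative references"]
GENERIC_CHAPTERS = [
    "general principles",
    "general requirements",
    "non-functional requirements",
    "technical requirements",
]


def extract_keywords(request_content: str) -> list:
    """Return the known chapter names that occur in the request text."""
    text = request_content.lower()
    return [name for name in REQUIRED_CHAPTERS + GENERIC_CHAPTERS if name in text]
```

For instance, a request such as "Write the Scope and Technical Requirements for a blockchain standard" yields the keywords `scope` and `technical requirements`.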
It can be understood that, to enable the model to draw on standardized files from the relevant field when producing output and thereby improve the accuracy and quality of the text, the user may be asked to supplement the standard identifier information when providing keywords as in this example.
For example, the requested content may be a standard identification indicating a file characteristic of the standardized file, wherein the standard identification includes an application field of the standardized file, a file type, and a class number.
File types may include, but are not limited to: international standards, national standards, industry standards, local standards, group standards, and enterprise standards. Classification numbers may include, but are not limited to, those of the International Classification for Standards (ICS) and the Chinese Classification for Standards (CCS).
Step 202, determining a target file template corresponding to the request content.
In some embodiments, determining the target file template makes the resulting target standardized file conform more closely to the file generation specification and improves the accuracy of file generation.
In an optional embodiment, the determining the target file template corresponding to the request content includes:
extracting a target standard identifier in the request content;
and determining the target file template corresponding to the target standard identifier from the corresponding relation between the pre-established file template and the standard identifier.
In some embodiments, the corresponding relation between the file template and the standard identifier is pre-constructed, and the standard identifier is extracted from the request content, so that the target file template is determined by using the corresponding relation.
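The template lookup can be sketched as follows. This is only an illustration under stated assumptions: the request content is taken to be a dictionary, the correspondence is keyed by file type alone, and the names `StandardIdentifier`, `TEMPLATES`, `extract_identifier`, `find_template`, and the template paths are all hypothetical. The text says only that a correspondence between file templates and standard identifiers is established in advance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardIdentifier:
    application_field: str  # e.g. "blockchain"
    file_type: str          # e.g. "industry standard"
    class_number: str       # ICS or CCS classification number


# Hypothetical pre-established correspondence (file type -> template path).
TEMPLATES = {
    "national standard": "templates/national_standard.txt",
    "industry standard": "templates/industry_standard.txt",
}


def extract_identifier(request_content: dict) -> StandardIdentifier:
    """Extract the target standard identifier from the request content."""
    return StandardIdentifier(
        application_field=request_content["application_field"],
        file_type=request_content["file_type"],
        class_number=request_content["class_number"],
    )


def find_template(identifier: StandardIdentifier) -> Optional[str]:
    """Look up the target file template; None if no template is registered."""
    return TEMPLATES.get(identifier.file_type)
```

A richer correspondence could key on the application field or classification number as well; keying on file type alone keeps the sketch short.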
Step 203, inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using standardized files as training samples.
In some embodiments, the trained standardized file generation model produces an output result from the request content. The output result comprises the content of the standardized file other than the file template content, including the introduction, scope, normative references, principles/general requirements, technical/functional requirements, references, and the like.
In an alternative embodiment, the training process of the standardized file generation model includes:
acquiring the training sample, wherein the training sample comprises standardized files of at least one standard type and the standard identifier corresponding to each standardized file, and the standard identifiers are used for indicating file characteristics of the standardized files;
and sequentially inputting the standard identifiers into an initial network model, generating written file contents through the initial network model, calculating a loss value based on the written file contents and the contents in the standardized file, and training based on the loss value to obtain the standardized file generation model.
In some embodiments, the terminal has a built-in Chinese natural language processing mechanism for standardized documents, used for information extraction, information retrieval, text mining, and text generation. National standards that govern standard drafting, such as GB/T 1.1-2020 (Directives for standardization, Part 1: Structure and drafting rules of standardized documents) and GB/T 1.2-2020 (Directives for standardization, Part 2: Drafting rules for standardized documents based on ISO/IEC standardized documents), serve as the basis of the training set. Training is supervised: the text content of these standard-drafting documents is the learning content, and the rules they prescribe serve as logical judgment conditions.
Furthermore, to improve the accuracy of the standardized file generation model, training is organized by field according to the ICS and CCS classifications (both classification numbers usually appear on the cover of a standardized file, which makes them easy for the system to acquire). Because different fields place different requirements on the content of standardized files, learning field by field improves the suitability of the model. In addition, training distinguishes the chapters, paragraphs, and list items of the standardized files, which prevents the model from generating text whose earlier and later parts do not cohere.
Optionally, a machine translation system, a question-answering system, and a dialogue system may access a third party natural language processing open interface as part of the present application to assist in generating standardized documents.
Furthermore, since the acquired training samples come from the network or from user uploads, natural language processing (NLP) can be applied to the acquired training samples to improve the training effect.
The electronic form of a standardized file generally falls into one of three types: a native PDF, converted from an editable document such as Word; an image PDF, produced by photographing, scanning, or similar means; or an editable document source file, which is the least common. Accordingly, when native PDFs and image PDFs are input as a text set, they are first converted from PDF to Chinese text; for international standardized files that have not been adopted as domestic standards, PDF-to-text conversion is followed by translation into Chinese.
In an alternative embodiment, the acquiring the training sample includes:
the training sample is obtained by using Python to crawl a standardized file set, and initial weights of the standardized files in the standardized file set are configured according to their standard types;
after the written file content is generated through the initial network model, the method further comprises the following steps:
acquiring a preset number of written file sets recently generated by the initial network model;
calculating the proportion of written files of each standard type in the written file set;
and adjusting the initial weight based on the proportion to obtain a target weight, and training according to the standardized files with the target weight to obtain the standardized file generation model.
In some embodiments, to prevent the model from learning from a training set that is low in quality, poorly usable, or non-normative in its use of standard terms, the invention provides initial weights and target weights. A weight represents how much of the data is absorbed into the model itself when it learns standards of the corresponding level.
In general, standardized documents rank in the following descending order of normativeness, importance, and accuracy of standard terms: international standardized documents (including publicly available specifications PAS, technical specifications TS, technical reports TR, industry technical agreements ITA, standards, etc.), national standards, industry standards, local standards, group standards, and enterprise standards. However, learning only national and industry standards would leave too few training samples and incomplete coverage, so all types of standards need to be learned.
Therefore, the initial weights for model learning are set in this application as follows: international standardized documents and/or national standards 50%, industry standards 30%, local standards 10%, group standards 8%, and enterprise standards 2%. In some fields the number of current standards at a given level may be 0. In that case the initial weights are corrected by assigning the missing percentage to the highest-level standard among the remaining levels whose count is not 0 in the standardized file subset, so that the weights again sum to 100%. For example, in the blockchain field the number of current national and international standards is 0, so the remaining weights sum to only 50%. After initial weight correction, the initial weights in the blockchain field are: industry standards 80%, local standards 10%, group standards 8%, and enterprise standards 2%.
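The weight-filling rule for levels with no standards can be sketched as below. The percentages and the blockchain example come from the text; the level names, the dictionary layout, and the function name `fill_initial_weights` are assumptions made for illustration.

```python
# Levels ordered from highest to lowest, with the initial weights from the text (percent).
LEVELS = ["international/national", "industry", "local", "group", "enterprise"]
INITIAL_WEIGHTS = {
    "international/national": 50,
    "industry": 30,
    "local": 10,
    "group": 8,
    "enterprise": 2,
}


def fill_initial_weights(counts: dict) -> dict:
    """Give the weight of empty levels to the highest level that has standards."""
    present = [level for level in LEVELS if counts.get(level, 0) > 0]
    if not present:
        raise ValueError("no standards available at any level")
    weights = {level: INITIAL_WEIGHTS[level] for level in present}
    # Whatever is missing from 100% goes to the highest remaining level.
    weights[present[0]] += 100 - sum(weights.values())
    return weights
```

With the blockchain example from the text (no international or national standards), the industry level absorbs the missing 50% and ends at 80%, while the local, group, and enterprise weights stay at 10%, 8%, and 2%.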
Further, whether or not weight filling has occurred, the initial weight (a) is also corrected when the number (n) of standards at the highest level is extremely small, that is, less than 30% of the number (N) of standards at the level with the most standards in the field (for example, 1 industry standard and 10 group standards). The correction method is as follows:
the floating weight W is: [formula missing from the source text]
corrected weight of the highest-level standard = a_n - W;
corrected weight of the level with the most standards = a_N + W.
Because the model keeps learning and improving, training by the initial weights alone may not remain accurate. The invention therefore provides target weights, which are correction weights added on the basis of the initial weights as learning proceeds over time. After the model has been trained for a period of time, x fields (x is a positive integer greater than 1) are selected automatically at random, m standardized files (m is a positive integer greater than 1) are generated automatically in natural language, and the initial weights are adjusted automatically by comparing the expression in the generated files with the standard expression of each level.
Illustratively, when the number of standards at every level is not 0 and no initial weight correction has been performed, the corrected weight of the standardized files of each level may fall within the following ranges:
international standards and national standards (40%-60%), industry standards (20%-40%), local standards (1%-20%), group standards (1%-18%), enterprise standards (1%-12%).
Further, to prevent the model from being over-corrected, the adjustment of the target weight is limited to within ±10%, and when the adjustment ratio approaches 0 it is set to 1%.
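One possible reading of the target-weight step is sketched below: compute the share of each standard type among the most recently generated files, then move each weight toward that share by at most 10 percentage points. The direction of the adjustment, the interpretation of ±10% as percentage points, and all names here are assumptions; the text does not give the exact rule, and the special handling of adjustments close to 0 is not modeled.

```python
from collections import Counter

MAX_STEP = 10.0  # assumed: the ±10% limit read as percentage points


def type_shares(recent_types: list) -> dict:
    """Share (in percent) of each standard type among recently generated files."""
    counts = Counter(recent_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: 100.0 * n / total for t, n in counts.items()}


def clamp(raw: float) -> float:
    """Limit a raw adjustment to the allowed range."""
    return max(-MAX_STEP, min(MAX_STEP, raw))


def target_weights(initial: dict, shares: dict) -> dict:
    """Assumed rule: shift each weight toward the observed share, capped per step."""
    return {t: w + clamp(shares.get(t, 0.0) - w) for t, w in initial.items()}
```

Capping each step keeps a single batch of generated files from swinging the weights sharply, which matches the stated goal of avoiding over-correction.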
Step 204, generating the target standardized file of the generation request according to the output result and the target file template.
In some embodiments, when a user needs a new standardized file, a generation request is issued, and the target file template is determined from the request content in that request. The output result produced by the standardized file generation model is then used as the file content, and the target standardized file is obtained from the target file template and the output result, thereby completing the writing of the standardized file content.
It can be understood that, during training of the standardized file generation model, the standardized files are classified, learned and trained by chapter, paragraph and column item. Each output result of the model can therefore be emitted in order of chapter, paragraph and column item and placed into the file template.
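As a rough illustration of placing chapter-ordered output into a file template, the following sketch assumes that a template is simply an ordered list of chapter titles and that the model output maps each chapter title to its paragraphs. Both data shapes and the name `fill_template` are hypothetical; the application does not specify them.

```python
from typing import Dict, List


def fill_template(template: List[str], output: Dict[str, List[str]]) -> str:
    """Place model output into a chapter template, chapter by chapter.

    template -- ordered chapter titles (assumed representation)
    output   -- paragraphs produced for each chapter title (assumed representation)
    """
    lines: List[str] = []
    for number, title in enumerate(template, start=1):
        # Chapter heading in template order, numbered from 1.
        lines.append(f"{number} {title}")
        # Paragraphs for this chapter, if the model produced any.
        lines.extend(output.get(title, []))
    return "\n".join(lines)
```

Chapters for which the model produced nothing are kept as empty headings, so the structure of the template is preserved in the generated file.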
In an alternative embodiment, the method for processing the standardized file of the present application further includes:
obtaining a standardized file set, wherein the standardized file set comprises at least one standardized file of a standard type;
extracting a standard identification set of each standardized file;
constructing a knowledge graph among the standardized files based on the standard identification set;
and constructing a standard file library based on the standardized file set and the knowledge graph.
In some embodiments, by analyzing a massive number of standardized files and building a standard file library, feedback can be given promptly when a user needs to query standardized files, avoiding the inaccurate results and insufficient search text that arise when the user searches web pages.
The knowledge graph can be built by using natural language processing (NLP) to cluster all the standard identifiers and the keywords in the specific texts of any field.
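One simple way to picture a knowledge graph built from standard identification sets is to link any two standardized files that share identifiers. The sketch below does only that and is not the NLP clustering mentioned in the text; the function name `build_graph`, the file names, and the edge representation are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def build_graph(id_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Link standardized files that share at least one standard identifier.

    id_sets -- identifier set extracted from each file (assumed representation)
    Returns a mapping from a pair of file names to the identifiers they share.
    """
    edges: Dict[Tuple[str, str], Set[str]] = {}
    # Sorting makes each pair appear once, in a deterministic order.
    for a, b in combinations(sorted(id_sets), 2):
        shared = id_sets[a] & id_sets[b]
        if shared:
            edges[(a, b)] = shared
    return edges
```

A standard file library could then store the files together with these edges, so that a later query can retrieve a file and its neighbours in one step.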
In an alternative embodiment, the method for processing the standardized file of the present application further includes:
acquiring a query request, wherein the query request comprises query content;
determining a standardized file subset matched with the query content from the standard file library, wherein the standardized file subset comprises at least one candidate standardized file;
and displaying navigation information of the candidate standardized files, wherein the navigation information comprises the number of the candidate standardized files, sub-knowledge graphs among the candidate standardized files and navigation content of each standardized file, and the sub-knowledge graphs are established based on the standard identifiers of the candidate standardized files.
In some embodiments, the navigation content may include, but is not limited to, scope information, normative reference document information, and body chapter heading information. Based on the query request issued by the user, a standardized file subset satisfying the request is determined from the standard file library, and the navigation information of that subset is displayed visually, so that the user can see more intuitively which standardized files meet their needs.
For example, if the query content includes the keyword "blockchain", all current blockchain standards (i.e., the standardized file subset) can be queried. The specific content of each standardized file is processed with natural language processing and abstracted into keywords such as "smart contracts" and "props", and the graph of the queried field is displayed at the front end as these abstracted keywords. The user can then click a displayed keyword, after which all the specific standards behind it and their corresponding items are displayed.
The navigation information may include the number of all related domain standardized files corresponding to the domain, keywords and/or standard text required by the user, the emphasis point of the related domain standard text content (the emphasis point may be obtained by the text similarity), the keyword map formulated by the related domain, the number of the adjacent domain standardized files, and so on.
Further, in addition to keyword queries, key content queries may also be supported. In this application, queries are divided into exact queries and fuzzy queries. An exact query returns only standardized files in which the required text explicitly occurs. The fuzzy query is divided into a field fuzzy query (the required keyword information is abstracted into the fields specified by ICS and CCS, and all relevant standardized file texts are listed) and a text fuzzy query (NLP is used to automatically screen out the standardized file texts to which similar text belongs).
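The difference between an exact query and a text fuzzy query can be sketched as follows. This is only an illustration under stated assumptions: the library is a plain mapping from file name to text, and character-level matching with Python's standard `difflib.SequenceMatcher` stands in for the NLP similarity that the application refers to. The function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def exact_query(library: Dict[str, str], text: str) -> List[str]:
    """Return files in which the required text explicitly occurs."""
    return [name for name, body in library.items() if text in body]


def fuzzy_text_query(library: Dict[str, str], text: str,
                     threshold: float = 0.8) -> List[str]:
    """Return files containing a sentence similar to the required text.

    threshold -- hypothetical similarity cutoff, not taken from the application
    """
    hits: List[str] = []
    needle = text.lower()
    for name, body in library.items():
        # Compare the query against each sentence and keep the best score.
        sentences = [s.strip().lower() for s in body.split(".") if s.strip()]
        best = max((SequenceMatcher(None, needle, s).ratio() for s in sentences),
                   default=0.0)
        if best >= threshold:
            hits.append(name)
    return hits
```

With this split, a query phrased slightly differently from the standard text returns nothing from `exact_query` but can still be found by `fuzzy_text_query`.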
Referring to fig. 3, when the processing method of the standardized file in the application is adopted for query, the method specifically may include: submitting keywords/fields/technical details by a user; standardized text intent understanding; generating a text data search field; generating a preliminary standardized file text according to the importing rules and the standardized file text which is learned and referred; semantic modification; the reference text is provided to the user.
In an alternative embodiment, the method for processing the standardized file of the present application further includes:
acquiring a file to be verified;
determining the similarity between the file to be verified and the standardized file;
if the obtained similarity is larger than a first preset value, determining that the verification result is not recommended to be written;
if the obtained similarity is smaller than a second preset value, determining that the verification result is recommended writing;
and if the obtained similarity is larger than a second preset value and smaller than the first preset value, determining that the verification result is the recommended modification.
In some embodiments, the file to be verified may be a target standardized file obtained by the method in steps 201 to 204, or may be a file that the user wrote and uploaded.
For example, when the file to be verified is a target standardized file, referring to fig. 4, the number of learned texts underlying each text paragraph of the target standardized file is determined, together with its percentage of all the standards referred to in generating the standardized file, and this percentage is taken as the similarity.
For text paragraphs with a percentage above 50%, the system gives a red alarm prompt such as "writing in this direction is not recommended" and lists the standard names and numbers of the corresponding standardized files, so that the standard writer can conveniently cite those standards normatively; for text paragraphs between 30% and 50%, the system gives a yellow prompt that issues such as repetition should be considered during writing; for text paragraphs below 30%, the system gives a green prompt that writing can proceed.
It will be appreciated that the above percentages may be set according to the actual situation and are not limited herein.
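The three-way decision described above can be sketched as a small function. The 50% and 30% values are the example thresholds from the text (the first and second preset values); the function names are hypothetical, and treating a similarity exactly equal to a threshold as the middle case is an assumption, since the text does not cover that boundary.

```python
from typing import Tuple


def paragraph_similarity(learned_count: int, referenced_total: int) -> float:
    """Share of all referenced standards that underlie one paragraph."""
    if referenced_total <= 0:
        raise ValueError("referenced_total must be positive")
    return learned_count / referenced_total


def verification_result(similarity: float,
                        first: float = 0.50,
                        second: float = 0.30) -> Tuple[str, str]:
    """Map a similarity to a prompt colour and a verification result."""
    if similarity > first:
        return ("red", "not recommended to write")
    if similarity < second:
        return ("green", "recommended to write")
    # Between the two preset values (boundaries included by assumption).
    return ("yellow", "recommended to modify")
```

Keeping the two thresholds as parameters reflects the statement that the percentages may be set according to the actual situation.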
Further, if the user does not need the standard of the domain/technical details, standard texts of the related domain are given to the user so that the user can be used as a reference, and the problem of repeatedly formulating standard contents is avoided.
For example, referring to fig. 5, when the file to be verified is a file that the user wrote and uploaded, the similarity between each text paragraph of the written file and the standardized files in the standard file library is determined, and suggestions are made based on the similarity. For the specific form of the suggestions, refer to the process used when the file to be verified is the target standardized file, which is not repeated here.
In one embodiment, referring to fig. 6, the method for processing the standardized file includes: recognizing standardized files by using computer vision (CV); automatically crawling the required standardized files by using Python; performing text conversion and machine translation on the crawled standardized files; setting the initial weights, the target weights, and the adjustable weight proportions of standardized files of different levels, importing the standardized file formulation rules as logic judgment rules, and reasonably setting the proportions of the training set, validation set and test set; importing the ICS and CCS field classifications, and importing chapter template information for standardized files of different levels; and starting machine learning and model generation, correction and use.
The method for processing the standardized file performs document conversion and translation, and learns standard terms correctly through natural language processing. By using computer vision (CV) to learn from a large number of files such as PDF images, pictures and documents of standardized files, it is possible to accurately recognize what constitutes a standardized file.
The method and system can also solve the inconvenience of querying standardized files. Whereas the prior art can only query standards of a single level (such as national or industry standards) separately, the present application fuses national, industry, local, group and enterprise standards synchronously and, along the vertical dimension of standards, provides all published standard text information for the query fields, technical details and standard items submitted by users. In addition, machine learning is performed with a massive number of standardized file texts as the training set, and standard writing specifications are input at the same time as logic judgments, so that for every chapter and column item of a standard text the system can provide navigation of the standardized file field, navigation of standardized file reference texts, descriptions of the writing rules of each chapter, and the like. Navigation of standardized file formulation suggestions can also be provided, which effectively avoids the current problems of repeated standard formulation and low usability. To ensure that the initial weight is scientifically sound, a target weight and a correction range are designed so that the system can correct the data weights within a certain range.
In addition, the formulation-suggestion processing in this method can give the standard writer quantitative suggestions on the direction of the standard and on specific content details: for paragraph content that references published standards at different percentages, the system suggests whether writing is recommended. This effectively reduces repetition, repeated formulation of the same detailed content, and low usability in the standard formulation process.
Based on the same conception, the embodiment of the present application provides a standardized document processing device, and the specific implementation of the device may be referred to the description of the embodiment of the method, and the repetition is omitted, as shown in fig. 7, where the device mainly includes:
an obtaining module 701, configured to obtain a generation request, where the generation request includes a request content;
a determining module 702, configured to determine a target file template corresponding to the requested content;
the input module 703 is configured to input the request content into a standardized file generation model, to obtain an output result, where the standardized file generation model is obtained by training using a standardized file as a training sample;
and the generating module 704 is configured to generate a target standardized file requested by the generation request according to the output result and the target file template.
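The cooperation of the four modules can be pictured with a small sketch. Everything here is hypothetical scaffolding: the class name, the request shape (a dictionary with a "content" key), the template lookup by substring, and the injected `model` callable, which stands in for the trained standardized file generation model.

```python
from typing import Callable, Dict, List


class StandardizedFileProcessor:
    """Illustrative stand-in for modules 701 to 704 (not the actual device)."""

    def __init__(self, templates: Dict[str, List[str]],
                 model: Callable[[str], Dict[str, str]]) -> None:
        self.templates = templates  # standard identifier -> chapter titles
        self.model = model          # placeholder for the generation model

    def acquire(self, request: Dict[str, str]) -> str:
        # Obtaining module: read the request content.
        return request["content"]

    def determine_template(self, content: str) -> List[str]:
        # Determining module: pick the template whose identifier appears.
        for identifier, template in self.templates.items():
            if identifier in content:
                return template
        raise LookupError("no template matches the request content")

    def generate(self, request: Dict[str, str]) -> str:
        content = self.acquire(request)
        template = self.determine_template(content)
        output = self.model(content)  # input module: run the model
        # Generating module: combine the output with the template.
        return "\n".join(f"{title}: {output.get(title, '')}" for title in template)
```

Passing the model in as a callable keeps the sketch independent of any particular training framework.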
Based on the same concept, the embodiment of the application also provides an electronic device, as shown in fig. 8, where the electronic device mainly includes: a processor 801, a memory 802, and a communication bus 803, wherein the processor 801 and the memory 802 complete communication with each other through the communication bus 803. The memory 802 stores a program executable by the processor 801, and the processor 801 executes the program stored in the memory 802 to implement the following steps:
acquiring a generation request, wherein the generation request comprises request content;
determining a target file template corresponding to the request content;
inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using a standardized file as a training sample;
and generating the target standardized file of the generation request according to the output result and the target file template.
The communication bus 803 mentioned in the above electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, abbreviated to PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated to EISA) bus, or the like. The communication bus 803 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
The memory 802 may include random access memory (Random Access Memory, simply RAM) or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor 801.
The processor 801 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present application, there is also provided a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to perform the method of processing a standardized file described in the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, microwave, etc.) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape, etc.), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method of processing a standardized document, comprising:
acquiring a generation request, wherein the generation request comprises request content;
determining a target file template corresponding to the request content;
inputting the request content into a standardized file generation model to obtain an output result, wherein the standardized file generation model is trained by using a standardized file as a training sample;
and generating the target standardized file of the generation request according to the output result and the target file template.
2. The method for processing the standardized file according to claim 1, wherein the determining the target file template corresponding to the requested content includes:
extracting a target standard identifier in the request content;
and determining the target file template corresponding to the target standard identifier from the corresponding relation between the pre-established file template and the standard identifier.
3. The method for processing the standardized file according to claim 1, wherein the training process of the standardized file generation model comprises the following steps:
the training sample is obtained, wherein the training sample comprises at least one standard type of standardized file and standard identifiers corresponding to the standardized files, and the standard identifiers are used for indicating file characteristics of the standardized files;
and sequentially inputting the standard identifiers into an initial network model, generating written file contents through the initial network model, calculating a loss value based on the written file contents and the contents in the standardized file, and training based on the loss value to obtain the standardized file generation model.
4. A method of processing a standardized file according to claim 3, wherein the acquiring the training sample comprises:
the training sample is obtained by using Python to crawl a standardized file set, and initial weights of the standardized files in the standardized file set are configured according to different standard types;
after the written file content is generated through the initial network model, the method further comprises the following steps:
acquiring a preset number of written file sets recently generated by the initial network model;
calculating the proportion of written files of each standard type in the written file set;
and adjusting the initial weight based on the proportion to obtain a target weight, and training according to the standardized files with the target weight to obtain the standardized file generation model.
5. The method for processing a standardized file according to claim 1, further comprising:
obtaining a standardized file set, wherein the standardized file set comprises at least one standardized file of a standard type;
extracting a standard identification set of each standardized file;
constructing a knowledge graph among the standardized files based on the standard identification set;
and constructing a standard file library based on the standardized file set and the knowledge graph.
6. The method for processing a standardized file of claim 5 further comprising:
acquiring a query request, wherein the query request comprises query content;
determining a standardized file subset matched with the query content from the standard file library, wherein the standardized file subset comprises at least one candidate standardized file;
and displaying navigation information of the candidate standardized files, wherein the navigation information comprises the number of the candidate standardized files, sub-knowledge graphs among the candidate standardized files and navigation content of each standardized file, and the sub-knowledge graphs are established based on the standard identifiers of the candidate standardized files.
7. The method for processing a standardized file according to claim 1, further comprising:
acquiring a file to be verified;
determining the similarity between the file to be verified and the standardized file;
if the obtained similarity is larger than a first preset value, determining that the verification result is not recommended to be written;
if the obtained similarity is smaller than a second preset value, determining that the verification result is recommended writing;
and if the obtained similarity is larger than a second preset value and smaller than the first preset value, determining that the verification result is the recommended modification.
8. A standardized document processing apparatus, comprising:
the acquisition module is used for acquiring a generation request, wherein the generation request comprises request content;
the determining module is used for determining a target file template corresponding to the request content;
the input module is used for inputting the request content into a standardized file generation model to obtain an output result, and the standardized file generation model is trained by using a standardized file as a training sample;
and the generating module is used for generating the target standardized file requested by the generating request according to the output result and the target file template.
9. An electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to execute a program stored in the memory, and implement the method for processing a standardized file according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of processing a standardized file according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310380755.8A CN116522885A (en) | 2023-04-10 | 2023-04-10 | Standardized file processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310380755.8A CN116522885A (en) | 2023-04-10 | 2023-04-10 | Standardized file processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116522885A (en) | 2023-08-01 |
Family
ID=87393206
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310380755.8A Pending CN116522885A (en) | 2023-04-10 | 2023-04-10 | Standardized file processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116522885A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118468811A (en) * | 2024-07-15 | 2024-08-09 | 江苏中威科技软件系统有限公司 | Method for realizing format file standardization through machine learning |
-
2023
- 2023-04-10 CN CN202310380755.8A patent/CN116522885A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||