CN113095061B - Method, system, device and storage medium for extracting document header - Google Patents

Publication number
CN113095061B
Authority
CN
China
Prior art keywords: document, hidden Markov model, header, text
Prior art date
Legal status
Active
Application number
CN202110344640.4A
Other languages
Chinese (zh)
Other versions
CN113095061A
Inventor
蓝建敏
李观春
Current Assignee
Excellence Information Technology Co ltd
Original Assignee
Excellence Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Excellence Information Technology Co ltd
Priority to CN202110344640.4A
Publication of CN113095061A
Application granted
Publication of CN113095061B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method, a system, a device and a storage medium for extracting a document header based on a hidden Markov model. The extraction method comprises: obtaining a document text, wherein the document text comprises a document title, a document genre and document content; extracting the document header from the document text by using a trained hidden Markov model; and obtaining the document header output by the trained hidden Markov model. By using the trained hidden Markov model to extract the document header from the document text, the application reduces the cost of manually compiling extraction rules and improves the accuracy of document header extraction; at the same time, errors in the document header can be automatically corrected. The application can be widely applied in the technical field of document header extraction.

Description

Method, system, device and storage medium for extracting document header
Technical Field
The application relates to the technical field of document header extraction, and in particular to a method, a system, a device and a storage medium for extracting document headers based on a hidden Markov model.
Background
An official document is written material produced by legal authorities and organizations in the course of official activities. A common official document format generally consists of attributes such as copy number, security level, secrecy period, degree of urgency, issuing-authority mark, document number, signer, title, main recipient, body text, attachment description, issuing-authority signature, issuing date, seal, remarks, attachments, copied-to organs, printing date and page number; these attributes constitute the document header. Document header extraction refers to using information extraction technology to convert an unstructured document into structured data according to these document attribute fields.
At present, technologies such as rule matching and position locating are generally adopted to extract document attributes such as the title and secrecy period from document files. However, extraction based on rule matching relies on a large number of rules, and compiling rules manually involves a heavy workload, narrow coverage of examples and a high manual learning cost. Moreover, in practice many documents do not strictly follow the national document standard, resulting in low quality of the extracted document attributes.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art. Therefore, the application provides a method, a system, a device and a storage medium for extracting a document head based on a hidden Markov model.
The technical scheme adopted by the application is as follows:
in one aspect, an embodiment of the present application provides a method for extracting a document header based on a hidden Markov model, the method including:
obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
extracting the document head of the document text by using the trained hidden Markov model;
and obtaining the document head output by the trained hidden Markov model.
Further, the extraction method further includes:
performing error correction processing on the extracted document header contents one by one by utilizing a trained document header correction model.
Further, the step of performing error correction processing on the extracted document header contents one by one by using the trained document header correction model includes:
checking whether the document header content is erroneous by using the trained document header correction model;
and if it is erroneous content, calculating the correct document header content by using a cosine similarity algorithm according to the erroneous content.
Further, the step of extracting the document header from the document text by using the trained hidden Markov model specifically includes:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as the observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
and obtaining the optimal state-labeling sequence from the word sequence and outputting it.
Further, the optimal state-labeling sequence is the document header attribute sequence.
Further, the extraction method further comprises training the hidden Markov model, including:
constructing a training sample set;
inputting the training sample set into a hidden Markov model for training;
learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
and outputting a trained hidden Markov model according to the parameters.
Further, the step of constructing a training sample set specifically includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres;
and carrying out sequence labeling on the document head extraction attribute to obtain a training sample set.
On the other hand, the embodiment of the application also comprises a document header extraction system based on a hidden Markov model, which comprises:
the first acquisition module, used for obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
the extraction module, used for extracting the document header of the document text by using the trained hidden Markov model;
and the second acquisition module, used for acquiring the document header output by the trained hidden Markov model.
On the other hand, the embodiment of the application also comprises a document head extraction device based on a hidden Markov model, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the extraction method.
In another aspect, embodiments of the present application further include a computer readable storage medium having stored thereon a processor executable program for implementing the extraction method when executed by a processor.
The beneficial effects of the application are as follows:
according to the application, the trained hidden Markov model is utilized to extract the document header of the document text, so that the manual learning cost can be reduced, and the accuracy of extracting the document header can be improved; meanwhile, errors in the document header can be automatically revised.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flowchart illustrating a method for extracting a document header based on a hidden Markov model according to an embodiment of the present application;
FIG. 2 is a flow chart of the document header content correction according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for extracting a document header based on a hidden Markov model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a hidden Markov model-based document head extraction procedure according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a document head extraction device based on a hidden markov model according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
In the description of the present application, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "a number of" means one or more and "a plurality of" means two or more; greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include it. The descriptions "first" and "second" are only for distinguishing technical features and should not be construed as indicating or implying relative importance, the number of technical features indicated, or the precedence of the technical features indicated.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Embodiments of the present application will be further described below with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application includes a method for extracting a document header based on a hidden markov model, including:
s1, acquiring a document text, wherein the document text comprises a document title, a document genre and document content;
s2, extracting a document header of the document text by using the trained hidden Markov model;
s3, acquiring the document head output by the trained hidden Markov model.
Regarding step S1, the text from which document header information is to be extracted is obtained; its content includes a document title, a document genre and document content.
Further, after step S1, that is, after the step of obtaining the document text, the extraction method further includes:
S101, preprocessing the document text by removing blank lines and merging broken lines and split sentences.
In this embodiment, after preprocessing is completed, the preprocessed document text is input into the trained hidden Markov model.
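The preprocessing in step S101 might be sketched as follows. The application does not specify the exact merging heuristic, so the sentence-terminator check below is an illustrative assumption:

```python
def preprocess_document(text):
    """Minimal preprocessing sketch: drop blank lines and merge lines
    that were broken mid-sentence (previous line has no terminal
    punctuation). Chinese and ASCII terminators are both checked;
    segments are joined without a space, as is natural for Chinese."""
    terminators = ("。", "：", "！", "？", ".", ":", "!", "?", ";", "；")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    merged = []
    for ln in lines:
        if merged and not merged[-1].endswith(terminators):
            merged[-1] += ln  # continue the broken sentence
        else:
            merged.append(ln)
    return merged
```

The merged line list can then be fed to the hidden Markov model as the observation sequence.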
Specifically, step S2, that is, the step of extracting the document header from the document text by using the trained hidden Markov model, specifically includes:
S201, obtaining the corresponding document header extraction attributes according to the genre of the document text;
S202, taking the document text as the observation input;
S203, mapping the document text into words in the word set by using a maximum path similarity method;
S204, outputting the mapped word sequence by using the Viterbi algorithm;
S205, obtaining the optimal state-labeling sequence from the word sequence and outputting it.
In this embodiment, the trained hidden Markov model is used to extract the document header from the input document text. First, according to the document genre, the corresponding document header extraction attributes are obtained as the determined states; then the input text is taken as the observation input and mapped to words in the word set by the maximum path similarity method; finally, the Viterbi algorithm decodes the mapped word sequence and outputs the optimal state-labeling sequence, namely the document header attribute sequence.
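As a concrete sketch of the decoding in steps S204 and S205, the standard Viterbi algorithm recovers the most likely state (header-attribute) sequence for an observation (word) sequence. The dictionary-based probability tables and the two-state toy model in the test are illustrative assumptions, not the patent's actual parameters:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Standard Viterbi decoding: return the most likely hidden-state
    sequence for an observation sequence.  start_p[s], trans_p[s1][s2]
    and emit_p[s][o] are the HMM probabilities pi, A and B as dicts."""
    # delta[s]: probability of the best path ending in state s so far
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        back, new_delta = {}, {}
        for s in states:
            # Best predecessor state for s at this step.
            prev = max(states, key=lambda p: delta[p] * trans_p[p][s])
            back[s] = prev
            new_delta[s] = delta[prev] * trans_p[prev][s] * emit_p[s][o]
        backpointers.append(back)
        delta = new_delta
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: delta[s])]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path
```

For long documents a production implementation would work in log-space to avoid underflow; the plain products above keep the sketch readable.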
Optionally, after the obtaining the document header output by the trained hidden markov model, the extracting method further includes:
s4, utilizing the trained document header correction model to perform error correction processing on the extracted document header content one by one.
In this embodiment, performing error correction processing on the extracted document header contents one by one by using the trained document header correction model specifically includes the following steps:
s401, checking whether the content of the document header is wrong or not by using a trained document header correction model;
s402, if the content is the error content, calculating to obtain the correct document header content by using a cosine similarity algorithm according to the error content.
Referring to fig. 2, the document header content correction flow is as follows:
1) Inputting the document header content to be corrected;
After the document header extraction model extracts the document header information, the extracted contents are input one by one for error checking and correction.
2) Applying the trained document header correction model to judge whether the content is erroneous;
The document header correction model is applied to check whether the input content is erroneous document header information.
3) Correcting the erroneous content;
If the verification in the previous step finds erroneous content, the cosine distance similarity algorithm is applied to compute the similarity between the erroneous content and candidate correct contents, and the most similar correct content is taken as the correction.
For different texts or short texts, a good way to calculate similarity is to map the words in the texts into a vector space, forming a mapping between the words and vector data, and to measure text similarity by the difference between the resulting vectors. The cosine distance, also called cosine similarity, uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are; hence the name "cosine similarity".
4) Outputting the correct document header content, namely outputting the corrected text content.
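The cosine-similarity correction above can be sketched as follows. Representing each string as a bag-of-characters vector is an illustrative choice, since the application does not specify the vectorization:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-character vectors;
    1.0 means identical direction, 0.0 means no shared characters."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def correct_header(candidate, known_correct_contents):
    """Return the known correct content most similar to the
    (possibly erroneous) candidate header content."""
    return max(known_correct_contents,
               key=lambda w: cosine_similarity(candidate, w))
```

A real correction model would draw `known_correct_contents` from the per-attribute vocabulary learned in training; the list in the test below is a stand-in.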
In this embodiment, the extraction method further includes training a hidden markov model, including:
p1, constructing a training sample set;
p2, inputting the training sample set into a hidden Markov model for training;
p3, learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
and P4, outputting a trained hidden Markov model according to the parameters.
In this embodiment, the hidden Markov model is trained. First, a brief description of the model: a hidden Markov model (Hidden Markov Model, HMM) is a probabilistic model of a time series, describing a Markov chain of hidden states together with the observations generated by each state, which form the observation sequence. The states in the model are not directly visible; only the observations generated by the states can be observed. The values of the states are determined by training the model parameters and then decoding with a recognition algorithm to obtain the state sequence.
In a hidden Markov model, the state sequence is determined by the state transition probability distribution A and the initial state probability distribution pi, and the observation sequence is determined by the generated state sequence and the observation probability distribution B: each state emits an observation through B. The parameters A, B and pi can be learned from training data.
The document attributes to be extracted are distributed in the document text with a certain order relation. In the hidden Markov model, the document header attributes form the state set and the values to be extracted form the observable text sequence; the text sequence is labeled to form a training set, and an ML algorithm is used to learn the trained model parameters A, B and pi from it.
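Assuming, as the context suggests, that the "ML algorithm" refers to maximum-likelihood estimation on fully labeled sequences, the parameters pi, A and B reduce to normalized counts. The function below is a minimal sketch under that assumption:

```python
from collections import Counter, defaultdict

def estimate_hmm_params(labeled_sequences):
    """Maximum-likelihood HMM parameter estimation from labeled
    sequences (each a list of (word, state) pairs): pi, A and B
    are obtained by counting occurrences and normalizing."""
    pi, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in labeled_sequences:
        pi[seq[0][1]] += 1                        # initial state count
        for word, state in seq:
            emit[state][word] += 1                # emission count
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1                    # transition count

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(pi),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})
```

In practice the counts would be smoothed so unseen words and transitions do not get zero probability; smoothing is omitted here for clarity.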
Step P1, namely the step of constructing a training sample set, specifically includes:
p101 collecting document corpus of various genres and analyzing document header attribute;
p102. determining the document head extraction attribute of the documents of various genres;
and P103, carrying out sequence labeling on the document head extraction attribute to obtain a training sample set.
In this embodiment, document corpora of various genres are collected and the document header attributes are analyzed to determine the document header extraction attributes, such as title, security level, secrecy period, degree of urgency and issuing-authority mark; then, sequence labeling is performed on a certain scale of document header attributes to obtain the training sample set.
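A sequence-labeled training sample then pairs each observed text segment with its header-attribute state. The English state names below are hypothetical stand-ins for the attributes listed above:

```python
# Hypothetical state set: English stand-ins for title, security level,
# secrecy period, degree of urgency and issuing-authority mark.
STATES = ["title", "security_level", "secrecy_period",
          "urgency", "issuing_authority"]

def make_sample(segments, tags):
    """Pair each observed text segment with its header-attribute state,
    yielding one sequence-labeled training sample for the HMM."""
    if len(segments) != len(tags):
        raise ValueError("segments and tags must align one-to-one")
    unknown = set(tags) - set(STATES)
    if unknown:
        raise ValueError(f"unknown states: {unknown}")
    return list(zip(segments, tags))
```

A corpus of such samples is exactly the input the parameter-estimation step consumes.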
Specifically, the hidden Markov model-based document header extraction method may refer to fig. 3, where the initial states include title, security level, secrecy period, degree of urgency, issuing-authority mark, and so on. Inputting the preprocessed training samples into the hidden Markov model constitutes the training part. For extraction, the text to be extracted is preprocessed and input into the trained hidden Markov model, which finally outputs the optimal state-labeling sequence, namely the document header attribute sequence. With this hidden Markov model-based method, an unstructured document is extracted into structured data according to the document header attribute fields.
Referring to fig. 4, this embodiment further provides a hidden Markov model-based document header extraction program for implementing the document header extraction method shown in fig. 1. The program specifically includes a labeling module, a learning and training module, a model library, a text preprocessing module, and a document header extraction module. The labeling module is used to label the attributes of the document corpus; the learning and training module is used to learn from the labeled corpus to obtain the document header extraction model, namely the hidden Markov model; the model library is used to store the trained model; the text preprocessing module is used to remove blank lines and merge broken lines and sentences in the input document text; and the document header extraction module is used to extract the document header attributes from the preprocessed text.
The document head extraction method based on the hidden Markov model has the following technical effects:
according to the embodiment of the application, the trained hidden Markov model is utilized to extract the document header of the document text, so that the manual learning cost can be reduced, and the accuracy of extracting the document header can be improved; meanwhile, errors in the document header can be automatically revised.
Referring to fig. 5, the embodiment of the present application further includes a document head extraction device 200 based on a hidden markov model, which specifically includes:
at least one processor 210;
at least one memory 220 for storing at least one program;
the at least one program, when executed by the at least one processor 210, causes the at least one processor 210 to implement the method as shown in fig. 1.
The memory 220 is used as a non-transitory computer readable storage medium for storing non-transitory software programs and non-transitory computer executable programs. Memory 220 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, memory 220 may optionally include remote memory located remotely from processor 210, which may be connected to processor 210 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It will be appreciated that the device structure shown in fig. 5 is not limiting of the device 200 and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components.
In the apparatus 200 shown in fig. 5, the processor 210 may retrieve the program stored in the memory 220 and perform, but is not limited to, the steps of the embodiment shown in fig. 1.
The above-described embodiment of the apparatus 200 is merely illustrative, in which the units illustrated as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment.
The embodiment of the present application also provides a computer-readable storage medium storing a processor-executable program for implementing the method shown in fig. 1 when executed by a processor.
Embodiments of the present application also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
It is to be understood that all or some of the steps, systems, and methods disclosed above may be implemented in software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
The embodiments of the present application have been described in detail with reference to the accompanying drawings, but the present application is not limited to the above embodiments, and various changes can be made within the knowledge of one of ordinary skill in the art without departing from the spirit of the present application.

Claims (4)

1. A document header extraction method based on a hidden Markov model, characterized by comprising the following steps:
obtaining a document text, wherein the document text comprises a document title, a document genre and document content;
extracting the document head of the document text by using the trained hidden Markov model;
obtaining a document head output by the trained hidden Markov model;
wherein the step of extracting the document header of the document text by using the trained hidden Markov model comprises the following steps:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
obtaining and outputting an optimal state-labeling sequence from the word sequence, wherein the optimal state-labeling sequence is a document header attribute sequence;
the extraction method further comprises training the hidden Markov model, and comprises the following steps:
constructing a training sample set;
inputting the training sample set into a hidden Markov model for training;
learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model;
outputting a trained hidden Markov model according to the parameters;
wherein the constructing a training sample set includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres, including title, security level, security period, emergency degree and document issuing authority mark;
performing sequence labeling on the document head extraction attribute to obtain a training sample set;
the extraction method further comprises the step of performing error correction processing on extracted document header contents one by using a trained document header correction model, and comprises the following steps:
checking whether the content of the document header is wrong or not by using a trained document header correction model;
and if the content is the error content, calculating to obtain the correct document header content by using a cosine similarity algorithm according to the error content.
2. A hidden Markov model-based document header extraction system, characterized by comprising:
the first acquisition module is used for acquiring a document text, wherein the document text comprises a document title, a document genre and document content;
the extraction module is used for extracting the document head of the document text by using the trained hidden Markov model;
the second acquisition module acquires a document head output by the trained hidden Markov model;
the implementation process of the extraction module comprises the following steps:
obtaining the corresponding document header extraction attributes according to the genre of the document text;
taking the document text as observation input;
mapping the document text into words in a word set by using a maximum path similarity method;
outputting the mapped word sequence by using a Viterbi algorithm;
obtaining and outputting an optimal state-labeling sequence from the word sequence, wherein the optimal state-labeling sequence is a document header attribute sequence;
the extraction system further comprises a training module for: constructing a training sample set; inputting the training sample set into a hidden Markov model for training; learning and training by adopting an ML algorithm to obtain parameters of a hidden Markov model; outputting a trained hidden Markov model according to the parameters;
wherein the constructing a training sample set includes:
collecting document corpus of various genres and analyzing document header attribute;
determining the document head extraction attribute of documents of various genres, including title, security level, security period, emergency degree and document issuing authority mark;
performing sequence labeling on the document head extraction attribute to obtain a training sample set;
the extraction system further comprises a module for performing error correction, item by item, on the extracted document header contents using a trained document header correction model, the implementation process of this module comprising: checking, with the trained document header correction model, whether the document header content is erroneous; and if the content is erroneous, computing the correct document header content from the erroneous content using a cosine similarity algorithm.
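By way of illustration only (not part of the claimed subject matter; the character-frequency vectorization and the candidate vocabulary below are assumptions made for the sketch), a cosine-similarity correction step can pick the reference value closest to the erroneous content:

```python
# Illustrative sketch: correct a suspect header value by choosing the
# candidate whose character-frequency vector is most cosine-similar to it.
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two strings as character-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in set(ca) & set(cb))
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def correct(value, vocabulary):
    """Return the vocabulary entry most similar to the erroneous value."""
    return max(vocabulary, key=lambda cand: cosine(value, cand))
```

For example, a misspelled security-level value would map to the nearest entry of a reference vocabulary of valid header values.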
3. A hidden-Markov-model-based document header extraction device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the extraction method as claimed in claim 1.
4. A computer-readable storage medium, characterized in that it stores a processor-executable program which, when executed by a processor, implements the extraction method according to claim 1.
CN202110344640.4A 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header Active CN113095061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110344640.4A CN113095061B (en) 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header

Publications (2)

Publication Number Publication Date
CN113095061A CN113095061A (en) 2021-07-09
CN113095061B true CN113095061B (en) 2023-08-29

Family

ID=76671834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110344640.4A Active CN113095061B (en) 2021-03-31 2021-03-31 Method, system, device and storage medium for extracting document header

Country Status (1)

Country Link
CN (1) CN113095061B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572849A (en) * 2014-12-17 2015-04-29 西安美林数据技术股份有限公司 Automatic standardized filing method based on text semantic mining
CN105912570A (en) * 2016-03-29 2016-08-31 北京工业大学 English resume key field extraction method based on hidden Markov model
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108804414A (en) * 2018-05-04 2018-11-13 科沃斯商用机器人有限公司 Text modification method, device, smart machine and readable storage medium storing program for executing
CN111209865A (en) * 2020-01-06 2020-05-29 中科鼎富(北京)科技发展有限公司 File content extraction method and device, electronic equipment and storage medium
CN112101032A (en) * 2020-08-31 2020-12-18 广州探迹科技有限公司 Named entity identification and error correction method based on self-distillation
CN112364172A (en) * 2020-10-16 2021-02-12 上海晏鼠计算机技术股份有限公司 Method for constructing knowledge graph in government official document field
CN112445915A (en) * 2021-01-28 2021-03-05 京华信息科技股份有限公司 Document map extraction method and device based on machine learning and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7188064B2 (en) * 2001-04-13 2007-03-06 University Of Texas System Board Of Regents System and method for automatic semantic coding of free response data using Hidden Markov Model methodology
US7469251B2 (en) * 2005-06-07 2008-12-23 Microsoft Corporation Extraction of information from documents
US10740545B2 (en) * 2018-09-28 2020-08-11 International Business Machines Corporation Information extraction from open-ended schema-less tables

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Information Extraction from Chinese Scientific Research Papers Based on Hidden Markov Models; Yu Jiangde et al.; Computer Engineering (Issue 19); pp. 190-192 *

Also Published As

Publication number Publication date
CN113095061A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
US10650192B2 (en) Method and device for recognizing domain named entity
JP4829920B2 (en) Form automatic embedding method and apparatus, graphical user interface apparatus
KR101114194B1 (en) Assisted form filling
CN110232340B (en) Method and device for establishing video classification model and video classification
CN110136747A (en) A kind of method, apparatus, equipment and storage medium for evaluating phoneme of speech sound correctness
US20150213333A1 (en) Method and device for realizing chinese character input based on uncertainty information
CA3048356A1 (en) Unstructured data parsing for structured information
US20190114313A1 (en) User interface for contextual document recognition
CN110968730B (en) Audio mark processing method, device, computer equipment and storage medium
CN110147545B (en) Method and system for structured output of text, storage medium and computer equipment
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
CN113095061B (en) Method, system, device and storage medium for extracting document header
US20170351661A1 (en) System and method for understanding text using a translation of the text
US11431472B1 (en) Automated domain language parsing and data extraction
CN115168345A (en) Database classification method, system, device and storage medium
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN114065762A (en) Text information processing method, device, medium and equipment
CN115270768A (en) Method and equipment for determining target key words to be corrected in text
CN114510925A (en) Chinese text error correction method, system, terminal equipment and storage medium
CA3156204A1 (en) Domain based text extraction
US10944569B2 (en) Comparison and validation of digital content using contextual analysis
US20240020473A1 (en) Domain Based Text Extraction
CN114049528B (en) Brand name identification method and equipment
CN116341554B (en) Training method of named entity recognition model for biomedical text
CN114282518A (en) Document field element extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant