CN112668316A - word document key information extraction method - Google Patents


Publication number
CN112668316A
Authority
CN
China
Prior art keywords
word document
key information
file
paragraph
paragraphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011290565.XA
Other languages
Chinese (zh)
Inventor
张丽
董雨辰
张翔宇
杜慧
解峥
钟习
陈志鹏
俞晓明
刘悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and National Computer Network and Information Security Management Center
Priority to CN202011290565.XA
Publication of CN112668316A

Abstract

The invention discloses a word document key information extraction method, which comprises the following steps: step one, obtaining a source word document, traversing paragraphs of the word document, judging whether any paragraph has a template style attribute, if so, entering a step two, otherwise, entering a step three; step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file; and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph and inputting the paragraph into the region corresponding to the information category. The invention utilizes the template style attribute information in the word document, thereby greatly improving the efficiency of extracting the key information from the word document.

Description

word document key information extraction method
Technical Field
The invention relates to the technical field of information content processing. More specifically, the invention relates to a method for extracting key information of a word document.
Background
Existing methods for extracting key information from MS Word documents mainly rely on programmers writing purpose-built extraction programs; these methods differ widely from one another, and no fixed standard exists. Existing key information extraction cannot effectively extract styled paragraphs in MS Word documents; the prior art is poorly customizable, and in many cases the user cannot choose which categories of key information to extract; no effective extraction scheme exists for paragraphs without styles; and the output of the extracted file is not standardized.
Disclosure of Invention
An object of the present invention is to solve at least the above problems and to provide at least the advantages described later.
The invention also aims to provide a method for extracting the key information of the word document, which classifies the paragraphs of the word document according to whether the paragraphs have the template style attributes by utilizing the template style attribute information of the paragraphs of the word document, adopts different key information extraction methods for the paragraphs of different types, and greatly improves the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting files with uniform format, so that the result of the program is clearer.
To achieve these objects and other advantages in accordance with the present invention, there is provided a word document key information extraction method, comprising:
step one, obtaining a source word document, traversing the paragraphs of the word document, and judging, for each paragraph, whether it has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;
step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file;
and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph, and inputting the paragraph into an area corresponding to the information category in the first output file.
Preferably, in the word document key information extraction method, the preset list of key information categories to be extracted at least includes a title, a text, a table and other categories.
Preferably, in the word document key information extraction method, identifying the information category of the paragraph based on the preset neural network model in step three specifically comprises: preprocessing the paragraph according to a preset format attribute rule and extracting a feature vector M, inputting the feature vector M into the preset neural network model, obtaining the output result of the neural network model, and determining the information category of the paragraph according to the output result;
wherein $M = [m_1, m_2, \ldots, m_n]$, and each $m_i$ represents a format attribute;
the neural network model comprises three fully-connected layers, wherein the output dimension of the first fully-connected layer is 50, the output dimension of the second fully-connected layer is 20, and the output dimension of the third fully-connected layer is n, where n equals the number of categories in the preset key information category list to be extracted.
Preferably, in the word document key information extraction method, the format attributes include at least one of font size, font style, text length, paragraph spacing, whether the text is bold, and whether the text is italic.
Preferably, the method for extracting the key information of the word document further comprises a step four of performing format processing on all paragraphs of the word document according to preset format attributes and forming a new word document as the second output file.
Preferably, in the word document key information extraction method, the first output file is in JSON format.
Preferably, in the word document key information extraction method, obtaining the word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; reading the two fields, parsing the file name and the file storage path, and acquiring the file;
the file is either a word document or a folder; if the file is a word document, the word document is obtained and all of its paragraphs are traversed; if the file is a folder, a plurality of threads are started, each thread obtaining at least one word document in the folder and traversing all paragraphs of that document.
Preferably, in the method for extracting key information of a word document, the configuration file further includes a key information category field to be extracted; in step one, while the word document is obtained, the key information category field to be extracted is read and the key information categories to be extracted are set, forming the preset key information category list to be extracted.
The invention also provides a word document key information extraction device, which comprises:
a processor;
a memory storing executable instructions;
the processor is configured to execute the executable instructions to execute the word document key information extraction method.
The invention at least comprises the following beneficial effects:
1. the invention uses the template style attribute information of word document paragraphs to classify the paragraphs of the word document according to whether the paragraphs have the template style attribute, and adopts different key information extraction methods for different types of paragraphs, thereby greatly improving the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting the file with the uniform format, so that the result of the program is clearer;
2. the categories of key information to be extracted are pre-stored in a configuration file; the program reads the configuration information from the configuration file and extracts from the target word document accordingly, which improves the flexibility and customizability of the program;
3. for style-free paragraphs, a preset neural network model computes and identifies their information categories, after which targeted extraction is performed; this greatly improves the processing efficiency for style-free paragraphs and thereby further improves the efficiency of extracting key information from the word document.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a flowchart of a word document key information extraction method according to the present invention.
Detailed Description
The present invention is further described in detail below with reference to the drawings and examples so that those skilled in the art can practice the invention with reference to the description.
It will be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.
In the description of the present invention, the terms "lateral", "longitudinal", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.
The invention provides a word document key information extraction method, which comprises the following steps:
step one, obtaining a source word document, traversing the paragraphs of the word document, and judging, for each paragraph, whether it has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;
step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file;
and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph, and inputting the paragraph into an area corresponding to the information category in the first output file.
In the technical scheme, the method and the device utilize the template style attribute information of the word document paragraphs, classify the paragraphs of the word document according to whether the paragraphs have the template style attribute, adopt different key information extraction methods for different types of paragraphs, and greatly improve the extraction efficiency of the key information of the word document; the invention outputs the extracted key information by adopting files with uniform format, so that the result of the program is clearer.
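The style check in step one can be sketched with the Python standard library alone. This is a minimal illustration, not the patented implementation: it assumes the paragraph style is read from the `w:pStyle` element of `word/document.xml` inside the .docx package, and the function names `iter_paragraphs` and `split_by_style` are hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml of a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def iter_paragraphs(document_xml):
    """Yield (style_id, text) for every paragraph; style_id is None if absent."""
    root = ET.fromstring(document_xml)
    for p in root.iter(W + "p"):
        style = p.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val") if style is not None else None
        # A paragraph's text is spread over one or more w:t runs.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        yield style_id, text


def split_by_style(document_xml):
    """Route paragraphs: styled ones to step two, unstyled ones to step three."""
    styled, unstyled = [], []
    for style_id, text in iter_paragraphs(document_xml):
        (styled if style_id is not None else unstyled).append((style_id, text))
    return styled, unstyled
```

For a real file, the XML string could be obtained with `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`, since a .docx file is a ZIP archive.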
In another technical scheme, in the method for extracting the key information of the word document, the preset list of key information categories to be extracted at least comprises title, body text, table, and other categories. By extracting three kinds of key information (titles, body text, and tables) from a document and collecting them by category in the first output file, the method lets the user quickly grasp the main content and the key information of the document.
In another technical scheme, in the method for extracting the key information of the word document, identifying the information category of the paragraph based on the preset neural network model in step three specifically comprises: preprocessing the paragraph according to a preset format attribute rule and extracting a feature vector M, inputting the feature vector M into the preset neural network model, obtaining the output result of the neural network model, and determining the information category of the paragraph according to the output result;
wherein $M = [m_1, m_2, \ldots, m_n]$, and each $m_i$ represents a format attribute;
the neural network model comprises three fully-connected layers, wherein the output dimension of the first fully-connected layer is 50, the output dimension of the second fully-connected layer is 20, and the output dimension of the third fully-connected layer is n, where n equals the number of categories in the preset key information category list to be extracted.
For style-free paragraphs, the preset neural network model computes and identifies their information categories, after which targeted extraction is performed; this greatly improves the processing efficiency for style-free paragraphs and thereby further improves the efficiency of extracting key information from the word document.
The paragraph is preprocessed to obtain its feature vector, which serves as the input of the neural network model; the model computes an output result, from which the information category of the paragraph is determined. The neural network generalizes well and can extract each kind of key information in the document quickly and accurately.
The invention adopts a neural network model with three fully-connected layers:
the input dimension of the first fully-connected layer is 100 (the dimension of the input feature vector) and the output dimension is 50; this layer extracts hidden-layer features from the original features;
the input dimension of the second fully-connected layer is 50 and the output dimension is 20; this layer processes the hidden-layer features, that is, multiplies them by a weight matrix W, changing their dimension;
the input dimension of the third fully-connected layer is 20, and its output dimension is determined by the number of key information categories to be extracted, with the output values arranged in the same order as those categories. For example, if the key information categories to be extracted are title, body, and table, the output dimension of the third fully-connected layer is 3, and softmax converts the 3 values into three probabilities, which form the output result of the neural network model. If the final output for a certain paragraph is (0.2, 0.1, 0.7), the three probabilities sum to 1: the paragraph is a title with probability 0.2, a body paragraph with probability 0.1, and a table with probability 0.7. The information category of the paragraph is therefore determined to be table, and the paragraph is extracted and written into the table region of the first output file.
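The 100 → 50 → 20 → n structure described above can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions: the source names no activation function, so ReLU between layers is an assumption; the weights are random placeholders rather than trained values; and the helper names (`build_model`, `predict`, `pick_category`) are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Fully-connected layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    # Assumption: the source does not name an activation; ReLU is used here.
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would load trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_classes, seed=0):
    """Three layers with dimensions 100 -> 50 -> 20 -> n_classes."""
    rng = random.Random(seed)
    return [init_layer(100, 50, rng), init_layer(50, 20, rng), init_layer(20, n_classes, rng)]


def predict(model, features):
    """Return class probabilities for one 100-dimensional feature vector."""
    h = features
    for i, (weights, bias) in enumerate(model):
        h = dense(h, weights, bias)
        if i < len(model) - 1:  # no activation after the last layer, softmax follows
            h = relu(h)
    return softmax(h)


def pick_category(probs, categories):
    """Return the category whose probability is largest."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best]


CATEGORIES = ["title", "body", "table"]
model = build_model(len(CATEGORIES))
probs = predict(model, [0.5] * 100)
```

With the example output from the description, `pick_category([0.2, 0.1, 0.7], CATEGORIES)` selects `"table"`.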
Because this is a classification problem, a cross-entropy loss function is selected. The loss function measures the difference between the output of the third fully-connected layer and the true class, giving an error; the parameters of the third fully-connected layer are then optimized using back propagation and gradient descent, and the parameters of the second and first fully-connected layers are optimized by applying the chain rule.
Extracting documents with a neural network model of this structure gives good results on both the validation set and the test set; the model is simple in structure, fast to run, and has a small error, so the extraction of key information from documents can be completed accurately and quickly.
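As a worked illustration of the loss, the sketch below computes the cross-entropy for the example output (0.2, 0.1, 0.7), assuming the true class is table. It also uses the standard result that, for softmax followed by cross-entropy, the gradient with respect to the pre-softmax values is the predicted probabilities minus the one-hot label. The function names are hypothetical.

```python
import math


def cross_entropy(probs, true_index):
    """Loss for one sample: -log of the probability assigned to the true class."""
    return -math.log(probs[true_index])


def grad_wrt_logits(probs, true_index):
    """Gradient of softmax + cross-entropy with respect to the logits: p - y."""
    return [p - (1.0 if i == true_index else 0.0) for i, p in enumerate(probs)]


probs = [0.2, 0.1, 0.7]          # example output from the description
loss = cross_entropy(probs, 2)   # assumed true class: table (index 2)
grad = grad_wrt_logits(probs, 2)
```

Here the loss is -ln(0.7), roughly 0.357, and the gradient is approximately (0.2, 0.1, -0.3); back propagation passes this gradient through the third, second, and first layers in turn via the chain rule.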
In another technical scheme, the format attributes include at least one of font size, font style, text length, paragraph spacing, whether the text is bold, whether the text is italic, and the like. Through feature processing of the format attributes, each style-free paragraph can be represented by a vector of fixed dimension; each dimension of the vector represents one feature of the paragraph, and a feature may be discrete or continuous. For example, the typeface is a discrete feature: 1 for Song typeface, 2 for Hei typeface, 3 for Li (clerical) script, and so on. Line spacing is a continuous feature whose value is the line spacing itself. If a paragraph is set in Song typeface at font size No. 2, with bold, underline, italic, and similar attributes each encoded as a flag, processing the paragraph according to the format attributes yields a numerical vector such as [1, 2, 1, 0, 0, x].
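The encoding of format attributes into a numeric vector might look like the following sketch. The typeface codes 1, 2, 3 come from the description; the labels "Song", "Hei", and "Li" for those three typefaces, the order of the vector entries, the `ParagraphFormat` class, and the code 0 for an unlisted typeface are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Typeface codes from the description: 1, 2, 3 for the three listed typefaces.
# The labels "Song", "Hei", "Li" are illustrative names for them.
FONT_CODES = {"Song": 1, "Hei": 2, "Li": 3}


@dataclass
class ParagraphFormat:
    font: str
    font_size: float
    bold: bool
    underline: bool
    italic: bool
    line_spacing: float


def encode(fmt: ParagraphFormat) -> List[float]:
    """Map format attributes to a vector; discrete features become codes or 0/1 flags."""
    return [
        float(FONT_CODES.get(fmt.font, 0)),  # assumption: 0 for an unlisted typeface
        float(fmt.font_size),
        float(fmt.bold),
        float(fmt.underline),
        float(fmt.italic),
        float(fmt.line_spacing),             # continuous feature, used as-is
    ]


vector = encode(ParagraphFormat("Song", 2, True, False, False, 1.5))
```

The description states that the model input has 100 dimensions; how the remaining dimensions are filled is not specified in the source, so this sketch stops at the six attributes of the example.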
In another technical scheme, the method for extracting the key information of the word document further comprises a step four of performing format processing on all paragraphs of the word document according to preset format attributes and forming a new word document as the second output file. After the extraction of the key information is finished, the document is formatted according to the preset format attributes (template style attribute characteristics); that is, the titles, body text, tables, and so on of the document are typeset in a unified, standard format and output as the second file, which makes later management and lookup of the document easier for the user.
In another technical scheme, in the word document key information extraction method, the first output file is in JSON format. A JSON file is convenient to transmit and to parse.
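One possible shape for the first output file is a JSON object with one region per information category. The source does not specify the schema, so the key names and the `build_output` helper below are hypothetical.

```python
import json


def build_output(extracted):
    """Group (category, text) pairs into one region per category and serialize."""
    regions = {}
    for category, text in extracted:
        regions.setdefault(category, []).append(text)
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file.
    return json.dumps(regions, ensure_ascii=False, indent=2)


sample = [("title", "1 Overview"), ("body", "This report ..."), ("title", "2 Method")]
output = build_output(sample)
```

Paragraphs of the same category end up in the same list, in document order.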
In another technical scheme, in the method for extracting the key information of the word document, obtaining the word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; reading the two fields, parsing the file name and the file storage path, and acquiring the file;
the file is either a word document or a folder; if the file is a word document, the word document is obtained and all of its paragraphs are traversed; if the file is a folder, a plurality of threads are started, each thread obtaining at least one word document in the folder and traversing all paragraphs of that document.
The to-be-processed file name field and file storage path field form the first type of configuration information in the configuration file, "file_to_extract". For example, if a file on the computer's F drive needs to be processed, the configuration file is filled in with "file_to_extract": "F/data/". Before document extraction, the program first reads the configuration file, reads and parses the "file_to_extract" field, and obtains the document to be processed according to the file storage path and file name;
The invention can process not only a single document but also a folder storing a plurality of documents. If the target is a single document, the program obtains it directly and traverses all of its paragraphs. If the target is a folder, a plurality of threads are started, each thread handling at least one document in the folder; each thread recursively processes and extracts from its documents, traversing all paragraphs of each one, and the extraction results of all documents in the folder are finally merged and output in a single file.
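The single-file versus folder handling can be sketched with a thread pool from the Python standard library. This is an illustration under assumptions: the function names, the `.docx` extension filter, the worker count, and the reading of "recursive processing" as descending into subfolders are not taken from the source, and `extract_one` stands in for whatever per-document extraction is used.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_documents(target):
    """Return the .docx files named by target: the file itself or a folder's contents."""
    path = Path(target)
    if path.is_file():
        return [path]
    return sorted(path.rglob("*.docx"))  # assumption: recurse into subfolders


def extract_all(target, extract_one, max_workers=4):
    """Run extract_one on every document in worker threads and merge the results."""
    documents = find_documents(target)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(extract_one, documents))
    # Merge: one entry per document, so everything can be written to a single file.
    return {str(doc): result for doc, result in zip(documents, results)}
```

`pool.map` returns results in the same order as its input, so each result can be paired with its document when merging.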
In another technical scheme, in the method for extracting the key information of the word document, the configuration file further comprises a key information category field to be extracted; in step one, while the word document is obtained, the key information category field to be extracted is read and the key information categories to be extracted are set, forming the preset key information category list to be extracted. The key information category field to be extracted is the second type of configuration information, "class_to_extract". For example, to extract the title, body, and table key information in the document, the configuration file may be filled in with "class_to_extract": [0, 1, 2, 3, 4, 5], wherein 0 represents a title, 1 represents a body, 2 represents a table, 3 represents a title in a style-free paragraph, 4 represents a body in a style-free paragraph, and 5 represents a table in a style-free paragraph. The key information to be extracted is pre-stored in the configuration file; the program reads this information from the configuration file, applies the settings, and then extracts, which improves the flexibility and customizability of the program.
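Reading the two configuration fields can be sketched as follows. The sketch assumes the configuration file is JSON, which matches the brace-and-quote notation used in the description but is not stated outright; the `load_config` function and the validation of unknown codes are hypothetical additions.

```python
import json

# Category codes as listed in the description.
CATEGORY_NAMES = {
    0: "title",
    1: "body",
    2: "table",
    3: "title (style-free paragraph)",
    4: "body (style-free paragraph)",
    5: "table (style-free paragraph)",
}


def load_config(text):
    """Parse the configuration and return (target path, list of category codes)."""
    config = json.loads(text)
    target = config["file_to_extract"]
    codes = config["class_to_extract"]
    unknown = [c for c in codes if c not in CATEGORY_NAMES]
    if unknown:
        raise ValueError(f"unknown category codes: {unknown}")
    return target, codes


target, codes = load_config(
    '{"file_to_extract": "F/data/", "class_to_extract": [0, 1, 2, 3, 4, 5]}'
)
```

Restricting `class_to_extract` to, say, `[0, 3]` would limit extraction to titles, whether identified by style or by the model.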
The documents listed in the present invention are all represented as word documents.
The invention also provides a word document key information extraction device, which comprises:
a processor;
a memory storing executable instructions;
the processor is configured to execute the executable instructions to execute the word document key information extraction method.
This technical scheme is based on the same inventive concept as the word document key information extraction method, and reference can be made to the description of the method. The device of this technical scheme includes, but is not limited to, a PC, a terminal, or a server. The device can be deployed in a server to acquire files and extract their key information.
The following is a specific example: extracting three kinds of key information (title, body, and table) from a certain file on a computer;
as shown in fig. 1, the method for extracting the key information of the word document comprises the following steps:
step 100, filling in a configuration file: {"file_to_extract": "F/data/", "class_to_extract": [0, 1, 2, 3, 4, 5]};
file_to_extract: this field is the name of the file or folder to be extracted;
class_to_extract: this field is the list of categories of key information to be extracted, where 0 represents a title, 1 represents a body, 2 represents a table, 3 represents a title in a style-free paragraph, 4 represents a body in a style-free paragraph, and 5 represents a table in a style-free paragraph.
Step 200, running the program, which reads and parses the configuration file: first, the program obtains the file_to_extract field filled in by the user, parses it to locate the file to be processed, and judges whether the target is a single file or a folder. If it is a single file, the program obtains the file directly. If it is a folder containing a plurality of files, a plurality of threads are started, each thread handling at least one file in the folder; each thread recursively processes and extracts from its files, traversing all paragraphs of each one, and the extraction results of all files in the folder are finally merged and output in a single file;
then, the class_to_extract field is read and the categories of key information to be extracted are parsed; the program sets these categories, forming the list of key information categories to be extracted, [0, 1, 2, 3, 4, 5].
Step 300, judging, for each paragraph of the acquired document, whether it has a template style attribute; if so, entering step 301; if not, entering step 302;
step 301, obtaining the information category of the paragraph according to its template style attribute, matching the information category against the preset key information category list to be extracted, and judging whether the paragraph belongs to one of the categories in the list; if so, the paragraph is extracted and written into the region of the first output file corresponding to its information category; otherwise, the paragraph is not extracted;
step 302, preprocessing the paragraph without a template style attribute according to a preset format attribute rule and extracting a feature vector M, taking the feature vector M as the input of the preset neural network model, and obtaining the output result of the neural network model, $[P_1, P_2, \ldots, P_n]$; the information category corresponding to the largest $P_i$ is the information category of the paragraph; the information category is matched against the preset key information category list to be extracted to judge whether the paragraph belongs to one of the categories in the list; if so, the paragraph is extracted and written into the region of the first output file corresponding to its information category; otherwise, the paragraph is not extracted;
and step 400, performing format processing on all paragraphs of the word document according to preset format attributes, and forming a new word document as the second output file.
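Steps 300 to 302 amount to a dispatch followed by a filter, which can be sketched as below. Several points are assumptions rather than statements from the source: the style identifiers in `STYLE_TO_CODE` are hypothetical, model-classified paragraphs are mapped to codes 3 to 5 by adding 3 to the index of the largest probability, and a styled paragraph whose style is not in the mapping is simply skipped.

```python
# Hypothetical mapping from Word style identifiers to category codes 0-2.
STYLE_TO_CODE = {"Title": 0, "Heading1": 0, "BodyText": 1, "TableText": 2}


def classify(style_id, probs):
    """Return a category code: 0-2 via the style (step 301), 3-5 via the model (step 302)."""
    if style_id is not None:
        return STYLE_TO_CODE.get(style_id)  # None if the style is not mapped
    # Index of the largest probability among (title, body, table).
    best = max(range(len(probs)), key=probs.__getitem__)
    return 3 + best


def extract(paragraphs, wanted_codes):
    """Keep only paragraphs whose category code is in the configured list."""
    output = {}
    for style_id, text, probs in paragraphs:
        code = classify(style_id, probs)
        if code in wanted_codes:
            output.setdefault(code, []).append(text)
    return output


paragraphs = [
    ("Heading1", "Intro", None),            # styled: classified by its style
    (None, "cell data", [0.2, 0.1, 0.7]),   # unstyled: classified by model output
    (None, "plain", [0.1, 0.8, 0.1]),
]
result = extract(paragraphs, [0, 5])
```

With the category list `[0, 5]`, only the styled title and the model-identified table paragraph are kept; the unstyled body paragraph (code 4) is not extracted.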
The number of apparatuses and the scale of the process described herein are intended to simplify the description of the present invention. Applications, modifications and variations of the present invention will be apparent to those skilled in the art.
While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims (9)

  1. The method for extracting the key information of the word document is characterized by comprising the following steps:
    step one, obtaining a source word document, traversing the paragraphs of the word document, and judging, for each paragraph, whether it has a template style attribute; if the paragraph has a template style attribute, entering step two; if the paragraph does not have a template style attribute, entering step three;
    step two, acquiring paragraph information categories according to template style attributes of paragraphs, matching the paragraph information categories with a preset key information category list to be extracted, extracting the paragraphs and inputting the paragraphs into a region corresponding to the information categories in the first output file;
    and step three, identifying the information category of the paragraph based on a preset neural network model, matching the information category with a preset key information category list to be extracted, extracting the paragraph, and inputting the paragraph into an area corresponding to the information category in the first output file.
  2. 2. The word document key information extraction method according to claim 1, wherein the preset key information category list to be extracted at least includes categories such as a title, a text, a table and the like.
  3. 3. The word document key information extraction method according to claim 2, wherein in step three, the information category of the paragraph identified based on the preset neural network model specifically is: preprocessing the paragraph according to a preset format attribute rule, extracting to obtain a feature vector M, inputting the feature vector M into a preset neural network model, obtaining an output result of the neural network model, and determining the information category of the paragraph according to the output result;
    wherein, M ═ M1、m2、…mn]Wherein m represents a format attribute;
    the neural network model comprises three fully-connected layers, wherein the output dimensionality of the first fully-connected layer is 50; the output dimension of the second fully-connected layer is 20, and the output dimension of the third fully-connected layer is n; n is equal to the number of categories in the preset key information category information to be extracted.
  4. 4. The word document key information extraction method of claim 3, wherein the format attribute includes at least one of a font size, a font style, a text length, a segment spacing, whether to darken, whether to italics, and the like.
  5. 5. The word document key information extraction method of claim 4, further comprising a fourth step of performing format processing on all paragraphs of the word document according to preset format attributes, and forming a new word document as an output file two.
  6. 6. The word document key information extraction method of claim 5, wherein the first file is in a json format.
  7. The word document key information extraction method of claim 6, wherein obtaining the word document in step one specifically comprises: filling in a configuration file, wherein the configuration file comprises a to-be-processed file name field and a to-be-processed file storage path field; reading the to-be-processed file name field and the to-be-processed file storage path field, parsing the file name and the file storage path, and obtaining the file;
    wherein the file is a word document or a folder; if the file is a word document, the word document is obtained and all paragraphs in the word document are traversed; and if the file is a folder, a plurality of threads are started, each thread obtaining at least one word document in the folder and traversing all paragraphs in that word document.
  8. The word document key information extraction method of claim 7, wherein the configuration file further includes a field for the key information categories to be extracted; in step one, while the word document is obtained, this field is read and the key information categories to be extracted are set, forming the preset key information category list to be extracted.
  9. A word document key information extraction device, comprising:
    a processor;
    a memory storing executable instructions;
    wherein the processor is configured to execute the executable instructions to perform the word document key information extraction method according to any one of claims 1 to 8.
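
The classifier of claims 3 and 4 can be illustrated with a minimal sketch. The claims fix only the layer widths (50, 20, and the number of categories); the ReLU activations, the softmax output, the random weights, the `ParagraphFormat` container, and the category names below are all assumptions made for illustration, not details taken from the patent.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ParagraphFormat:
    """Hypothetical container for some of the format attributes in claim 4."""
    font_size: float
    text_length: int
    paragraph_spacing: float
    bold: bool
    italic: bool


def to_feature_vector(fmt):
    """Build the feature vector M, one entry per format attribute."""
    return [
        fmt.font_size,
        float(fmt.text_length),
        fmt.paragraph_spacing,
        1.0 if fmt.bold else 0.0,
        1.0 if fmt.italic else 0.0,
    ]


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases; a trained model would load real values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """One fully-connected layer, optionally followed by ReLU (an assumption)."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def softmax(z):
    """Numerically stable softmax over the final layer's outputs."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, layers, categories):
    """Forward pass through widths 50 -> 20 -> len(categories)."""
    hidden1 = dense(features, layers[0], relu=True)
    hidden2 = dense(hidden1, layers[1], relu=True)
    probs = softmax(dense(hidden2, layers[2], relu=False))
    return categories[probs.index(max(probs))], probs


CATEGORIES = ["title", "heading", "body"]  # hypothetical category list
rng = random.Random(0)
features = to_feature_vector(ParagraphFormat(16.0, 12, 6.0, True, False))
layers = [
    make_layer(len(features), 50, rng),
    make_layer(50, 20, rng),
    make_layer(20, len(CATEGORIES), rng),
]
label, probs = classify(features, layers, CATEGORIES)
```

With untrained random weights the predicted label is meaningless; the sketch only shows how a format-attribute vector flows through the three layers to one score per category.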
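
The file acquisition of claims 7 and 8 might look roughly like the following sketch. The claims do not specify the configuration file's format, so JSON is assumed here, as are the key names `file_name`, `file_path`, and `categories`, the `.docx` filter, and the `handle_document` callback that stands in for paragraph traversal.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_config(config_path):
    """Read the fields the claims describe; the key names are assumed."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    target = Path(config["file_path"]) / config["file_name"]
    categories = list(config.get("categories", []))
    return target, categories


def collect_documents(target):
    """A single document yields itself; a folder yields every .docx inside it."""
    if target.is_dir():
        return sorted(target.glob("*.docx"))
    return [target]


def process_all(target, handle_document, max_workers=4):
    """Apply handle_document to each document, using threads for a folder."""
    documents = collect_documents(target)
    if len(documents) <= 1:
        return [handle_document(doc) for doc in documents]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with documents.
        return list(pool.map(handle_document, documents))
```

A thread pool is one way to satisfy "one thread obtains at least one word document": each worker picks up documents from the folder until none remain.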
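
Claim 6 states only that the first output file is in JSON format. One possible layout, assumed here rather than taken from the patent, is a JSON object with one array per requested category; the function names and the sample categories are hypothetical.

```python
import json


def build_output(classified_paragraphs, wanted_categories):
    """Group paragraph text by category, keeping only the requested categories."""
    output = {category: [] for category in wanted_categories}
    for category, text in classified_paragraphs:
        if category in output:
            output[category].append(text)
    return output


def write_output(output, path):
    """Write the grouped paragraphs as UTF-8 JSON, leaving non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(output, handle, ensure_ascii=False, indent=2)
```

Paragraphs whose category is not in the requested list are dropped, which mirrors the matching against the preset category list described in the claims.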
CN202011290565.XA 2020-11-17 2020-11-17 word document key information extraction method Pending CN112668316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011290565.XA CN112668316A (en) 2020-11-17 2020-11-17 word document key information extraction method

Publications (1)

Publication Number Publication Date
CN112668316A (en) 2021-04-16

Family

ID=75403679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011290565.XA Pending CN112668316A (en) 2020-11-17 2020-11-17 word document key information extraction method

Country Status (1)

Country Link
CN (1) CN112668316A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000259669A (en) * 1999-03-12 2000-09-22 Ntt Data Corp Document classification device and its method
CN1808424A (en) * 2005-01-21 2006-07-26 北京软件产品质量检测检验中心 Method of abstracting key information from documents
CN106294320A (en) * 2016-08-04 2017-01-04 武汉数为科技有限公司 A kind of terminology extraction method and system towards scientific paper
CN106446089A (en) * 2016-09-12 2017-02-22 北京大学 Method for extracting and storing multidimensional field key knowledge
CN108536683A (en) * 2018-04-18 2018-09-14 同方知网数字出版技术股份有限公司 A kind of paper fragmentation information abstracting method based on machine learning
CN109033282A (en) * 2018-07-11 2018-12-18 山东邦尼信息科技有限公司 A kind of Web page text extracting method and device based on extraction template
CN109657114A (en) * 2018-08-21 2019-04-19 国家计算机网络与信息安全管理中心 A method of extracting webpage semi-structured data
CN110069623A (en) * 2017-12-06 2019-07-30 腾讯科技(深圳)有限公司 Summary texts generation method, device, storage medium and computer equipment
CN111506696A (en) * 2020-03-03 2020-08-07 平安科技(深圳)有限公司 Information extraction method and device based on small number of training samples
CN111783399A (en) * 2020-06-24 2020-10-16 北京计算机技术及应用研究所 Legal referee document information extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
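Wang Bobo: "Information extraction system for three types of information disclosure announcements of listed companies", China Master's Theses Full-text Database, Information Science and Technology, no. 2, pages 138-2204 *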

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362026A (en) * 2021-06-04 2021-09-07 北京金山数字娱乐科技有限公司 Text processing method and device
CN117115844A (en) * 2023-10-19 2023-11-24 安徽科大国创智信科技有限公司 Intelligent data entry method for entity document
CN117115844B (en) * 2023-10-19 2024-01-12 安徽科大国创智信科技有限公司 Intelligent data entry method for entity document

Similar Documents

Publication Publication Date Title
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
Benchimol et al. Text mining methodologies with R: An application to central bank texts
JP7281905B2 (en) Document evaluation device, document evaluation method and program
CN112668316A (en) word document key information extraction method
CN106909609A (en) Method for determining similar character strings, method and system for searching duplicate files
CN116628229B (en) Method and device for generating text corpus by using knowledge graph
Toomey R for data science
US20240054284A1 (en) Spreadsheet table transformation
Zanibbi et al. Math search for the masses: Multimodal search interfaces and appearance-based retrieval
US20230237251A1 (en) Deriving global intent from a composite document to facilitate editing of the composite document
JP2020098592A (en) Method, device and storage medium of extracting web page content
Garrido-Munoz et al. A holistic approach for image-to-graph: application to optical music recognition
CN111143642A (en) Webpage classification method and device, electronic equipment and computer readable storage medium
CN112732910B (en) Cross-task text emotion state evaluation method, system, device and medium
CN111951079B (en) Credit rating method and device based on knowledge graph and electronic equipment
Trisal et al. K-RCC: A novel approach to reduce the computational complexity of KNN algorithm for detecting human behavior on social networks
CN113204624A (en) Multi-feature fusion text emotion analysis model and device
CN110083760B (en) Multi-recording dynamic webpage information extraction method based on visual block
CN111241329A (en) Image retrieval-based ancient character interpretation method and device
CN115129890A (en) Feedback data map generation method and generation device, question answering device and refrigerator
KR102570477B1 (en) Method for obtaining automatically user identification object in web page
CN115048536A (en) Knowledge graph generation method and device, computer equipment and storage medium
Xu et al. Estimating similarity of rich internet pages using visual information
CN110109994B (en) Automobile financial wind control system containing structured and unstructured data
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination