CN112989795A - Text information extraction method and device, computer equipment and storage medium - Google Patents

Publication number
CN112989795A
Authority
CN
China
Prior art keywords
information
text
extraction
extracted
value pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110182750.5A
Other languages
Chinese (zh)
Inventor
孟泽洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suning Financial Technology Nanjing Co Ltd
Original Assignee
Suning Financial Technology Nanjing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Financial Technology Nanjing Co Ltd
Priority to CN202110182750.5A
Publication of CN112989795A
Priority to CA3148074A (CA3148074A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Human Resources & Organizations (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text information extraction method and device, computer equipment, and a storage medium. The method comprises: obtaining a text to be extracted and an extraction rule corresponding to that text, wherein the extraction rule comprises an extraction field; determining, according to the file directory of the text to be extracted, the chapter position in the text of each item of directory information, and generating chapter information; dividing the chapter information according to a preset rule to generate a corresponding division list; and generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key-value pair information in a database. On the one hand, the method improves the efficiency of text extraction, helps avoid omissions and errors in information extraction, and improves extraction accuracy. On the other hand, splitting the long text avoids the runaway backtracking that regular expression matching may encounter, increases the fault tolerance of the code, and reduces overall running time.

Description

Text information extraction method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting text information, a computer device, and a storage medium.
Background
Publicly disclosed text in the financial field is often highly redundant, for example fund recruitment specifications, contract announcements, and related document types. Such documents usually aggregate mixed information and run to hundreds of pages. For fund information extraction tasks, the industry generally either copies and extracts the information manually through operations staff, or extracts it with simple regular expressions.
However, these conventional approaches have obvious disadvantages. Purely manual extraction involves a very large workload and many repetitive operations, so it is inefficient and labor-intensive. Extraction with simple regular expressions may miss information, and when the volume of disclosed announcement text is particularly large, similar content in different chapters often causes extraction errors that require substantial manual checking and verification. In addition, because fund issuers do not follow uniform structural requirements for wording, the texts describing fund state changes are usually merged and abbreviated to varying degrees, which also causes regular expression extraction to fail.
In view of the foregoing, it is desirable to provide a new method for extracting long text information to solve the above problems.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for extracting long text information, a computer device, and a storage medium, so as to overcome the problems in the prior art, such as large workload, low efficiency, high labor cost, and easy omission and error.
In order to solve one or more technical problems, the invention adopts the technical scheme that:
in a first aspect, a method for extracting long text information is provided, which includes the following steps:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
Further, the dividing list includes a paragraph list and a sentence list, the dividing the chapter information according to a preset rule, and the generating of the corresponding dividing list includes:
performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;
and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.
Further, when the target information corresponding to the extracted field is long text information, the generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database includes:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
Further, when the target information corresponding to the extracted field is short text information, the generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database includes:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
Further, when the extraction field is a state change, the generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information in a database comprises:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
Further, before storing the key-value pair information in the database, the method further comprises:
and denoising the key value pair information, and storing the denoised key value pair information into a database.
Further, the extraction rule includes a regular expression.
In a second aspect, there is provided a text information extraction apparatus, the apparatus including:
the data acquisition module is used for acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
the chapter acquisition module is used for determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted and generating chapter information;
the data dividing module is used for dividing the chapter information according to a preset rule to generate a corresponding division list;
and the information generation module is used for generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
In a fourth aspect, there is provided a computer readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, a text to be extracted and an extraction rule corresponding to the text to be extracted are obtained, wherein the extraction rule comprises an extraction field, the chapter position of each directory information in a file directory in the text to be extracted is determined according to the file directory of the text to be extracted, chapter information is generated, the chapter information is divided according to a preset rule, and a corresponding division list is generated; generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises target information corresponding to the division list and the extraction field, so that on one hand, the efficiency of text extraction is improved, the problems of information extraction omission, errors and the like are avoided, and the accuracy of text extraction is improved, on the other hand, by splitting a long text, the infinite backtracking condition possibly encountered in regular matching can be avoided, the fault tolerance of codes is increased, and the time consumption of overall operation is reduced;
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, paragraph division is performed on each chapter information according to preset paragraph characteristics, corresponding paragraph lists are respectively generated, sentence division is performed on each paragraph in each paragraph list according to preset sentence characteristics, corresponding sentence lists are respectively generated, and the text is accurately positioned to the chapter, paragraph and sentence levels in a directory hierarchical positioning mode, so that relevant information in the text to be extracted is accurately positioned and extracted;
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, noise reduction processing is performed on the key value pair information, the key value pair information after noise reduction processing is stored in the database, the key value pair information extracted from the filtered text is further screened, and the accuracy of information extraction in the long text is effectively improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a flow diagram illustrating a method for extracting information from a fund announcement long text, according to an example embodiment;
FIG. 2 is a flow diagram illustrating a method for information extraction of a fund state change, according to an example embodiment;
FIG. 3 is a flow diagram illustrating a method of textual information extraction, according to an example embodiment;
FIG. 4 is a schematic structural diagram illustrating a text information extraction apparatus according to an exemplary embodiment;
FIG. 5 is a schematic diagram of an internal structure of a computer device shown in accordance with an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, common information disclosure texts such as fund recruitment specifications and fund contracts usually aggregate mixed information and run to hundreds of pages. The information extraction workload for such texts is therefore very large, and omissions and errors occur easily.
To solve the above problems, an embodiment of the present invention provides a text information extraction method. Starting from the document structure of the text to be extracted, the method uses hierarchical directory positioning to locate text precisely at the chapter, paragraph, and sentence levels. To extract general information, it takes the sentences and paragraphs as input data, automatically detects the information to be extracted through a multi-rule scheme, and then denoises and calibrates the results to obtain the key-value pair information corresponding to the general information. To extract service-related state change information, it layers the state-describing words by part of speech and extracts a state change list through [action-service] combinations. This ensures the accuracy of information extraction and helps avoid omissions and errors. Splitting the long text also avoids the runaway backtracking that regular expression matching may encounter, increases the fault tolerance of the code, and reduces overall running time.
Example one
Specifically, as shown in FIG. 1, taking the disclosure texts of a fund as an example, the process of extracting information from a fund announcement long text by using the method includes:
Step one, obtaining an original long text sequence from which information is to be extracted, wherein the original long text sequence comprises a fund announcement long text;
specifically, the text to be extracted mainly includes disclosure texts such as fund recruitment specifications and fund contracts, obtained from official websites that publish disclosure information. It should be noted that these document types are only exemplary and do not limit the embodiment of the present invention; the method provided in the embodiment may also be applied to information extraction from other long texts with a fixed directory structure.
Step two, configuring an extraction rule for extracting information from the announcement long text;
specifically, this step injects the extraction rules used by the subsequent steps. The extraction rules include, but are not limited to, regular expressions defined in a configuration file and externally supplied manual rules. Multiple regular expressions for the same extraction field can be stacked. The external manual rules are mainly used to configure information such as the fields the user needs to extract; in a specific implementation they can be imported as a table file or configured through a back-end operation and maintenance platform. It should be noted that, in the embodiment of the present invention, the extraction rule combines multiple rules, which can effectively improve the efficiency and accuracy of information extraction from a long text.
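As an illustration of stacking several regular expressions on one extraction field, the following is a minimal Python sketch. The `ExtractionRule` class, the `fund_name` field, and both patterns are hypothetical names chosen for this example; the patent does not specify a rule format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractionRule:
    """One extraction field plus the regular expressions stacked for it (hypothetical format)."""
    field_name: str
    patterns: list = field(default_factory=list)

    def search(self, text):
        # Try each stacked pattern in order and return the first captured group.
        for pattern in self.patterns:
            match = re.search(pattern, text)
            if match:
                return match.group(1)
        return None


# Hypothetical rule: two patterns cover two common phrasings of the same field.
# The character class accepts both the ASCII and the full-width colon.
RULES = [
    ExtractionRule(
        "fund_name",
        [r"Fund name[::]\s*(.+)", r"Name of the fund[::]\s*(.+)"],
    ),
]
```

Because the patterns are tried in order, a more specific expression can be listed first and a more permissive fallback last.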
Step three, locating the chapter in which each item of directory information is found according to the file directory of the announcement long text, and generating chapter information;
specifically, the method provided by the embodiment of the present invention mainly processes long texts with a fixed directory structure, in which the directory information is usually the title of each chapter. As a preferred example, when locating the chapter for an item of directory information, the directory information may be used as an extraction field, and the chapter containing that field is automatically located and extracted through a preset chapter locating and filtering function, generating the corresponding chapter information, which includes the title and the full content of the chapter. In a specific implementation, the chapter blocks of a Chinese document can be located through regular expressions.
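The chapter locating step can be sketched roughly as follows. This is a simplified illustration, not the patent's chapter locating and filtering function: `locate_chapters` is a hypothetical name, and the sketch assumes that each chapter heading occupies a line of its own and that the table-of-contents page has already been removed from the body text.

```python
import re


def locate_chapters(text, directory):
    """Map each directory title to the text of its chapter, title included."""
    starts = []
    cursor = 0
    for title in directory:
        # Assumption: a heading is a line that contains nothing but the title.
        heading = re.compile(
            r"^[ \t]*" + re.escape(title) + r"[ \t]*$", re.MULTILINE
        )
        match = heading.search(text, cursor)
        if match:
            starts.append((title, match.start()))
            cursor = match.end()
    chapters = {}
    for i, (title, start) in enumerate(starts):
        # A chapter runs until the next located heading or the end of the text.
        end = starts[i + 1][1] if i + 1 < len(starts) else len(text)
        chapters[title] = text[start:end].strip()
    return chapters
```

Searching from the end of the previous heading keeps the chapters in directory order and prevents a title that is quoted later in the body from being mistaken for an earlier heading.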
Step four, dividing the chapter information to generate a corresponding paragraph list and a corresponding sentence list;
specifically, the chapter information generated in the above step is further refined into paragraph text blocks and sentence text blocks, and a paragraph list and a sentence list are generated respectively. In a specific implementation, the text in a chapter is divided into paragraphs according to paragraph features. Features of a Chinese paragraph include, but are not limited to, blank space at the end of the last line of a paragraph and indentation at the beginning of its first line. The paragraphs generated above are then further divided into sentences according to sentence features. Sentence features include, but are not limited to, sentence-ending symbols such as periods and exclamation marks.
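A minimal Python sketch of this two-level division is shown below. The function names and the exact feature set are assumptions for illustration: a paragraph is taken to start at an indented line or after a blank line, and an ASCII period counts as a sentence end only when followed by whitespace or the end of the text, so that decimals such as "1.5%" stay intact.

```python
import re

# Sentence ends: Chinese or ASCII terminators; a lone ASCII period only counts
# when followed by whitespace or the end of the paragraph.
SENTENCE = re.compile(r".+?(?:[。!?!?]|\.(?=\s|$)|$)")


def split_paragraphs(chapter_text, joiner=""):
    """Split a chapter into paragraphs at blank lines and indented lines.

    `joiner` rejoins hard-wrapped lines: "" suits Chinese text, " " suits English.
    """
    paragraphs, current = [], []
    for line in chapter_text.splitlines():
        if not line.strip():
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(joiner.join(current))
                current = []
            continue
        if line[0] in " \t\u3000" and current:
            # An indented line opens a new paragraph.
            paragraphs.append(joiner.join(current))
            current = []
        current.append(line.strip())
    if current:
        paragraphs.append(joiner.join(current))
    return paragraphs


def split_sentences(paragraph):
    """Split one paragraph into sentences, keeping the ending symbol."""
    return [s.strip() for s in SENTENCE.findall(paragraph) if s.strip()]
```

Running `split_sentences` over every entry returned by `split_paragraphs` yields the sentence lists that the later extraction steps consume.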
Step five, extracting information from the paragraph list and the sentence list according to the extraction rule configured in step two, and obtaining key-value pair information corresponding to the announcement long text;
specifically, the key in the key-value pair information is an extraction field defined in the extraction rule, and the value is the related information extracted for that field from the paragraph list and the sentence list according to the extraction rule. Different kinds of information may be extracted in different ways. For example, in a fund profile some values are descriptive text, so the value may be a sentence, several sentences, or a paragraph, and adjacent sentences (or paragraphs) are brought into the search. For instance, when the paragraph or sentence containing the extraction field ends with a colon, the content after the colon is usually the information to be extracted, so the paragraph or sentence following the colon is taken as the target information. When the value to be extracted is instead a specific type of short information, for example a date, it is usually contained within a single sentence, and the relevant information in that sentence can be extracted by a target detection method.
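The two cases described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `extract_long_value` and `extract_date` are hypothetical helper names, and a plain date regular expression stands in for the target detection method, whose details the text does not give.

```python
import re

# Stand-in for "target detection": a date in Chinese or ISO-like notation.
DATE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{2}-\d{2}")


def extract_long_value(sentences, field_name):
    """Long-text case: if the sentence holding the field ends with a colon,
    take the adjacent (next) sentence as the value."""
    for i, sentence in enumerate(sentences):
        if field_name not in sentence:
            continue
        stripped = sentence.rstrip()
        if stripped.endswith((":", ":")) and i + 1 < len(sentences):
            return sentences[i + 1].strip()
        # Otherwise fall back to whatever follows the field in the same sentence.
        tail = stripped.split(field_name, 1)[1].lstrip(" ::")
        return tail or None
    return None


def extract_date(sentences, field_name):
    """Short-text case: find a date inside the sentence that mentions the field."""
    for sentence in sentences:
        if field_name in sentence:
            match = DATE.search(sentence)
            if match:
                return match.group(0)
    return None
```

For example, with the sentence list `["Investment objective:", "Seek long-term capital growth.", "The fund was established on 2020-01-15."]`, the first helper returns the second sentence for the field "Investment objective", and the second helper returns the date for the field "established".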
Step six, performing noise reduction processing on the key-value pair information to obtain the processed key-value pair information.
Specifically, a series of output values (i.e., key-value pair information) is obtained through the previous steps. Although the passage corresponding to the information has been located accurately, the output values may still contain noise or even confused extractions. To address this, the embodiment of the invention introduces a value noise-reduction filter that further cleans out redundant or unreasonable results. The noise reduction processing includes, but is not limited to, value type checking (for cleaning values within a sentence) and value truncation (for information spanning sentences), which are not described in detail here.
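Since the filter is not described in detail, the following Python sketch shows only one plausible reading of value type checking and value truncation. The function names, the date format, and the stop markers are all assumptions made for this example.

```python
from datetime import datetime


def check_date(value):
    """Value type check: keep the value only if it parses as a calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return value
    except ValueError:
        return None


def truncate(value, stop_markers=("。", ";", ";", "\n")):
    """Value truncation: cut a text value at the first stop marker."""
    cut = len(value)
    for marker in stop_markers:
        pos = value.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return value[:cut].strip()


def denoise(pairs, checkers):
    """Apply a per-field checker (default: truncation) and drop empty results."""
    cleaned = {}
    for key, value in pairs.items():
        result = checkers.get(key, truncate)(value)
        if result:
            cleaned[key] = result
    return cleaned
```

Fields with a known type get a dedicated checker, while free-text fields fall back to truncation; a value that fails its check is dropped rather than stored.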
Step seven, manually reviewing and verifying the key-value pair information after noise reduction processing, and storing the key-value pair information that passes manual review in a database.
Specifically, the information that passes manual review can serve as basic fund information, provide a basis for a range of fund diagnosis and screening, and provide data support for internal and external platforms.
Specifically, in a specific implementation, the above steps can be deployed on a pre-built big data cloud platform as a PySpark big data task that handles the daily incremental fund information extraction work and stores its output in a Hive table, so that a long text can be parsed and explored in near real time, on the order of minutes.
Specifically, as shown in fig. 2, taking a state change announcement text of the fund as an example, the embodiment of the present invention further provides an information extraction method for fund state change, which includes:
Step R0: extracting a corresponding sentence list from the state change announcement text of the fund, and analyzing the extracted sentence list together with the title of the announcement to obtain an analysis result.
Specifically, for the process of extracting the sentence list from the state change announcement text, refer to steps one to four above; it is not repeated here. It should also be noted that the state change announcement text of the fund is only exemplary and does not limit the embodiment of the present invention; the state change information extraction method provided in the embodiment may also be applied to other long texts with a fixed directory structure.
Specifically, because the titles of some announcement texts also contain information to be extracted, the title must be considered together with the body when analyzing state changes. For example, in an announcement titled "Announcement on suspending the large-amount purchase, fixed-amount investment, and transfer-in services of a certain money market fund", the phrase "suspending the large-amount purchase, fixed-amount investment, and transfer-in services" is also information to be extracted.
Step R1: performing action extraction on the analysis result to acquire action information;
specifically, in the embodiment of the present invention, the action information includes the action words appearing in the announcement text and its title. Action words are identified mainly by part of speech; action words related to business state changes that are common in the financial field include open, suspend, resume, and restrict.
Step R2: extracting the service from the analysis result to obtain service information;
specifically, in the embodiment of the present invention, the service information includes the service nouns appearing in the announcement text and its title, for example purchase, redemption, commitment, conversion, and transfer-in. Because a business change involves both a state and an amount, the embodiment of the invention also combines a service noun with certain modifiers to obtain a new service term, for example "large-amount redemption". In addition, services are often referred to by common phrases, aliases, and abbreviations, and this step also replaces these uniformly with standard terms.
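A rough Python sketch of service extraction with modifier combination and alias replacement follows. The vocabularies (`SERVICES`, `MODIFIERS`, `ALIASES`) are hypothetical English stand-ins; a real deployment would use the Chinese terms found in fund announcements.

```python
import re

# Hypothetical vocabularies, for illustration only.
ALIASES = {"switch-in": "transfer-in", "regular investment": "fixed-amount investment"}
SERVICES = ["purchase", "redemption", "conversion", "transfer-in", "fixed-amount investment"]
MODIFIERS = ["large-amount"]

_MOD = "|".join(re.escape(m) for m in MODIFIERS)
_SVC = "|".join(re.escape(s) for s in SERVICES)
# An optional modifier followed by a service noun forms one service term.
SERVICE_PATTERN = re.compile(rf"(?:({_MOD})\s+)?({_SVC})")


def normalize(text):
    """Replace aliases and abbreviations with the standard service term."""
    for alias, canonical in ALIASES.items():
        text = text.replace(alias, canonical)
    return text


def extract_services(text):
    """Return the distinct service terms in order of first appearance."""
    found = []
    for match in SERVICE_PATTERN.finditer(normalize(text.lower())):
        term = " ".join(part for part in match.groups() if part)
        if term not in found:
            found.append(term)
    return found
```

Normalizing before matching means the pattern only has to know the standard terms, so adding a new alias does not require touching the regular expression.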
Step R3: generating state change information according to the action information and the service information;
specifically, the action and service phrases extracted in the above steps are permuted and combined, and a complete change list (i.e., state change information) is obtained by matching the combinations against the enumerated state change values.
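The combination step might look like the sketch below, which forms every action-service pair and keeps only those present in an enumeration of valid state changes. `STATE_ENUM` and its contents are hypothetical; the patent does not list the enumeration values.

```python
from itertools import product

# Hypothetical enumeration of valid (action, service) state changes.
STATE_ENUM = {
    ("suspend", "purchase"),
    ("suspend", "redemption"),
    ("resume", "purchase"),
    ("resume", "redemption"),
    ("restrict", "large-amount purchase"),
}


def build_change_list(actions, services):
    """Combine every action with every service and keep only enumerated pairs."""
    return [
        f"{action} {service}"
        for action, service in product(actions, services)
        if (action, service) in STATE_ENUM
    ]
```

For instance, `build_change_list(["suspend"], ["purchase", "redemption"])` expands a merged description into the two separate changes "suspend purchase" and "suspend redemption", while pairs outside the enumeration are discarded.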
Step R4: and verifying the state change information, and storing the state change information in a database after the state change information passes the verification.
Specifically, the state change information may also be stored in the database as a key-value pair; in a specific implementation, the "state change" field is used as the key and the extracted state change information as the value.
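Key-value storage of this kind can be sketched with Python's built-in `sqlite3` module, used here purely as a self-contained stand-in for whatever database is actually deployed. The table name `extracted`, the column layout, and the `doc_id` identifier are assumptions for this example.

```python
import json
import sqlite3


def store_pairs(conn, doc_id, pairs):
    """Store each extraction field as a key and its JSON-encoded value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted ("
        "doc_id TEXT, key TEXT, value TEXT, PRIMARY KEY (doc_id, key))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO extracted VALUES (?, ?, ?)",
        [(doc_id, key, json.dumps(value, ensure_ascii=False)) for key, value in pairs.items()],
    )
    conn.commit()


def load_pairs(conn, doc_id):
    """Read back all key-value pairs stored for one document."""
    rows = conn.execute(
        "SELECT key, value FROM extracted WHERE doc_id = ?", (doc_id,)
    )
    return {key: json.loads(value) for key, value in rows}
```

Encoding the value as JSON lets a single "state change" key hold a whole list of changes, and the composite primary key makes a re-run for the same document overwrite earlier results instead of duplicating them.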
Specifically, in the embodiment of the present invention, the action and the service in a state change are split because long fund disclosure texts frequently omit part of the action or service description (for example, the state change "suspend purchase and redemption" covers both "suspend purchase" and "suspend redemption"). Splitting effectively mitigates, and can even avoid, incomplete extraction caused by misplaced or omitted information.
Embodiment two
Fig. 3 is a flowchart illustrating a text information extraction method according to an exemplary embodiment, and referring to fig. 3, the method includes the steps of:
S1: obtaining a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field.
Specifically, the text to be extracted includes, but is not limited to, long texts with a fixed directory structure, such as fund information disclosure prospectuses and fund contracts. It should be noted that the information extraction method provided by this embodiment may also be applied to other long texts whose structure and style are relatively standardized. The extraction rules comprise regular expressions from a configuration file and custom rules, wherein the custom rules are mainly used to configure information such as the fields the user needs to extract and can be adjusted according to the user's actual requirements, so that different information extraction requirements are met. Because the extraction rule combines multiple kinds of rules, the efficiency and accuracy of information extraction from a long text can be effectively improved.
S2: determining, according to the file directory of the text to be extracted, the chapter position of each piece of directory information in the file directory within the text to be extracted, and generating chapter information.
Specifically, some documents have a relatively fixed template structure, such as a directory structure. In this embodiment of the present invention, the file directory of the text to be extracted is used to position the text accurately down to the chapter, paragraph and sentence levels by means of hierarchical directory positioning, in preparation for subsequent information extraction. When locating chapters, the directory information (usually the title of each chapter) can be used as an extraction field; the chapter in which the field is located is automatically located and extracted by means of a regular expression, and the corresponding chapter information is generated.
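Chapter location by directory title can be sketched as below. This is an illustration under stated assumptions, not the method's actual implementation: the function name `locate_chapters` is hypothetical, each title is assumed to appear on a line of its own, and the input text is assumed to exclude the table of contents itself so that the first match of a title is the chapter heading.

```python
import re


def locate_chapters(text, titles):
    """Map each directory title to the body text of its chapter."""
    positions = []
    for title in titles:
        # Match the title on a line of its own; re.escape keeps it literal.
        pattern = rf"^[ \t]*{re.escape(title)}[ \t]*$"
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            positions.append((match.start(), match.end(), title))
    positions.sort()

    chapters = {}
    for i, (_, end, title) in enumerate(positions):
        # A chapter runs from the end of its title to the next located title.
        if i + 1 < len(positions):
            next_start = positions[i + 1][0]
        else:
            next_start = len(text)
        chapters[title] = text[end:next_start].strip()
    return chapters


doc = (
    "1. Fund Overview\n"
    "The fund is an equity fund.\n"
    "2. Fees\n"
    "Management fee: 1.5% per year.\n"
)
chapters = locate_chapters(doc, ["1. Fund Overview", "2. Fees"])
```

Titles that are not found are simply skipped, so a missing chapter does not shift the boundaries of the chapters that were located.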
S3: dividing the chapter information according to a preset rule to generate a corresponding division list;
Specifically, in order to improve the accuracy of information extraction, after the chapters of the text to be extracted are located, the chapter information is further subdivided: the chapter information corresponding to each chapter is divided into paragraphs, each paragraph is then divided into sentences in turn, and a paragraph list and a sentence list are generated from the division results for use in the subsequent steps. For the specific division process, reference may be made to the relevant steps of the first embodiment, which are not described herein again.
S4: generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
Specifically, information is extracted from the obtained division list according to the extraction field and the extraction rule to obtain the information to be extracted; key value pair information is then generated from the extraction field and the extracted information, and the key value pair information is stored in a database.
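Storing the key value pair information can be sketched with Python's built-in `sqlite3` module. The text does not name a database or a schema, so the table name `extraction`, the column layout, the document identifier `fund-001` and the helper names `store_pairs` and `load_pairs` are all assumptions made for illustration.

```python
import json
import sqlite3


def store_pairs(conn, doc_id, pairs):
    """Store each extraction field (key) with its JSON-encoded value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction ("
        "doc_id TEXT, field TEXT, value TEXT, "
        "PRIMARY KEY (doc_id, field))"
    )
    rows = [
        (doc_id, field, json.dumps(value, ensure_ascii=False))
        for field, value in pairs.items()
    ]
    conn.executemany("INSERT OR REPLACE INTO extraction VALUES (?, ?, ?)", rows)
    conn.commit()


def load_pairs(conn, doc_id):
    """Read the key value pairs of one document back as a dict."""
    cursor = conn.execute(
        "SELECT field, value FROM extraction WHERE doc_id = ?", (doc_id,)
    )
    return {field: json.loads(value) for field, value in cursor}


conn = sqlite3.connect(":memory:")
store_pairs(
    conn,
    "fund-001",
    {"state change": ["suspend purchase"], "establishment date": "2021-02-09"},
)
```

Encoding the value as JSON lets one column hold either a single string or a list, which suits fields such as a state change that may yield several results.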
As a preferred implementation manner, in an embodiment of the present invention, the division list includes a paragraph list and a sentence list, and the dividing the chapter information according to a preset rule to generate a corresponding division list includes:
performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;
and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.
Specifically, the paragraph division and sentence division process can refer to the content recorded in the relevant steps in the first embodiment, and will not be described herein again.
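Paragraph and sentence division can be sketched as follows. The preset paragraph and sentence characteristics are not specified in this section, so the sketch assumes that paragraphs are separated by line breaks and that sentences end with Chinese terminal punctuation, or with Western terminal punctuation followed by a space; the function names are hypothetical.

```python
import re


def split_paragraphs(chapter_text):
    """Assumed paragraph feature: paragraphs are separated by line breaks."""
    return [p.strip() for p in re.split(r"\n+", chapter_text) if p.strip()]


def split_sentences(paragraph):
    """Assumed sentence feature: terminal punctuation ends a sentence."""
    # Chinese terminal punctuation always closes a sentence.
    marked = re.sub(r"([。！？；])", r"\1\n", paragraph)
    # Western punctuation closes a sentence only when followed by a space,
    # so decimals such as "1.5" and dates such as "2021.02.09" stay intact.
    marked = re.sub(r"([.!?;])[ \t]+", r"\1\n", marked)
    return [s.strip() for s in marked.split("\n") if s.strip()]


def build_division_lists(chapter_text):
    """Return the paragraph list and one sentence list per paragraph."""
    paragraphs = split_paragraphs(chapter_text)
    sentences = [split_sentences(p) for p in paragraphs]
    return paragraphs, sentences


chapter = (
    "The fee is 1.5% per year. It is accrued daily.\n"
    "Redemption opens on 2021-03-01."
)
paragraphs, sentences = build_division_lists(chapter)
```

Each later regular expression then runs against one short sentence rather than the whole document, which keeps the amount of text any single match attempt has to scan small.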
As a preferred implementation manner, in an embodiment of the present invention, when target information corresponding to an extraction field is long text information, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
Specifically, when the target information corresponding to the extraction field is long text information (for example, part of the fund profile information is descriptive text), the target information may be a sentence, several sentences, or a paragraph, so adjacent sentences (or paragraphs) may be brought into the search, that is, the adjacent sentences or paragraphs are also taken into consideration. For example, when a paragraph or sentence containing the extraction field ends with a colon, the content following the colon is usually the information to be extracted, and the paragraph or sentence following the colon is taken as the extracted target information.
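The colon rule with an adjacent sentence can be sketched as below. The preset search rule is not spelled out in the text, so this sketch assumes a simple rule: if the sentence containing the field ends with a colon, the next sentence is the target; otherwise the remainder of the same sentence is used. The function name `extract_long_text` is hypothetical.

```python
def extract_long_text(field, sentences):
    """Return the target text for a field, looking at the adjacent sentence
    when the sentence containing the field ends with a colon."""
    for i, sentence in enumerate(sentences):
        if field not in sentence:
            continue
        ends_with_colon = sentence.rstrip().endswith((":", "："))
        if ends_with_colon and i + 1 < len(sentences):
            # The content after the colon sits in the adjacent sentence.
            return sentences[i + 1].strip()
        # Otherwise take what follows the field name in the same sentence.
        tail = sentence.split(field, 1)[1].strip(" :：")
        if tail:
            return tail
    return None


sentence_list = [
    "Investment objective:",
    "Seek long-term capital growth.",
    "Fund manager: ABC Asset Management.",
]
```

With this list, looking up "Investment objective" returns the following sentence, while looking up "Fund manager" returns the text after the colon in the same sentence.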
As a preferred implementation manner, in the embodiment of the present invention, when the target information corresponding to the extracted field is short text information, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
Specifically, when the target information corresponding to the extraction field is short text information, for example when the information to be extracted is a date, such information is usually embedded in a sentence, so the relevant information contained in the sentence may be extracted by means of target detection.
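Extraction of a short value embedded in a sentence can be sketched for the date case. The text does not describe how target detection is implemented, so a regular expression is used here as a stand-in; the pattern `DATE_PATTERN` and the function name `extract_date` are assumptions made for illustration.

```python
import re
from datetime import date

# Assumed date formats: 2021-02-09, 2021/2/9 and 2021年2月9日.
DATE_PATTERN = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})日?")


def extract_date(field, sentences):
    """Return the first valid date found in a sentence that mentions the
    field, normalized to ISO format, or None if there is none."""
    for sentence in sentences:
        if field not in sentence:
            continue
        match = DATE_PATTERN.search(sentence)
        if not match:
            continue
        year, month, day = (int(group) for group in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            # Digits that do not form a real calendar date are skipped.
            continue
    return None
```

Normalizing to one date format at extraction time means the stored values can be compared directly, whatever notation the source sentence used.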
As a preferred implementation manner, in an embodiment of the present invention, when the extraction field is a state change, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
Specifically, the process of extracting the state change information may refer to the content of the process of extracting the fund state change information in the first embodiment, and details are not described here.
As a preferred implementation manner, in an embodiment of the present invention, before storing the key-value pair information in the database, the method further includes:
and denoising the key value pair information, and storing the denoised key value pair information into a database.
Specifically, in order to improve the accuracy of information extraction, the key value pair information generated in the above steps is further filtered, and in specific implementation, the key value pair information may be subjected to noise reduction processing to remove redundant or unreasonable results.
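Noise reduction on the key value pair information can be sketched as a simple filter. The text only says that redundant or unreasonable results are removed, so the criteria below (empty values, duplicate values after whitespace normalization, and values longer than a hypothetical `max_len`) are assumptions, as is the function name `denoise`.

```python
def denoise(pairs, max_len=200):
    """Drop empty, duplicate and over-long values from each field."""
    cleaned = {}
    for field, values in pairs.items():
        seen = set()
        kept = []
        for value in values:
            # Collapse runs of whitespace so near-identical values compare equal.
            normalized = " ".join(value.split())
            if not normalized or len(normalized) > max_len or normalized in seen:
                continue
            seen.add(normalized)
            kept.append(normalized)
        if kept:
            cleaned[field] = kept
    return cleaned


raw_pairs = {
    "state change": ["suspend purchase", "suspend  purchase", ""],
    "fund manager": ["   "],
}
clean_pairs = denoise(raw_pairs)
```

Fields whose values are all filtered out are omitted entirely, so only fields with at least one usable value would be written to the database.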
As a preferred implementation manner, in the embodiment of the present invention, the extraction rule includes a regular expression.
Fig. 4 is a schematic structural diagram illustrating a text information extraction apparatus according to an exemplary embodiment, and referring to fig. 4, the apparatus includes:
the data acquisition module is used for acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
the chapter acquisition module is used for determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted and generating chapter information;
the data dividing module is used for dividing the chapter information according to a preset rule to generate a corresponding division list;
and the information generation module is used for generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
As a preferred implementation manner, in an embodiment of the present invention, the data dividing module includes:
the paragraph dividing unit is used for carrying out paragraph division on each chapter information according to preset paragraph characteristics and respectively generating corresponding paragraph lists;
and the sentence dividing unit is used for carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics and respectively generating corresponding sentence lists.
As a preferred implementation manner, in an embodiment of the present invention, the information generating module is specifically configured to:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in an embodiment of the present invention, the information generating module is further configured to:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in an embodiment of the present invention, the information generating module is further configured to:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
As a preferred implementation manner, in an embodiment of the present invention, the apparatus further includes:
and the noise reduction processing module is used for carrying out noise reduction processing on the key value pair information and storing the key value pair information subjected to noise reduction processing into a database.
As a preferred implementation manner, in the embodiment of the present invention, the extraction rule includes a regular expression.
Fig. 5 is a schematic diagram illustrating an internal configuration of a computer device according to an exemplary embodiment, which includes a processor, a memory, and a network interface connected through a system bus, as shown in fig. 5. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a text information extraction method.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing devices to which aspects of the present invention may be applied, and that a particular computing device may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
As a preferred implementation manner, in an embodiment of the present invention, the computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the computer program:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;
and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:
and denoising the key value pair information, and storing the denoised key value pair information into a database.
As a preferred implementation manner, in the embodiment of the present invention, the extraction rule includes a regular expression.
In an embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;
and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:
and denoising the key value pair information, and storing the denoised key value pair information into a database.
As a preferred implementation manner, in the embodiment of the present invention, the extraction rule includes a regular expression.
In summary, the technical solution provided by the embodiment of the present invention has the following beneficial effects:
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, a text to be extracted and an extraction rule corresponding to the text to be extracted are obtained, wherein the extraction rule comprises an extraction field; the chapter position of each piece of directory information in the file directory within the text to be extracted is determined according to the file directory of the text to be extracted, and chapter information is generated; the chapter information is divided according to a preset rule to generate a corresponding division list; and key value pair information corresponding to the text to be extracted is generated according to the division list and the extraction rule and stored into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field. On one hand, this improves the efficiency of text extraction, avoids problems such as omitted or erroneous extraction, and improves the accuracy of text extraction; on the other hand, splitting the long text avoids the unbounded backtracking that regular expression matching may otherwise encounter, which increases the fault tolerance of the code and reduces the overall running time;
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, paragraph division is performed on each chapter information according to preset paragraph characteristics, corresponding paragraph lists are respectively generated, sentence division is performed on each paragraph in each paragraph list according to preset sentence characteristics, corresponding sentence lists are respectively generated, and the text is accurately positioned to the chapter, paragraph and sentence levels in a directory hierarchical positioning mode, so that relevant information in the text to be extracted is accurately positioned and extracted;
according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, noise reduction processing is performed on the key value pair information, the key value pair information after noise reduction processing is stored in the database, the key value pair information extracted from the filtered text is further screened, and the accuracy of information extraction in the long text is effectively improved.
It should be noted that: the text information extraction device provided in the foregoing embodiment is only illustrated by dividing the functional modules when triggering the extraction service, and in practical applications, the function allocation may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the text information extraction device and the text information extraction method provided by the above embodiment belong to the same concept, that is, the device is based on the text information extraction method, and the specific implementation process thereof is described in the method embodiment and is not described herein again.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A text information extraction method is characterized by comprising the following steps:
acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;
dividing the chapter information according to a preset rule to generate a corresponding division list;
and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
2. The method of claim 1, wherein the division list includes a paragraph list and a sentence list, and the dividing the chapter information according to a preset rule to generate the corresponding division list includes:
performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;
and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.
3. The method according to claim 1 or 2, wherein when the target information corresponding to the extracted field is long text information, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:
determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;
searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
4. The method according to claim 1 or 2, wherein when the target information corresponding to the extracted field is short text information, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:
carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;
and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.
5. The method of claim 2, wherein when the extraction field is a state change, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:
and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.
6. The text information extraction method according to claim 1 or 2, wherein before storing the key-value pair information in the database, the method further comprises:
and denoising the key value pair information, and storing the denoised key value pair information into a database.
7. The text information extraction method according to claim 1 or 2, wherein the extraction rule includes a regular expression.
8. A text information extraction apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;
the chapter acquisition module is used for determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted and generating chapter information;
the data dividing module is used for dividing the chapter information according to a preset rule to generate a corresponding division list;
and the information generation module is used for generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
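The steps recited in claims 1 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. All function names, the sample text, the sentence-splitting rule, the whitespace-based denoising and the use of SQLite as the database are assumptions made for illustration only; the claims require only that the extraction rule include an extraction field and, per claim 7, a regular expression.

```python
import re
import sqlite3

# Hypothetical sample input: body text, its file directory, and extraction rules.
BODY = (
    "1 Parties\n"
    "The lender is  Acme Bank ; the borrower is Jane Doe.\n"
    "2 Interest\n"
    "The annual interest rate is 5.5% .\n"
)
DIRECTORY = ["1 Parties", "2 Interest"]
# Each extraction field (the key) maps to a regular expression with one capture group.
RULES = {
    "lender": r"lender is\s+([^;.]+)",
    "interest_rate": r"interest rate is\s+([\d.]+\s*%)",
}


def locate_chapters(text, directory):
    """Find where each directory title occurs and return {title: chapter body}."""
    positions, cursor = [], 0
    for title in directory:
        idx = text.find(title, cursor)
        if idx == -1:
            continue  # title absent from the body: skip it
        positions.append((title, idx))
        cursor = idx + len(title)
    chapters = {}
    for i, (title, start) in enumerate(positions):
        end = positions[i + 1][1] if i + 1 < len(positions) else len(text)
        chapters[title] = text[start + len(title):end].strip()
    return chapters


def split_sentences(chapter_text):
    """Assumed preset rule: split after '.' or ';' followed by whitespace, or at newlines."""
    parts = re.split(r"(?<=[.;])\s+|\n+", chapter_text)
    return [p.strip() for p in parts if p.strip()]


def extract_pairs(sentences, rules):
    """Apply each field's regular expression; the first matching sentence supplies the value."""
    pairs = {}
    for field, pattern in rules.items():
        regex = re.compile(pattern)
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                pairs[field] = match.group(1)
                break
    return pairs


def denoise(pairs):
    """Assumed denoising: collapse whitespace, trim stray punctuation, drop empty values."""
    cleaned = {k: " ".join(v.split()).strip(" ,;:.") for k, v in pairs.items()}
    return {k: v for k, v in cleaned.items() if v}


def store(pairs, conn):
    """Persist the key-value pairs; SQLite stands in for the unspecified database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted (field TEXT PRIMARY KEY, value TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO extracted VALUES (?, ?)", pairs.items())
    conn.commit()


def run(body, directory, rules, conn):
    """Chapter location, division, extraction, denoising and storage in sequence."""
    chapters = locate_chapters(body, directory)
    sentences = [s for ch in chapters.values() for s in split_sentences(ch)]
    pairs = denoise(extract_pairs(sentences, rules))
    store(pairs, conn)
    return pairs
```

Calling run(BODY, DIRECTORY, RULES, sqlite3.connect(":memory:")) on the sample input yields one key-value pair per extraction field, with the field name as the key and the cleaned matched text as the value.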
CN202110182750.5A 2021-02-09 2021-02-09 Text information extraction method and device, computer equipment and storage medium Pending CN112989795A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110182750.5A CN112989795A (en) 2021-02-09 2021-02-09 Text information extraction method and device, computer equipment and storage medium
CA3148074A CA3148074A1 (en) 2021-02-09 2022-02-08 Text information extracting method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110182750.5A CN112989795A (en) 2021-02-09 2021-02-09 Text information extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112989795A true CN112989795A (en) 2021-06-18

Family

ID=76392949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182750.5A Pending CN112989795A (en) 2021-02-09 2021-02-09 Text information extraction method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112989795A (en)
CA (1) CA3148074A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117708308B (en) * 2024-02-06 2024-05-14 四川蓉城蕾茗科技有限公司 RAG natural language intelligent knowledge base management method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294595A (en) * 2016-07-29 2017-01-04 海尔优家智能科技(北京)有限公司 A kind of document storage, search method and device
CN107729481A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 The Text Information Extraction result screening technique and device of a kind of custom rule
CN109582772A (en) * 2018-11-27 2019-04-05 平安科技(深圳)有限公司 Contract information extracting method, device, computer equipment and storage medium
CN111522531A (en) * 2020-04-16 2020-08-11 北京奇艺世纪科技有限公司 File checking method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CA3148074A1 (en) 2022-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination