CN112989795A

CN112989795A - Text information extraction method and device, computer equipment and storage medium

Info

Publication number: CN112989795A
Application number: CN202110182750.5A
Authority: CN
Inventors: 孟泽洋
Original assignee: Suning Financial Technology Nanjing Co Ltd
Current assignee: Suning Financial Technology Nanjing Co Ltd
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2021-06-18
Also published as: CA3148074A1

Abstract

The invention discloses a text information extraction method, a text information extraction device, computer equipment and a storage medium. The method comprises: obtaining a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field; determining, according to the file directory of the text to be extracted, the chapter position in the text of each piece of directory information in the file directory, and generating chapter information; dividing the chapter information according to a preset rule to generate a corresponding division list; and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information in a database. On the one hand, the method improves the efficiency of text extraction, avoids problems such as omissions and errors in information extraction, and improves the accuracy of text extraction. On the other hand, splitting the long text avoids the infinite backtracking that regular matching may encounter, increases the fault tolerance of the code, and reduces overall running time.

Description

Text information extraction method and device, computer equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for extracting text information, a computer device, and a storage medium.

Background

Publicly disclosed texts in the financial field are often very lengthy, for example common fund prospectuses, contract announcements, and similar documents. They typically aggregate a wide mix of information and run to hundreds of pages. For fund information extraction tasks, the common industry practice is either to have operation and maintenance staff copy and extract the information manually, or to extract it with simple regular expressions.

However, these conventional approaches have obvious disadvantages. Purely manual extraction involves a very large workload and many repetitive operations, so it is inefficient and has a high labor cost. Simple regular expression extraction may miss information; in particular, when an announcement is very long, similar content in different chapters often causes extraction errors, and a large amount of manual checking and verification is then required. In addition, because fund issuers do not follow uniform wording conventions, descriptions of fund state changes are usually merged or abbreviated to varying degrees, which also causes regular expression extraction to fail.

In view of the foregoing, it is desirable to provide a new method for extracting long text information to solve the above problems.

Disclosure of Invention

In order to solve the problems in the prior art, such as large workload, low efficiency, high labor cost, and frequent omissions and errors, embodiments of the present invention provide a method and an apparatus for extracting long text information, a computer device, and a storage medium.

In order to solve one or more technical problems, the invention adopts the technical scheme that:

in a first aspect, a method for extracting long text information is provided, which includes the following steps:

acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;

determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information;

dividing the chapter information according to a preset rule to generate a corresponding division list;

and generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.

Further, the division list includes a paragraph list and a sentence list, and dividing the chapter information according to a preset rule to generate a corresponding division list includes:

performing paragraph division on each chapter information according to preset paragraph characteristics to respectively generate corresponding paragraph lists;

and carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics to respectively generate corresponding sentence lists.

Further, when the target information corresponding to the extracted field is long text information, the generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database includes:

determining a first paragraph or a first sentence in which the extraction field is located in the division list, and determining a second paragraph adjacent to the first paragraph or a second sentence adjacent to the first sentence;

searching the first paragraph and the second paragraph or the first sentence and the second sentence by adopting a preset search rule, and determining target information corresponding to the extraction field;

and generating key value pair information corresponding to the text to be extracted according to the extraction field and the target information, and storing the key value pair information in a database.

Further, when the target information corresponding to the extracted field is short text information, the generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database includes:

carrying out target detection processing on the sentences in the division list to acquire target information corresponding to the extraction field;

Further, when the extraction field is a state change, generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information in a database includes:

and acquiring service state change information in the sentence list according to the extraction rule, and generating key value pair information corresponding to the text to be extracted according to the service state change information and the extraction field and storing the key value pair information in a database.

Further, before storing the key-value pair information in the database, the method further comprises:

and denoising the key value pair information, and storing the denoised key value pair information into a database.

Further, the extraction rule includes a regular expression.

In a second aspect, there is provided a text information extraction apparatus, the apparatus including:

the data acquisition module is used for acquiring a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field;

the chapter acquisition module is used for determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted and generating chapter information;

the data dividing module is used for dividing the chapter information according to a preset rule to generate a corresponding division list;

and the information generation module is used for generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.

In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the following steps are implemented:

In a fourth aspect, there is provided a computer readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, a text to be extracted and a corresponding extraction rule comprising an extraction field are obtained; the chapter position in the text of each piece of directory information is determined according to the file directory of the text to be extracted, and chapter information is generated; the chapter information is divided according to a preset rule to generate a corresponding division list; and key value pair information corresponding to the text to be extracted is generated according to the division list and the extraction rule and stored in a database, wherein the key comprises the extraction field, and the value comprises the division list and the target information corresponding to the extraction field. On the one hand, this improves the efficiency of text extraction, avoids problems such as omissions and errors in information extraction, and improves the accuracy of text extraction; on the other hand, splitting the long text avoids the infinite backtracking that regular matching may encounter, increases the fault tolerance of the code, and reduces overall running time;

according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, paragraph division is performed on each chapter information according to preset paragraph characteristics, corresponding paragraph lists are respectively generated, sentence division is performed on each paragraph in each paragraph list according to preset sentence characteristics, corresponding sentence lists are respectively generated, and the text is accurately positioned to the chapter, paragraph and sentence levels in a directory hierarchical positioning mode, so that relevant information in the text to be extracted is accurately positioned and extracted;

according to the text information extraction method, the text information extraction device, the computer equipment and the storage medium, noise reduction processing is performed on the key value pair information, the key value pair information after noise reduction processing is stored in the database, the key value pair information extracted from the filtered text is further screened, and the accuracy of information extraction in the long text is effectively improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram illustrating a method for extracting information from a long fund announcement text, according to an example embodiment;

FIG. 2 is a flow diagram illustrating a method for information extraction of a fund state change, according to an example embodiment;

FIG. 3 is a flow diagram illustrating a method of textual information extraction, according to an example embodiment;

FIG. 4 is a schematic structural diagram illustrating a text information extraction apparatus according to an exemplary embodiment;

FIG. 5 is a schematic diagram of an internal structure of a computer device shown in accordance with an example embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As described in the background, information disclosure texts such as common fund prospectuses and fund contracts typically aggregate a wide mix of information and run to hundreds of pages. Therefore, the information extraction workload for such texts is very large, and problems such as omissions and errors occur easily.

In order to solve the above problems, the embodiment of the present invention provides a text information extraction method. Starting from the document structure of the text to be extracted, the method locates the text precisely at chapter, paragraph, and sentence level through hierarchical directory positioning. To extract general information, it takes the sentences and paragraphs as input data, automatically detects the information to be extracted through a multi-rule scheme, and performs denoising and calibration to obtain the corresponding key value pair information. Similarly, to extract service-related state change information, it layers state-descriptive words by part of speech and extracts a state change list through [action-service] combinations. This guarantees the accuracy of information extraction and avoids problems such as omissions and errors; in addition, splitting the long text avoids the infinite backtracking that regular matching may encounter, increases the fault tolerance of the code, and reduces overall running time.

Example one

Specifically, as shown in FIG. 1, taking fund disclosure texts as an example, the process of extracting information from a long fund announcement text using the method includes:

Step one, obtaining an original long text sequence from which information is to be extracted, wherein the original long text sequence comprises a fund announcement long text;

specifically, the text to be extracted mainly includes disclosure texts such as fund prospectuses and fund contracts obtained from official websites that publish disclosure information. It should be noted that these disclosure texts are only examples and do not limit the embodiment of the present invention; the method provided in the embodiment of the present invention may also be applied to information extraction from other long texts with a fixed directory structure.

Step two, configuring an extraction rule for extracting information from the announcement long text;

specifically, this step supplies the extraction rules used by the subsequent steps. The extraction rules include, but are not limited to, regular expressions defined in configuration files and externally supplied manual rules. Several regular expressions for the same extraction field can be used in combination. The externally supplied manual rules mainly configure information such as the fields the user needs to extract; in a specific implementation they can be imported as a table file or configured through a back-end operation and maintenance platform. It should be noted that, in the embodiment of the present invention, combining multiple rules in this way can effectively improve the efficiency and accuracy of information extraction from long texts.
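The rule configuration described in step two might look like the following minimal sketch. It assumes a plain Python dictionary as the configuration format, which the source does not specify; the field names, patterns and helper functions are hypothetical. Several patterns for the same field are tried in order, as one possible reading of using regular expressions for a field in combination.

```python
import re

# Hypothetical rule configuration: each extraction field maps to one or more
# regular expressions. Field names and patterns are illustrative only.
EXTRACTION_RULES = {
    "fund_name": [r"Fund name[:：]\s*(.+)", r"Name of the fund[:：]\s*(.+)"],
    "fund_manager": [r"Fund manager[:：]\s*(.+)"],
}


def compile_rules(rules):
    """Compile every pattern once so the rules can be reused across texts."""
    return {field: [re.compile(p) for p in patterns]
            for field, patterns in rules.items()}


def apply_rules(compiled, sentence):
    """Return {field: value} using the first matching pattern of each field."""
    found = {}
    for field, patterns in compiled.items():
        for pattern in patterns:
            match = pattern.search(sentence)
            if match:
                found[field] = match.group(1).strip()
                break  # later patterns act only as fallbacks
    return found


compiled_rules = compile_rules(EXTRACTION_RULES)
result = apply_rules(compiled_rules, "Name of the fund: Example Growth Fund")
```

Keeping the rules as data rather than code is what would allow them to be imported from a file or edited on an operations platform without redeploying the extractor.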

Step three, locating the chapter in which each piece of directory information appears according to the file directory of the announcement long text, and generating chapter information;

specifically, the method provided by the embodiment of the present invention mainly processes long texts with a fixed directory structure, in which the directory information is usually the title of each chapter. As a preferred example, to locate the chapter in which a piece of directory information appears, the directory information may be used as an extraction field, and the chapter containing that field is automatically located and extracted by a preset chapter-locating filter function to generate the corresponding chapter information, which includes the title and the full content of the chapter. In a specific implementation, the chapter blocks of a Chinese document can be located with regular expressions.
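Chapter location by directory title can be sketched as follows. This is an illustration only, not the patented filter function: it assumes each chapter title occupies a line of its own in the body, and that when a title occurs more than once the last occurrence is the body heading (earlier ones being table-of-contents entries). The function name `locate_chapters` is hypothetical.

```python
import re


def locate_chapters(text, directory_titles):
    """Split `text` into chapters keyed by the titles listed in its directory.

    A chapter runs from its title line to the start of the next located title.
    """
    positions = []
    for title in directory_titles:
        # Anchor to a full line so in-text mentions of the title are ignored.
        pattern = re.compile(r"^[ \t]*" + re.escape(title) + r"[ \t]*$",
                             re.MULTILINE)
        matches = list(pattern.finditer(text))
        if matches:
            # Assumption: the last full-line occurrence is the body heading.
            positions.append((matches[-1].start(), title))
    positions.sort()

    chapters = {}
    for index, (start, title) in enumerate(positions):
        if index + 1 < len(positions):
            end = positions[index + 1][0]
        else:
            end = len(text)
        chapters[title] = text[start:end].strip()
    return chapters


sample = (
    "Contents\n"
    "1. Fund Overview\n"
    "2. Fees\n"
    "\n"
    "1. Fund Overview\n"
    "The fund invests in bonds.\n"
    "2. Fees\n"
    "The management fee is 1% per year.\n"
)
chapters = locate_chapters(sample, ["1. Fund Overview", "2. Fees"])
```

Each resulting chapter contains its title and full content, so later steps only ever run regular expressions over one chapter at a time rather than over the whole document.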

Step four, dividing the chapter information to generate a corresponding paragraph list and a corresponding sentence list;

specifically, the chapter information generated in the above step is further refined into paragraph text blocks and sentence text blocks, and a paragraph list and a sentence list are generated respectively. In a specific implementation, the text of a chapter is divided into paragraphs according to paragraph features. Features of a Chinese paragraph include, but are not limited to, blank space at the end of the paragraph's last line and indentation at the beginning of its first line. The paragraphs so generated are then divided into sentences according to sentence features, which include, but are not limited to, sentence-ending symbols such as periods and exclamation marks.
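A minimal sketch of the two-level division, assuming paragraphs are separated by a blank line or by a line break followed by indentation, and sentences end at common Chinese or Western terminators. The function names and the exact feature set are assumptions; a real implementation would need extra care with decimals and abbreviations, which this naive split on "." would break.

```python
import re


def split_paragraphs(chapter_text):
    """Split on blank lines, or on a line break followed by indentation."""
    parts = re.split(r"\n\s*\n|\n(?=[ \t\u3000]+\S)", chapter_text)
    return [part.strip() for part in parts if part.strip()]


def split_sentences(paragraph):
    """Split after each sentence-ending symbol, keeping the symbol."""
    pieces = re.findall(r"[^。！？!?.]+[。！？!?.]*", paragraph)
    return [piece.strip() for piece in pieces if piece.strip()]


chapter = (
    "The fund is open-ended. It invests in bonds!\n"
    "\n"
    "  Fees are charged monthly. Redemption is free."
)
paragraph_list = split_paragraphs(chapter)
sentence_lists = [split_sentences(p) for p in paragraph_list]
```

Because every sentence is short, any regular expression applied afterwards works on a bounded input, which is the property the document relies on to limit backtracking.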

Step five, extracting information from the paragraph list and the sentence list according to the extraction rule configured in step two, and acquiring key value pair information corresponding to the announcement long text;

specifically, the key in the key value pair information is an extraction field defined in the extraction rule, and the value is the related information extracted for that field from the paragraph list and the sentence list according to the extraction rule. Different information may be extracted in different ways. For example, in a fund profile some values are descriptive text, which may be one sentence, several sentences, or a paragraph; in that case adjacent sentences (or paragraphs) are included in the search. For example, when the paragraph or sentence containing the extraction field ends with a colon, the content after the colon is usually the information to be extracted, so the paragraph or sentence following the colon is taken as the target information. When the value to be extracted is a specific type of short text, for example a date, it is usually embedded in a sentence, and the relevant information in the sentence may be extracted by target detection.
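The colon rule for descriptive values can be illustrated with a short sketch. The function name `extract_long_text` and the sample fields are hypothetical, and only the following sentence is consulted as the "adjacent" unit; the source leaves the exact search rule open.

```python
def extract_long_text(sentences, field):
    """Find the sentence that mentions `field` and return its target text.

    If that sentence ends with a colon, the value is assumed to be the
    following sentence; otherwise the text after the field name is returned.
    """
    for index, sentence in enumerate(sentences):
        if field not in sentence:
            continue
        stripped = sentence.rstrip()
        if stripped.endswith((":", "：")):
            # The value lives in the adjacent (following) sentence, if any.
            if index + 1 < len(sentences):
                return sentences[index + 1].strip()
            return None
        _, _, tail = stripped.partition(field)
        return tail.lstrip(" :：").strip() or None
    return None


sentence_list = [
    "Investment objective:",
    "The fund seeks long-term capital growth.",
    "Fund manager: Example Asset Management",
]
objective = extract_long_text(sentence_list, "Investment objective")
manager = extract_long_text(sentence_list, "Fund manager")
```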

Step six, performing noise reduction processing on the key value pair information to obtain the processed key value pair information.

Specifically, a series of output values (i.e., key-value pair information) may be obtained through the previous steps. Although the segment corresponding to the information is accurately located, the output values may contain some noise and even some extraction confusion. In order to solve the problem, in the embodiment of the invention, a numerical noise reduction filter is introduced, and redundant or unreasonable results are further purified. The denoising process includes, but is not limited to, value type checking (for intra-sentence value cleaning), value truncation extraction (for inter-sentence information), and the like, which are not described herein again.
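The two named checks, value type checking and value truncation, might be sketched as below. The expected types, the field names and the length limit are assumptions made for illustration; the source does not describe the filter's internals.

```python
import re

# Hypothetical expected type per field; fields not listed are free text.
FIELD_TYPES = {"inception_date": "date", "management_fee": "percent"}

TYPE_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "percent": re.compile(r"\d+(?:\.\d+)?%"),
}

MAX_TEXT_LENGTH = 200  # assumed cut-off for free-text values


def denoise(pairs):
    """Drop values that fail their type check; truncate free-text values."""
    cleaned = {}
    for field, value in pairs.items():
        expected = FIELD_TYPES.get(field)
        if expected is None:
            # Truncation: keep only the first line, capped in length.
            text = value.split("\n", 1)[0].strip()[:MAX_TEXT_LENGTH]
            if text:
                cleaned[field] = text
            continue
        match = TYPE_PATTERNS[expected].search(value)
        if match:
            cleaned[field] = match.group(0)  # strip surrounding noise
    return cleaned


raw_pairs = {
    "inception_date": "established on 2020-03-15 (see note)",
    "management_fee": "to be determined",
    "fund_name": "Example Growth Fund\nChapter 2 Fees",
}
clean_pairs = denoise(raw_pairs)
```

Values that fail the check are dropped rather than guessed, which leaves them visible as gaps for the manual review in the next step.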

Step seven, manually reviewing and verifying the denoised key value pair information, and storing the key value pair information that passes the review in a database.

Specifically, the information after manual review and verification can be used as fund basic information, a series of fund diagnosis and screening bases are provided, and data support is provided for internal and external platforms.

Specifically, in a concrete implementation, the above steps can be deployed as a PySpark big data task on a pre-built big data cloud platform and used for the daily incremental fund information extraction task, with the output stored in a Hive table, so that a long text can be parsed and explored in near real time, within a few minutes.

Specifically, as shown in fig. 2, taking a state change announcement text of the fund as an example, the embodiment of the present invention further provides an information extraction method for fund state change, which includes:

Step R0: extracting a corresponding sentence list from the state change announcement text of the fund, analyzing the extracted sentence list together with the title of the announcement text, and acquiring an analysis result.

Specifically, for the process of extracting the sentence list from the state change announcement text of the fund, refer to steps one to four above, which are not repeated here. It should also be noted that the state change announcement text of the fund is only an example and does not limit the embodiment of the present invention; the state change information extraction method provided in the embodiment of the present invention may also be applied to information extraction from other long texts with a fixed directory structure.

Specifically, since the titles of some announcement texts also contain information to be extracted, the title needs to be considered together with the body when analyzing state changes. For example, if the title of an announcement is "Announcement on suspending large-amount purchase, regular fixed-amount investment and transfer-in services of a certain money market fund", then "suspending large-amount purchase, regular fixed-amount investment and transfer-in services" in the title is also information to be extracted.

Step R1: performing action extraction on the analysis result to acquire action information;

specifically, in the embodiment of the present invention, the action information includes the action-type words appearing in the announcement text and its title. Action-type words are classified mainly by part of speech; those related to service state changes that are common in the financial field include open, suspend, resume, and restrict.
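Action extraction over the title and sentence list can be sketched as follows. The source classifies words by part of speech on Chinese text; this illustration substitutes a naive English stem match, so the pattern table and the function name `extract_actions` are assumptions. Note that a stem such as "open" would also match "open-ended", which a part-of-speech approach would avoid.

```python
import re

# Illustrative action vocabulary, matched by stem to tolerate inflections.
ACTION_PATTERNS = {
    "open": re.compile(r"\bopen\w*", re.IGNORECASE),
    "suspend": re.compile(r"\bsuspend\w*", re.IGNORECASE),
    "resume": re.compile(r"\bresum\w*", re.IGNORECASE),
    "restrict": re.compile(r"\brestrict\w*", re.IGNORECASE),
}


def extract_actions(title, sentences):
    """Return the actions found in the title and body, by first appearance."""
    text = " ".join([title, *sentences])
    hits = []
    for action, pattern in ACTION_PATTERNS.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), action))
    return [action for _, action in sorted(hits)]


notice_title = "Announcement on suspending large-amount purchase of a money market fund"
notice_body = ["Purchase will resume on a date to be announced."]
actions = extract_actions(notice_title, notice_body)
```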

Step R2: extracting the service from the analysis result to obtain service information;

specifically, in the embodiment of the present invention, the service information includes the service nouns appearing in the announcement text and its title, for example purchase, redemption, commitment, conversion, and transfer-in. Since service changes involve both state and amount, the embodiment of the invention also combines service nouns with certain modifiers to obtain new service terms, for example "large-amount redemption". In addition, services are often referred to by common phrases, aliases and abbreviations, which are uniformly replaced with standard terms in this step.
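Service extraction with alias replacement and modifier combination might look like this sketch. The alias table, service list and modifier list are hypothetical English stand-ins for the domain vocabulary, and `extract_services` is an assumed name.

```python
import re

# Hypothetical vocabulary: aliases are rewritten to one canonical term first.
ALIASES = {"subscription": "purchase", "switch-in": "transfer-in"}
SERVICES = ["purchase", "redemption", "conversion", "transfer-in"]
MODIFIERS = ["large-amount"]


def normalize(text):
    """Lower-case the text and replace known aliases by canonical terms."""
    lowered = text.lower()
    for alias, canonical in ALIASES.items():
        lowered = lowered.replace(alias, canonical)
    return lowered


def extract_services(text):
    """Return service terms by order of appearance, with any leading modifier."""
    modifier = "|".join(re.escape(m) for m in MODIFIERS)
    service = "|".join(re.escape(s) for s in SERVICES)
    pattern = re.compile(rf"(?:({modifier})\s+)?({service})")
    found = []
    for match in pattern.finditer(normalize(text)):
        term = " ".join(part for part in match.groups() if part)
        if term not in found:
            found.append(term)
    return found


services = extract_services(
    "Suspension of large-amount subscription and switch-in; redemption is unaffected"
)
```

Normalizing aliases before matching keeps the service list short and means each downstream state is expressed with one canonical term.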

Step R3: generating state change information according to the action information and the service information;

specifically, the action and service terms extracted in the above steps are permuted and combined, and the combinations are matched against the enumerated state change values to obtain a complete change list (i.e., the state change information).
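The combination step can be sketched with a Cartesian product filtered by an enumeration. The enumeration contents and the name `combine_states` are assumptions for illustration.

```python
from itertools import product

# Hypothetical enumeration of the state changes the system recognises.
STATE_ENUM = {
    ("suspend", "purchase"),
    ("suspend", "redemption"),
    ("suspend", "large-amount purchase"),
    ("resume", "purchase"),
    ("resume", "redemption"),
}


def combine_states(actions, services):
    """Pair every action with every service and keep only enumerated states."""
    return [
        f"{action} {service}"
        for action, service in product(actions, services)
        if (action, service) in STATE_ENUM
    ]


changes = combine_states(["suspend"], ["purchase", "redemption", "dividend"])
```

With one action and several services, the product yields one state per service, so an abbreviated phrase such as "suspend purchase and redemption" expands into both of its states, while combinations outside the enumeration are discarded.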

Step R4: and verifying the state change information, and storing the state change information in a database after the state change information passes the verification.

Specifically, the state change information may also be stored in the database as key value pairs; in a specific implementation, the "state change" field is used as the key, and the extracted state change information is used as the value.
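Key value pair storage can be illustrated with an in-memory SQLite database standing in for the production store (the document mentions a Hive table elsewhere). The table layout, the JSON encoding of values and the document identifier are assumptions.

```python
import json
import sqlite3


def store_pairs(connection, document_id, pairs):
    """Persist {field: value} pairs, one row per field, values as JSON."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS extracted ("
        "document_id TEXT, field TEXT, value TEXT, "
        "PRIMARY KEY (document_id, field))"
    )
    rows = [
        (document_id, field, json.dumps(value, ensure_ascii=False))
        for field, value in pairs.items()
    ]
    # INSERT OR REPLACE makes a re-run of the same document idempotent.
    connection.executemany(
        "INSERT OR REPLACE INTO extracted VALUES (?, ?, ?)", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")
store_pairs(
    connection,
    "notice-001",
    {"state_change": ["suspend purchase", "suspend redemption"]},
)
stored = connection.execute(
    "SELECT value FROM extracted WHERE document_id = ? AND field = ?",
    ("notice-001", "state_change"),
).fetchone()[0]
```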

Specifically, in the embodiment of the present invention, the action and the service of a state change are extracted separately, because in long fund disclosure texts the action or the service in a state description is frequently omitted (for example, the state change "suspended purchase and redemption" includes both suspended purchase and suspended redemption). Splitting them in this way effectively alleviates, or even avoids, incomplete extraction caused by misplaced or omitted information.

Example two

Fig. 3 is a flowchart illustrating a text information extraction method according to an exemplary embodiment, and referring to fig. 3, the method includes the steps of:

S1: obtaining a text to be extracted and an extraction rule corresponding to the text to be extracted, wherein the extraction rule comprises an extraction field.

Specifically, the text to be extracted includes, but is not limited to, long texts with a fixed directory structure such as fund prospectuses and fund contracts. It should be noted that the information extraction method provided by the embodiment of the present invention may also be applied to other long texts whose structure and style are relatively standardized. The extraction rules comprise regular expressions defined in configuration files and custom rules; the custom rules mainly configure information such as the fields the user needs to extract and can be adjusted according to the user's actual requirements, so as to meet different information extraction needs. Combining multiple rules in this way can effectively improve the efficiency and accuracy of information extraction from long texts.

S2: determining the chapter position of each directory information in the file directory in the text to be extracted according to the file directory of the text to be extracted, and generating chapter information.

Specifically, some documents have a relatively fixed template structure, such as a directory structure. In the embodiment of the invention, the file directory of the text to be extracted is used to locate the text precisely at chapter, paragraph and sentence level through hierarchical directory positioning, in preparation for subsequent information extraction. When locating chapters, the directory information (usually the title of each chapter) can be used as an extraction field; the chapter containing that field is automatically located and extracted with regular expressions, and the corresponding chapter information is generated.

S3: dividing the chapter information according to a preset rule to generate a corresponding division list;

specifically, in order to improve the accuracy of information extraction, after the chapters of the text to be extracted are located, the chapter information needs to be further subdivided, the chapter information corresponding to each chapter is divided into paragraphs, the paragraphs are then sequentially divided into sentences, and a paragraph list and a sentence list are respectively generated according to the division results for use in the subsequent steps. The specific dividing process may refer to the content recorded in the relevant steps in the first embodiment, and is not described herein again.

S4: generating key value pair information corresponding to the text to be extracted according to the division list and the extraction rule, and storing the key value pair information into a database, wherein the key comprises an extraction field, and the value comprises the division list and target information corresponding to the extraction field.

Specifically, information extraction is carried out on the obtained division list according to the extraction field and the extraction rule, information needing to be extracted is obtained, then key value pair information is generated by the extraction field and the extracted information, and the key value pair information is stored in a database.

As a preferred implementation manner, in an embodiment of the present invention, the dividing list includes a paragraph list and a sentence list, and the dividing the chapter information according to a preset rule to generate a corresponding dividing list includes:

Specifically, the paragraph division and sentence division process can refer to the content recorded in the relevant steps in the first embodiment, and will not be described herein again.

As a preferred implementation manner, in an embodiment of the present invention, when target information corresponding to an extraction field is long text information, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:

Specifically, when the target information corresponding to the extraction field is long text information, for example descriptive text in fund profile information, the information may be one sentence, several sentences, or a paragraph, and adjacent sentences (or paragraphs) may be included in the search, that is, the neighboring sentences or paragraphs are also taken into consideration. For example, when the paragraph or sentence containing the extraction field ends with a colon, the content after the colon is usually the information to be extracted, so the paragraph or sentence following the colon is taken as the target information.

As a preferred implementation manner, in the embodiment of the present invention, when the target information corresponding to the extracted field is short text information, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:

Specifically, when the target information corresponding to the extraction field is short text information, for example a date, the information is usually embedded in a sentence together with other content. In this case, the relevant information contained in the sentence may be extracted in a target detection manner, that is, by locating the target item within the sentence.
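As one possible illustration of short text extraction, the sketch below locates dates inside a sentence with a regular expression. The date formats, the pattern, and the normalised output format are assumptions made for the example; the embodiment does not specify them.

```python
import re

# Assumed date formats: 2021-02-09, 2021/2/9, or 2021年2月9日.
DATE_PATTERN = re.compile(r"(\d{4})(?:-|/|年)(\d{1,2})(?:-|/|月)(\d{1,2})日?")


def extract_dates(sentence):
    """Return every date found in the sentence, normalised to YYYY-MM-DD."""
    return [
        f"{year}-{int(month):02d}-{int(day):02d}"
        for year, month, day in DATE_PATTERN.findall(sentence)
    ]
```

The pattern contains no nested quantifiers, so matching it against a single sentence does not involve heavy backtracking.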

As a preferred implementation manner, in an embodiment of the present invention, when the extraction field is state change information, the generating, according to the division list and the extraction rule, key-value pair information corresponding to the text to be extracted and storing the key-value pair information in a database includes:

Specifically, the process of extracting the state change information may refer to the content of the process of extracting the fund state change information in the first embodiment, and details are not described here.

As a preferred implementation manner, in an embodiment of the present invention, before storing the key-value pair information in the database, the method further includes:

Specifically, to improve the accuracy of information extraction, the key-value pair information generated in the above steps is further filtered. In a specific implementation, noise reduction processing may be applied to the key-value pair information to remove redundant or unreasonable results.
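The embodiment does not specify the noise reduction criteria. The sketch below assumes three simple ones (empty values, overlong values, and duplicates) purely for illustration; the function name and the `max_length` threshold are hypothetical.

```python
def denoise(pairs, max_length=500):
    """Filter candidate values for each extraction field.

    `pairs` maps each field to a list of candidate values. Empty values,
    values longer than `max_length`, and duplicates are dropped (assumed
    criteria); fields left without any candidate are removed.
    """
    cleaned = {}
    for field, candidates in pairs.items():
        kept = []
        for value in candidates:
            value = " ".join(value.split())  # collapse runs of whitespace
            if value and len(value) <= max_length and value not in kept:
                kept.append(value)
        if kept:
            cleaned[field] = kept
    return cleaned
```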

As a preferred implementation manner, in the embodiment of the present invention, the extraction rule includes a regular expression.

Fig. 4 is a schematic structural diagram illustrating a text information extraction apparatus according to an exemplary embodiment, and referring to fig. 4, the apparatus includes:

As a preferred implementation manner, in an embodiment of the present invention, the data dividing module includes:

the paragraph dividing unit is used for carrying out paragraph division on each chapter information according to preset paragraph characteristics and respectively generating corresponding paragraph lists;

and the sentence dividing unit is used for carrying out sentence division on each paragraph in each paragraph list according to preset sentence characteristics and respectively generating corresponding sentence lists.
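The two dividing units can be sketched as below. The preset paragraph feature is assumed to be a line break and the preset sentence feature is assumed to be terminal punctuation; both assumptions, along with all function names, are illustrative only.

```python
import re

# Assumed sentence feature: text up to and including one terminal mark.
SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]?")


def split_paragraphs(chapter_text):
    """Paragraph dividing unit: split chapter information on line breaks."""
    return [p.strip() for p in re.split(r"\n+", chapter_text) if p.strip()]


def split_sentences(paragraph):
    """Sentence dividing unit: split one paragraph on terminal punctuation."""
    return [s.strip() for s in SENTENCE.findall(paragraph) if s.strip()]


def build_division_lists(chapter_text):
    """Return the paragraph list and one sentence list per paragraph."""
    paragraph_list = split_paragraphs(chapter_text)
    sentence_list = [split_sentences(p) for p in paragraph_list]
    return paragraph_list, sentence_list
```

This simple sentence pattern would also split at a decimal point, so a production rule set would need a more careful sentence feature.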

As a preferred implementation manner, in an embodiment of the present invention, the information generating module is specifically configured to:

As a preferred implementation manner, in an embodiment of the present invention, the information generating module is further configured to:

As a preferred implementation manner, in an embodiment of the present invention, the apparatus further includes:

and the noise reduction processing module is used for carrying out noise reduction processing on the key value pair information and storing the key value pair information subjected to noise reduction processing into a database.

Fig. 5 is a schematic diagram illustrating the internal configuration of a computer device according to an exemplary embodiment. As shown in fig. 5, the computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a text information extraction method.

Those skilled in the art will appreciate that the configuration shown in fig. 5 is a block diagram of only the portion of the configuration relevant to aspects of the present invention and does not limit the computer devices to which aspects of the present invention may be applied; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

As a preferred implementation manner, in an embodiment of the present invention, the computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the computer program:

As a preferred implementation manner, in the embodiment of the present invention, when the processor executes the computer program, the following steps are further implemented:

In an embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the following steps:

As a preferred implementation manner, in the embodiment of the present invention, when executed by the processor, the computer program further implements the following steps:

In summary, the technical solution provided by the embodiment of the present invention has the following beneficial effects:

It should be noted that, when the text information extraction device provided in the foregoing embodiment triggers the extraction service, the division into the functional modules described above is only an example. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text information extraction device and the text information extraction method provided by the above embodiments belong to the same concept, that is, the device is based on the text information extraction method; its specific implementation process is described in the method embodiment and is not described herein again.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A text information extraction method is characterized by comprising the following steps:

2. The method of claim 1, wherein the division list includes a paragraph list and a sentence list, and the dividing the chapter information according to a preset rule to generate the corresponding division list includes:

3. The method according to claim 1 or 2, wherein when the target information corresponding to the extracted field is long text information, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:

4. The method according to claim 1 or 2, wherein when the target information corresponding to the extracted field is short text information, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:

5. The method of claim 2, wherein when the extraction field is state change information, the generating key-value pair information corresponding to the text to be extracted according to the division list and the extraction rule and storing the key-value pair information into a database comprises:

6. The text information extraction method according to claim 1 or 2, wherein before storing the key-value pair information in the database, the method further comprises:

7. The text information extraction method according to claim 1 or 2, wherein the extraction rule includes a regular expression.

8. A text information extraction apparatus, characterized in that the apparatus comprises:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.