CN110321432B - Text event information extraction method, electronic device and nonvolatile storage medium - Google Patents

Text event information extraction method, electronic device and nonvolatile storage medium

Info

Publication number
CN110321432B
CN110321432B CN201910548427.8A CN201910548427A
Authority
CN
China
Prior art keywords
text
rule
event information
information extraction
rules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910548427.8A
Other languages
Chinese (zh)
Other versions
CN110321432A (en)
Inventor
乔春庚
江敏
刘瑞宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tols Information Technology Co ltd
Original Assignee
Tols Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tols Information Technology Co ltd filed Critical Tols Information Technology Co ltd
Priority to CN201910548427.8A priority Critical patent/CN110321432B/en
Publication of CN110321432A publication Critical patent/CN110321432A/en
Application granted granted Critical
Publication of CN110321432B publication Critical patent/CN110321432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 - Thesaurus
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information processing. To address the low accuracy of prior-art event information extraction schemes, its first aspect provides a text event information extraction method comprising the following steps: segment the text into words, convert the segmented words into word vectors, feed the word vectors into a neural network model, and output the entities; based on information types defined by text format features, arrange the words and entities in each text block into a structured text block according to the corresponding grammar-defined pattern rules; and extract event information from the structured text blocks, extracting keywords with the corresponding grammar-defined pattern rules and writing them into a result template. The event extraction model thus combines neural-network deep learning with configurable rules, so that text event information is extracted accurately.

Description

Text event information extraction method, electronic device and nonvolatile storage medium
Technical Field
The present invention relates to the field of information processing technologies, specifically to event information extraction in text mining research, and more specifically to a text event information extraction method, an electronic device and a non-volatile storage medium.
Background
Event information extraction is one of the most challenging tasks in text mining research. It aims to use a computer to automatically extract specific types of events and their elements from text, is a key technology in the field of information processing, and is widely applied in information retrieval, automatic question answering, automatic summarization, data mining, text mining and other fields.
Current research and experiments on event information extraction fall mainly into three categories: (1) rule-based text event extraction; typical systems applying such methods include ExDisco, GenPAM and the like. (2) Text event extraction based on trigger-word detection, whose core is detecting trigger words and determining the event elements and their roles; a trigger word is a word that expresses the central meaning of a certain type of event well, for example terms such as "appointment" and "taking office" in a job-change event. (3) Text information extraction based on probabilistic statistical models, for example using a hidden Markov model to extract all fields of the header information of computer research papers.
Although much information extraction research on text uses statistical models, the data fields to be extracted in that research can be regarded as a fairly compact sequence. The expression of events in text does not have this property: the fields to be extracted are scattered and sparse, and some even lie a considerable distance from the centre of the event expression (which can be regarded as the position of the trigger word). The accuracy of such approaches therefore still needs to be improved.
Disclosure of Invention
In order to solve the technical problem of low accuracy in the technical scheme of event information extraction in the prior art, the invention provides a text event information extraction method, an electronic device and a nonvolatile storage medium.
In order to achieve the above object, the technical solution provided by the present invention comprises:
the invention provides a text event information extraction method in a first aspect, which is characterized by comprising the following steps:
preprocessing a text, wherein the preprocessing comprises segmenting the text into words, converting the segmented words into word vectors, inputting the word vectors into a neural network model, and outputting entities through the neural network model;
performing block processing on the text to obtain text blocks, and performing classification and extraction processing on the text blocks, wherein the classification and extraction processing comprises: based on the information types defined by the text format features, arranging the segmented words and entities in each text block into a structured text block according to the corresponding grammar-defined pattern rules;
and performing event information extraction processing on the structured text blocks, wherein the event information extraction processing comprises extracting keywords with the corresponding grammar-defined pattern rules and outputting the keywords to the result template corresponding to the event information extraction.
In a preferred implementation manner of the embodiment of the present invention, when the text is a personal resume, the personal resume is processed by blocks and includes a first text block corresponding to basic information, a second text block corresponding to an educational experience, a third text block corresponding to a work experience, a fourth text block corresponding to a training experience, a fifth text block corresponding to a qualification certificate, and a sixth text block corresponding to a job-seeking intention; the format characteristics respectively comprise information characteristics corresponding to basic information, education experiences, work experiences, training experiences, qualification certificates and job seeking intentions; the grammar-defined pattern rules include judgment rules defined according to lexical analysis, syntactic analysis and semantic analysis in the compilation principle.
In a preferred implementation manner of the embodiment of the present invention, the segmenting and dividing the text includes: in the organization of the core dictionary, a method of a double array trie tree is adopted; aiming at the intersection type word segmentation ambiguity, a method combining rules and statistics is adopted; aiming at the recognition of unknown words, a recognition method based on a conditional random field is adopted.
In a preferred implementation manner of the embodiment of the present invention, the deep neural network model includes an Embedding layer, a bidirectional RNN layer, and a CRF layer; and the Embedding layer performs vector conversion on the participles to obtain word vectors, and the word vectors are sequentially sent to the bidirectional RNN layer to obtain the probability distribution of participle labels, and the probability distribution of the participle labels is sent to the CRF layer to obtain entity label sequences corresponding to entities.
In a preferred implementation of the embodiment of the present invention, the pattern rules are modifiable, and the rule-configured information extraction model can be configured for different application scenarios. The pattern rules carry category information, and the text preprocessing further includes in-line context rule analysis of the text: on the word segmentation and entity recognition results, a predetermined rule correction method is applied to correct the segmentation and to re-identify ambiguous categories.
In a preferred implementation of the embodiment of the present invention, the pattern rules include simplified and merged rules, with complex long rules placed in front and simple short rules placed behind.
In a preferred implementation manner of the embodiment of the present invention, the sorting the participles and the entities in the text block into the structured text block according to the corresponding mode rules includes: and judging whether the continuous lines accord with a specific mode or not by adopting a mode rule sequence, and storing the mode rule result matched with each line into the multi-dimensional array of the character string type after the text accords with the corresponding specific mode.
In a preferred implementation manner of the embodiment of the present invention, the rule configuration information extraction model is expressed and written by using a rule description language NPRDL, and the NPRDL language uses a BNF paradigm; and the description language is a means for describing the syntactic semantic information of the vocabulary based on a complex feature set, and is also described in dynamic analysis by using a dynamic attribute table described based on the complex feature set.
The second aspect of the present invention also provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method as any one of the methods provided in the first aspect.
A third aspect of the present invention also provides a non-volatile storage medium having a computer program stored thereon, wherein the computer program is adapted to perform the steps of any of the methods as provided in the first aspect when executed.
By adopting the technical scheme provided by the invention, the following beneficial effects can be obtained:
1. Word segmentation and entity recognition are performed on the text with a neural network mathematical model, so that the basic elements of the text are obtained quickly. Text block classification information is then extracted in combination with pattern rules, and the words and entities in each text block are arranged into a structured text block according to the corresponding grammar-defined pattern rules, so that the text information is structured, in a form more favourable to information extraction, according to the grammatical expression required by a computer language. Event extraction is performed on top of this text block structuring, which effectively overcomes the problems that the data are scattered and sparse and that the extraction fields lie far from the centre of the event expression, so that text event information is extracted accurately. Moreover, entity recognition uses a deep neural network model, which improves the recognition effect.
2. As a preferred implementation, the pattern rules carry generic (category) information, so that an external customizable generic dictionary can be introduced and the pattern rules become more convenient to use.
3. The basis of the pattern rules is modifiable, for example, written in expression using the rule description language NPRDL, which uses the BNF paradigm; and aiming at different application scenes, the information extraction model can be flexibly and quickly configured.
4. Complex long rules are placed in front and simple short rules behind. Since rule matching proceeds from front to back, this avoids the problem that a line first matches an earlier, simpler rule while a later, longer rule fails because it is never tried.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and/or process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart of a text event information extraction method according to an embodiment of the present invention.
Fig. 2 is a flowchart of text preprocessing in a text event information extraction method according to an embodiment of the present invention.
Fig. 3 is a flowchart of classifying and extracting text blocks in a text event information extraction method according to an embodiment of the present invention.
Fig. 4 is a flowchart of event information extraction in a text event information extraction method according to an embodiment of the present invention.
Fig. 5 is a block diagram of a text event information extraction apparatus according to an embodiment of the present invention.
Fig. 6 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that the detailed description is only for the purpose of making the invention easier and clearer for those skilled in the art, and is not intended to be a limiting explanation of the invention; moreover, as long as there is no conflict, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are all within the scope of the present invention.
Additionally, the steps illustrated in the flow charts of the drawings may be performed in a control system such as a set of controller-executable instructions and, although a logical ordering is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than that illustrated herein.
The technical scheme of the invention is described in detail by the figures and the specific embodiments as follows:
examples
In order to solve the technical problem that the technical scheme of extracting the event information in the prior art is difficult in model establishment or low in accuracy, the embodiment provides an event extraction method and device, a neural network is used for preprocessing a text, then the text is partitioned, and corresponding event information is extracted from different text blocks.
As shown in fig. 1, the present embodiment provides a text event information extraction method, including:
s110, preprocessing the text, wherein the preprocessing comprises word segmentation of the text, vector conversion of the word segmentation to obtain a word vector, inputting the word vector to the neural network model, and outputting an entity through the neural network model.
The text preprocessing in this embodiment further includes, before word segmentation and entity recognition, text format normalization: consecutive spaces are collapsed into one; lines containing several "keyword: slot information" structures are split into separate lines, one per keyword; all full-width characters appearing in the text are converted into half-width characters; and all upper-case English letters are converted into lower-case.
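A minimal sketch of this normalization step is given below; the helper name and the example string are illustrative assumptions rather than part of the patent.

```python
import re

def normalize_text(text: str) -> str:
    """Illustrative text-format normalization: convert full-width characters to
    half-width, lower-case English letters, and collapse runs of spaces.
    (Splitting multi-slot lines is omitted from this sketch.)"""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # full-width space -> ASCII space
            ch = " "
        elif 0xFF01 <= code <= 0xFF5E:      # other full-width forms -> half-width
            ch = chr(code - 0xFEE0)
        out.append(ch)
    text = "".join(out).lower()
    return re.sub(r" {2,}", " ", text)      # keep a single space

print(normalize_text("姓　名：ＡＢＣ   Ｄ"))   # -> "姓 名:abc d"
```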
In a preferred embodiment of this embodiment, the segmenting the text includes: in the organization of the core dictionary, a method of a double array trie tree is adopted; aiming at the intersection type word segmentation ambiguity, a method combining rules and statistics is adopted; aiming at the recognition of unknown words, a recognition method based on a conditional random field is adopted.
As shown in fig. 2, in a preferred embodiment of this embodiment, the deep neural network model includes an Embedding layer, a bidirectional RNN layer, and a CRF layer; and the Embedding layer performs vector conversion on the participles to obtain word vectors, the word vectors are sequentially sent to the bidirectional RNN layer to obtain the probability distribution of participle labels, and the probability distribution of the participle labels is sent to the CRF layer to obtain entity label sequences corresponding to the entities. That is, in the preferred embodiment of this embodiment, a method of combining word segmentation with dictionary word segmentation and statistical word segmentation is adopted:
1) For the organization of the core dictionary, considering the time efficiency and storage efficiency of dictionary lookup as well as the statistical characteristics of Chinese, a double-array trie is adopted. The double-array trie is a simple and effective implementation of the trie tree composed of two integer arrays, base[] and check[]: for an array index i, if base[i] and check[i] are both 0 the position is empty; if base[i] is negative, the state corresponds to a complete word; and check[i] records the state immediately preceding state i, so that a transition on character code a satisfies t = base[i] + a and check[t] = i (see the sketch after this list).
2) The ambiguity elimination and the unknown words are two major difficulties of Chinese word segmentation, and a method combining rules and statistics is adopted for intersection type word segmentation ambiguity.
3) And aiming at the identification of unknown words, an identification method based on a conditional random field is adopted.
4) And aiming at part-of-speech tagging, a method based on a hidden Markov model is adopted.
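The following toy sketch illustrates the base[]/check[] transition rule described above (t = base[i] + a, check[t] = i, with a negative base marking a complete word); the construction strategy is a naive illustration, not the patent's implementation.

```python
class DoubleArrayTrie:
    """Toy double-array trie: transition t = base[s] + code, check[t] == s."""

    def __init__(self, words, size=4096):
        chars = sorted({c for w in words for c in w})
        self.codes = {c: i + 1 for i, c in enumerate(chars)}   # character -> positive code
        self.base = [0] * size
        self.check = [0] * size
        root = {}                                              # build a plain trie first
        for w in words:
            node = root
            for c in w:
                node = node.setdefault(c, {})
            node["#end"] = True
        self._place(root, 1)                                   # root state lives at index 1

    def _place(self, node, s):
        children = [c for c in node if c != "#end"]
        if not children:
            if node.get("#end"):
                self.base[s] = -1                              # leaf word: negative base
            return
        b = 1
        while any(self.check[b + self.codes[c]] != 0 for c in children):
            b += 1                                             # find a base with all slots free
        self.base[s] = -b if node.get("#end") else b           # negative base also marks a word
        for c in children:
            self.check[b + self.codes[c]] = s
        for c in children:
            self._place(node[c], b + self.codes[c])

    def contains(self, word):
        s = 1
        for c in word:
            code = self.codes.get(c)
            if code is None:
                return False
            t = abs(self.base[s]) + code                       # t = base[s] + code
            if self.check[t] != s:                             # check[t] must point back to s
                return False
            s = t
        return self.base[s] < 0                                # word iff base is negative

dat = DoubleArrayTrie(["北京", "北京大学", "大学"])
print(dat.contains("北京"), dat.contains("北大"), dat.contains("大学"))   # True False True
```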
Specifically, entity recognition identifies and labels all dates, times, telephone numbers, person names, place names, organization names and the like appearing in the text (an entity can thus also be regarded as a specially processed word segment). Entity recognition here uses a deep-learning-based method that treats the entity recognition task as a sequence-labelling classification problem; the deep neural network used mainly consists of an Embedding layer (mainly word vectors, character vectors and some additional features), a bidirectional RNN layer, a TANH hidden layer and a final CRF layer, where the RNN is usually an LSTM or a GRU. As shown in fig. 2, the text preprocessing specifically includes:
1) and dividing the text into single words, English words, numbers and punctuation marks, and constructing word vectors at an Embedding layer.
2) And constructing a word vector and then sending the word vector into a bidirectional LSTM model, wherein the LSTM model can output a segmentation label sequence according to an input sequence.
3) A Softmax function at the output end of the LSTM yields the probability distribution of the word segmentation labels.
4) Sending the probability distribution of the segmented label sequence into a CRF model to obtain an optimal entity label sequence; the term "optimal" is mentioned here only to indicate that the results of the entity tag sequence obtained by the CRF model are good and is not particularly limited.
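A minimal PyTorch-style sketch of the Embedding + bidirectional LSTM + CRF tagger outlined in steps 1)-4) follows; the layer sizes, tag set and the third-party pytorch-crf package are assumptions made for illustration and are not specified by the patent.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (an assumption, not named by the patent)

class BiLSTMCRFTagger(nn.Module):
    """Embedding -> bidirectional LSTM -> per-token emission scores -> CRF decoding."""
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(hidden, num_tags)    # tag scores for each token
        self.crf = CRF(num_tags, batch_first=True)     # learns tag-transition constraints

    def forward(self, token_ids, tags=None, mask=None):
        feats, _ = self.bilstm(self.embedding(token_ids))
        emissions = self.emission(feats)
        if tags is not None:                            # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)    # inference: best tag sequence

# Toy usage with BIO-style entity tags such as B-PER / I-PER / B-ORG / O.
model = BiLSTMCRFTagger(vocab_size=5000, num_tags=7)
tokens = torch.randint(1, 5000, (2, 12))                # batch of 2 sentences, 12 tokens each
print(model(tokens))                                    # decoded tag-id sequences
```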
In a preferred embodiment of this embodiment, the text preprocessing further includes in-line context rule analysis of the text: on the word segmentation and entity recognition results, a predetermined rule correction method is used to correct the segmentation and to re-identify ambiguous categories. The generic (category) information can also be used by the pattern rules mentioned below.
Since existing word segmentation and entity recognition cannot reach 100% accuracy, the segmentation result needs to be corrected by a predetermined rule correction method and ambiguous categories re-identified. Possible implementations of the predetermined rule correction method include, for example, using n-gram models or custom grammar rules to disambiguate ambiguous Chinese segmentations.
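One possible shape of such an n-gram correction step is sketched below; the bigram counts and the candidate segmentations are made up purely for illustration.

```python
from math import log

# Toy bigram counts; in practice these would come from a large corpus.
BIGRAM = {("研究", "生命"): 2, ("研究生", "命"): 0, ("生命", "起源"): 40, ("命", "起源"): 1}

def score(segmentation, smoothing=1e-6):
    """Log-probability-style score of a candidate segmentation under a bigram model."""
    return sum(log(BIGRAM.get(pair, 0) + smoothing)
               for pair in zip(segmentation, segmentation[1:]))

candidates = [["研究", "生命", "起源"], ["研究生", "命", "起源"]]
print(max(candidates, key=score))   # the bigram evidence favours 研究 / 生命 / 起源
```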
In this embodiment, the main task of named entity recognition is to recognize and classify the proper names such as the names of people and places and the quantitative phrases such as meaningful time and date in the text.
It should be noted that the texts mentioned in this embodiment include, but are not limited to, content in plain-text format or content whose text is editable (for example word-processing documents); the presentation of the text includes, but is not limited to, a web page format, a document format stored on a server, and the like.
S120, the text is subjected to block processing to obtain text blocks, and the text blocks are subjected to classification and extraction processing, wherein the classification and extraction processing of the text blocks comprises the following steps: based on the information type defined by the text format characteristics, arranging the participles and the entities in the text block into a structured text block according to the corresponding mode rule defined by the grammar.
In a preferred embodiment of the present invention, when the text is a personal resume, the personal resume is processed by blocks and includes a first text block corresponding to basic information, a second text block corresponding to an educational experience, a third text block corresponding to a work experience, a fourth text block corresponding to a training experience, a fifth text block corresponding to a qualification certificate, and a sixth text block corresponding to job hunting will; the format characteristics respectively comprise information characteristics corresponding to basic information, education experiences, work experiences, training experiences, qualification certificates and job hunting intentions; the grammar-defined pattern rules include judgment rules defined according to lexical analysis, syntactic analysis, and semantic analysis in the compilation principle.
In a preferred embodiment of this embodiment, the pattern rules are modifiable, and the rule-configured information extraction model can be configured for different application scenarios. The pattern rules carry category information, and the text preprocessing further includes in-line context rule analysis of the text: on the word segmentation and entity recognition results, a predetermined rule correction method is applied to correct the segmentation and to re-identify ambiguous categories. A pattern rule may be understood as a rule of the extraction model that is adjusted according to a different pattern (or application scenario), i.e. different rules are used for different scenarios. A rule in this embodiment may be a conditional-statement expression or any of several kinds of input/output mapping tables, and recognizing which pattern applies may be based on the application scenario of the text itself or achieved by automatically recognizing keywords in the text; this embodiment is not limited in this respect, and these different implementations all fall within its scope of protection.
In a preferred embodiment of the present embodiment, the pattern rule includes a simplified merge rule, with a complex long rule placed in front and a simple short rule placed in the back. Putting a complex long rule in front and a simple short rule in back; since rule matching is carried out from front to back, the technical problems that matching with the former rule is successful firstly and matching fails because the latter rule is not traversed are avoided.
In a preferred embodiment of this embodiment, the sorting the participles and the entities in the text block into the structured text block according to the corresponding mode rules includes: and judging whether the continuous lines accord with a specific mode or not by adopting a mode rule sequence, and storing the mode rule result matched with each line into the multi-dimensional array of the character string type after the text accords with the corresponding specific mode. As explained in further detail below in conjunction with fig. 3.
The text block classification extraction processing is established on the basis of a modifiable mode rule, and the rule for judging the information type is defined aiming at the text format characteristics, and the mode is the format characteristic set of various types of information. Such pattern rules can be derived from a defining grammar, the interpretation of which can borrow lexical analysis, syntactic analysis, and semantic analysis methods from the compilation principles.
As a preferred embodiment of the present embodiment, the pattern rule includes the category information, so that the use of the pattern rule is more convenient by introducing an external customizable category dictionary.
In this embodiment, a pattern rule is formally defined as:
<pattern name>: <category 1> [&, |, -, ^, (, )] <category 2> [&, |, -, ^, (, )] ... [&, |, -, ^, (, )] <category n>
The right-hand side is an expression over categories built from AND (&), OR (|), NOT (-), XOR (^) and parentheses, as defined by the rule grammar.
The blocking rules judge in sequence whether consecutive lines conform to a specific pattern.
For example: x_y_0_S_1_z; x_y_1_M_n_z; x_y_2_E_1_N -> SUCCESS, which means: x_y_0_S_1_z, x_y_1_M_n_z and x_y_2_E_1_N are category names used in blocking, and their rules are composed of vocabulary categories and operators, where & is logical AND, | is logical OR and - is logical NOT, as in:
x_y_0_S_1_z: "key1_&loc-key0"
The meaning of this rule match is: "if both the key1_ and loc categories appear in the text line and the key0 category does not appear, then the line matches the x_y_0_S_1_z classification pattern".
The meaning of the class name is:
x is the priority number of the information category match, numbered from 1; the smaller the number, the higher the priority;
y is the index of the rule within that category, numbered from 0, and also indicates the matching priority within the same information category;
the 1st field after x_y_ is the sequence number of the rule sub-condition, increasing from 0 and limited to fewer than 5 to keep the complexity down;
the 2nd field is the relative position of the rule sub-condition: 'S' is the start, 'M' the middle and 'E' the end;
the 3rd field is a natural number from 1 to 9, or n for unlimited, and bounds the number of lines the rule sub-condition may match.
z indicates the information category to be marked for the matched lines, numbered from 1, with N as the default event information category.
If line k matches x_y_0_S_1_z, lines k+1 through k+n match x_y_1_M_n_z, and line k+n+1 matches x_y_2_E_1_N, then lines k through k+n+1 match successfully and their information category is marked with the value of z.
The values of x, y and z should lie within their respective predefined ranges; otherwise the program stops the blocking.
To make rule writing easier, the symbol "SYS_TRUE" is defined, which always evaluates to logical TRUE.
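As an illustration of how a rule's right-hand side could be evaluated against the categories found on a text line, the sketch below handles &, | and - with strict left-to-right evaluation; parentheses, XOR and operator precedence are omitted, and the category names are taken from the example above.

```python
import re

def match_rule(expr, cats):
    """Evaluate a right-hand side such as 'key1_&loc-key0' against the set of
    categories present on one line.  Simplified sketch: left-to-right evaluation
    of & (AND), | (OR) and - (AND NOT); SYS_TRUE is always true."""
    parts = re.split(r"([&|\-])", expr)
    result = parts[0] == "SYS_TRUE" or parts[0] in cats
    for op, name in zip(parts[1::2], parts[2::2]):
        present = name == "SYS_TRUE" or name in cats
        if op == "&":
            result = result and present
        elif op == "|":
            result = result or present
        else:                                  # '-' : left operand AND NOT right operand
            result = result and not present
    return result

cats = {"key1_", "loc"}                                     # categories found on one line
print(match_rule("key1_&loc-key0", cats))                   # True: key1_ and loc, no key0
print(match_rule("key1_&loc-key0", cats | {"key0"}))        # False: key0 is present
```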
Simplifying and merging rules resolves overlap and redundancy among rules; in the pattern rules, complex long rules should be placed in front and simple short rules behind. Specifically, as shown in fig. 3, the text block classification and extraction process includes:
s121, start, the text block classification extraction process is executed.
S122, pattern classification: judge which pattern rule the current text line satisfies according to the preset pattern rules, i.e. match the right-hand side of each pattern rule.
S123, judge whether the text has ended or the line matches no rule, i.e. whether all lines of the current text have been processed or the current line does not match any rule; if so, execute step S124, otherwise return to step S122.
S124, analyse the pattern matched on each line and store the left-hand side of the matched pattern rule (i.e. a classification result such as x_y_0_S_1_z) in the three-dimensional array Cn[a][b][c]. Subscript a corresponds to the 1st field x of the category (the priority number of the information category match), subscript b corresponds to the 2nd field y (the rule index), and subscript c corresponds to the 3rd field (the sequence number of the rule sub-condition). Cn[a][b][c] stores the last three fields of the rule condition: Cn[a][b][c][0] stores the relative position of the rule sub-condition, Cn[a][b][c][1] stores the permitted number of matching lines, and Cn[a][b][c][2] stores the information category to be marked for the line. After this step the rule matching of all text lines is complete and the classification results are saved in Cn[a][b][c].
S125, judge whether Cn[a][b][c] indicates the end of the text: read Cn[a][b][c] starting from n = 0 and increment n until the text is finished; if Cn[a][b][c] is NULL, the text has ended, and the process jumps to S129.
S126, judge the value of Cn[a][b][c][0]: if it is "E", the rule matching of this category has finished, so jump to S128; if it is "S" or "M", the rule of this category matches the beginning or the middle, so jump to S127.
S127, judge whether the value of b has reached its boundary: if so, the matching of this category has finished, the line is left unclassified, and line n+1 is processed directly. If b has not reached the boundary, take b+1 and c+1, continue processing the rules that follow in this category, and jump back to S126 to continue the judgment.
S128, determine the classification result of the lines: multiple lines from "S" to "E" have matched successfully, or a single "E" line has matched successfully; merge these lines and mark the block category as Cn[a][b][c][2]. Then take n+1 and process the next line.
And S129, finishing matching.
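The sketch below compresses steps S124-S128 into a single scan that merges a run of lines matching the S ... M ... E sub-conditions of one category; the per-line match list is a simplified stand-in for the Cn[a][b][c] array described above, and the per-sub-condition line-count limit is omitted.

```python
def merge_blocks(line_matches):
    """line_matches[i] is the (position, max_lines, category) triple of the pattern
    sub-condition matched by line i, or None.  Position is 'S', 'M' or 'E'.
    Returns (start_line, end_line, category) blocks, mirroring S124-S128."""
    blocks, i, n = [], 0, len(line_matches)
    while i < n:
        m = line_matches[i]
        if m and m[0] == "E":                     # a single 'E' line closes a block by itself
            blocks.append((i, i, m[2]))
            i += 1
        elif m and m[0] == "S":                   # look for the pattern S (M)* E
            j = i + 1
            while j < n and line_matches[j] and line_matches[j][0] == "M":
                j += 1
            if j < n and line_matches[j] and line_matches[j][0] == "E":
                blocks.append((i, j, m[2]))       # merge lines i..j, label with the category
                i = j + 1
            else:
                i += 1                            # no matching end: leave the line unclassified
        else:
            i += 1
    return blocks

# Lines 0-2 form one "basic information" block (category '1'); line 3 is unmatched.
matches = [("S", 1, "1"), ("M", "n", "1"), ("E", "n", "1"), None]
print(merge_blocks(matches))                      # [(0, 2, '1')]
```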
And S130, respectively carrying out event information extraction processing on each type of text block after the text is structured, wherein the event information extraction processing comprises the steps of extracting keywords by using corresponding mode rules defined by a grammar, and outputting the keywords to a result template corresponding to the event information extraction.
In summary, the text event information extraction method provided by this embodiment performs word segmentation and entity recognition on the text with a neural network mathematical model, so that the basic elements of the text are obtained quickly; it extracts text block classification information in combination with pattern rules and arranges the words and entities in each text block into a structured text block according to the corresponding grammar-defined pattern rules, so that the text information is structured, in a form more favourable to information extraction, according to the grammatical expression required by a computer language; and it performs event extraction on top of this text block structuring, which effectively overcomes the problems that the data are scattered and sparse and that the extraction fields lie far from the centre of the event expression, so that text event information is extracted accurately. Moreover, entity recognition uses a deep neural network model, which improves the recognition effect.
In a preferred embodiment of this embodiment, the rule configuration information extraction model is written in a rule description language NPRDL, and the NPRDL language uses a BNF paradigm; and the description language is a means for describing the syntactic semantic information of the vocabulary based on the complex feature set, and is also described in the dynamic analysis by using a dynamic attribute table described based on the complex feature set.
Therefore, as a preferred implementation of this embodiment, the basis of the pattern rule is modifiable, for example, written by using a rule description language NPRDL, which uses the BNF paradigm; and aiming at different application scenes, the information extraction model can be flexibly and quickly configured. In addition, the complex long rule is placed in the front, and the simple short rule is placed in the back; since rule matching is carried out from front to back, the technical problems that matching with the former rule is successful firstly and matching fails because the latter rule is not traversed are avoided.
Specifically, the various keywords of an event are extracted from a text block by a rule-based method that relies on the format characteristics, mainly the organizational characteristics, of each type of information. For example, to identify a "school", one usable pattern is "place name" + "school" (or "university", "college", etc.), as in "Beijing University"; to identify a "unit name", a usable pattern is "employed at" (or a similar expression) plus other information + "company" (or "group", "office", etc.), as in "employed at the TRS company in Beijing".
More specifically, the event information extraction processing of the text block in the text includes:
one), rule description language NPRDL
The rules in the system are expressed and written in the rule description language NPRDL, which adopts the BNF paradigm. The basic unit of the NPRDL language is the <rule>; a <simple rule> corresponds to a simple natural-language "conditional sentence" whose basic form is:
<test> => <operation>
Its meaning is: if <test> succeeds, then <operation> is performed.
<test> and <operation> are distinguished by function; since both act on the same object (the input sentence being analysed), they can be unified into the form of the following structure diagram:
(structure diagram provided as an embedded image in the original publication)
in actual analysis, < structural formula > can be used to express a segment of the analysis statement, where < structural terms > correspond to 'words' in the statement. A sequential structure consisting of < tag > < operation > < element > can be used to express the test or operation on this term property.
The description object of NPRDL language is a chinese sentence with words as basic object units and its intermediate structure (such as syntax tree, concept semantic network, etc.). Which comprises the following steps:
1. the information for each word is represented as: concept (attribute 1, attribute 2, …, attribute n). Wherein concepts are represented by the words themselves.
2. During the analysis process, the attributes (such as phrase class, sentence class, tone, etc.) of related phrases and clauses appearing with a certain word as a central word can be attributed to the central word.
3. Concepts and relationships between concepts may be expressed as:
relationships between
Concept- > concept
A sentence (or a portion thereof) can be represented as a < word > sequence of the form: < word > + … + < word >. In the description of the rules, the above formula is called < structural formula >, where each < word > is called a < structural term >. The < structural formula > corresponding to the < word > sequence is: < structure item > + … + < structure item >.
(1) Each < structure item > corresponds to a < word >, and the content comprises:
<item tag>: indicates the position of the word in the sentence.
<item operation>: indicates operations on the attributes of the word.
<item element>: expresses certain attributes of the word.
(2) The symbol '+' is a structure connector, indicating that its two < structure items > before and after it have adjacency and order.
(3) Two <structure items> with a parent-child relationship are represented at different levels with ^ (denoting the parent) or ↓ (denoting the child).
Example: ^(VV,2033) + ↓#(NN,111,SUBJECT) => (^↓#, GRELA: ^AGT)
Its meaning is: if the current word is (verb, mental activity) and a child of the current word is (noun, person, subject of the verb above it), then the case relation of that child of the current word is changed to the agent case.
The rule description language has three characteristics:
1. the description language is a means based on a complex feature set to describe the syntactic semantic information of the vocabulary, and a dynamic attribute table based on the complex feature set description is used for describing in dynamic analysis, so that the information of Chinese text analysis units can be described from multiple levels and multiple aspects.
2. The method is convenient for computer processing, the description language provides rich tree and network atomic operation actions, the syntax tree and semantic operation formed in computer processing analysis is very convenient, and meanwhile, the query, modification and deletion of static attribute table information also provide rich atomic operation actions.
3. The rule is convenient to write, and the rule language not only has strong description capability, but also has very detailed description. For many language phenomena with smaller description granularity, the language phenomena can be completely described by using a unique description method of words, parts of speech, voice classification codes and the like. And because the rules are designed by adopting a description idea, the writing is visual and convenient for users.
Two), context analysis rules
The rules for keyword extraction can be expressed in BNF as follows:
<rule> ::= ^<test> => <action>
<test> ::= <test expression>; { <test expression>; }
<test expression> ::= n, <test item> { ('&' | '|') <test item> } | n, ~ { '&' <test item> }
<test item> ::= <attribute> (= | !=) '<attribute value>'
<attribute> ::= lex | class
<attribute value> ::= <entry> | <generic feature>
<action> ::= <action expression> { ; <action expression> }
<action expression> ::= n, n, <generic feature>, <part of speech>
Where:
(1) ^ denotes the rule start symbol;
(2) ~ denotes a node pointer structure;
(3) n is a natural number giving the number of nodes, ranging from 1 to 9;
(4) lex and class indicate whether the attribute value that follows is a specific entry or a specific category: if the attribute is lex, the value that follows is a specific 'entry'; if it is class, the value that follows is a specific 'category';
(5) the generic features are the category labels in the dictionary, such as name, org and the like.
Third), examples of specific applications
A pattern rule for parsing the date of birth (birthday) is, for example:
^1,lex='born on'; 1,class='time'; => 2,2,birthday,n
It means: if the term of node 1 is "born on" and the category of node 2 is "time", then node 2 is re-labelled with the category "birthday".
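A sketch of how such a node rule could be applied to a sequence of (term, category) nodes follows; representing the rule as Python tuples is an illustrative simplification of the NPRDL text form, and the sample nodes are invented.

```python
def apply_rule(nodes, tests, action):
    """nodes: list of (term, category) pairs from a structured text block.
    tests: per-node constraints, e.g. [('lex', 'born on'), ('class', 'time')].
    action: (start, end, new_category) -- relabel the matched span, as in '2,2,birthday'."""
    labelled = list(nodes)
    width = len(tests)
    for i in range(len(nodes) - width + 1):
        window = nodes[i:i + width]
        ok = all((attr == "lex" and term == value) or (attr == "class" and cat == value)
                 for (attr, value), (term, cat) in zip(tests, window))
        if ok:
            start, end, new_cat = action
            for k in range(start - 1, end):          # rule positions are 1-based
                term, _old = labelled[i + k]
                labelled[i + k] = (term, new_cat)
    return labelled

nodes = [("Zhang San", "name"), ("born on", "v"), ("May 1990", "time")]
print(apply_rule(nodes, [("lex", "born on"), ("class", "time")], (2, 2, "birthday")))
# -> [('Zhang San', 'name'), ('born on', 'v'), ('May 1990', 'birthday')]
```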
Four), consolidation and organization of pattern rules
Rules grow gradually and may overlap, producing redundancy; simplifying and merging rules solves this problem. For example:
^1,class='time';n,~&lex!='¥';1,class='from';=>1,1,106,n
and ^1,class='time';n,~&lex!='¥';1,class='to';=>1,1,106,n
These two rules can be merged into one: the pattern-rule grammar satisfies the associative and distributive laws, and the merged rule is:
^1,class='time';n,~&lex!='¥';1,class='from'|class='to';=>1,1,106,n
this makes the logic more strict.
Since rule matching proceeds from front to back, a line may first match an earlier rule while a later rule is never tried and therefore fails, which sometimes causes errors. The pattern rules should therefore place complex long rules in front and simple short rules behind.
As shown in fig. 4, in this embodiment, the event information extraction in the text event information extraction method includes:
S131, confirm the text and the extraction type, for example based on the aforementioned rules that judge the information type from the text format features.
And S132, opening a rule file, namely opening a file where the configuration information extraction model is located.
S133, acquiring a line of character string from the text.
And S134, text preprocessing, including word segmentation and entity recognition of the text.
And S135, loading the character strings into a text container, and storing the character strings after word segmentation and entity recognition according to a preprocessing format.
And S136, using the mode rule to extract the blocks, namely performing text block division and text block classification extraction processing on the text according to the mode mentioned in the aforementioned S120.
And S137, analyzing and labeling various keywords by using a rule interpreter according to the block categories.
And S138, outputting the keywords to the result template.
S139, judge whether the text has finished; if so, normalize and output the result (S140); otherwise return to S133.
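The end-to-end flow of S131-S140 can be condensed as in the sketch below; every callback passed in (preprocessing, block classification, keyword extraction) is a placeholder standing in for the components described above, and the toy wiring only demonstrates the data flow.

```python
def extract_events(text, block_rules, keyword_rules, template_fields,
                   preprocess, classify_blocks, extract_keywords):
    """Condensed sketch of the S131-S140 flow; the processing callbacks stand in
    for the segmentation, block-classification and rule-interpreter components."""
    result = {field: None for field in template_fields}        # the result template
    lines = [preprocess(line) for line in text.splitlines()]   # S133-S135: tokens + entities
    for start, end, category in classify_blocks(lines, block_rules):          # S136
        for field, value in extract_keywords(lines[start:end + 1],
                                             keyword_rules.get(category, [])):  # S137
            result[field] = value                               # S138: fill the template
    return result                                               # S140: output

# Toy wiring: trivial callbacks just to show the data flow.
demo = extract_events(
    "姓名：张三\n出生于 1990年5月",
    block_rules=None,
    keyword_rules={"basic": [("TrueName", "张三"), ("Birthday", "1990年5月")]},
    template_fields=["TrueName", "Birthday", "Email"],
    preprocess=lambda line: line.split(),
    classify_blocks=lambda lines, rules: [(0, len(lines) - 1, "basic")],
    extract_keywords=lambda block, rules: rules,
)
print(demo)   # {'TrueName': '张三', 'Birthday': '1990年5月', 'Email': None}
```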
As shown in fig. 5, the present embodiment further provides a text event information extraction apparatus 100, where the text event information extraction apparatus 100 includes:
the text preprocessing module 110 is configured to preprocess the text, where the preprocessing includes performing word segmentation on the text, performing vector conversion on the word segmentation to obtain a word vector, inputting the word vector to the neural network model, and outputting an entity through the neural network model.
The text block classification extraction processing module 120 is configured to perform block processing on the text to obtain text blocks, and perform text block classification extraction processing, where the text block classification extraction processing includes: based on the information type defined by the text format characteristics, arranging the participles and the entities in the text block into a structured text block according to the corresponding mode rule defined by the grammar.
And the event information extraction processing module 130 is configured to perform event information extraction processing on the text block after the text block is structured, wherein the event information extraction processing includes extracting keywords by using corresponding mode rules defined by the grammar, and outputting the keywords to a result template corresponding to the event information extraction.
It should be noted that the specific process of text processing in the text event information extraction device 100 provided in this embodiment is the same as the text event information extraction method using statistics and rules, and can also obtain the same technical effect; and will not be described in detail herein.
To make the technical solution of this embodiment easier to understand for those skilled in the art, text information extraction is described below taking the extraction of event information from a resume text as an example. Suppose that six blocks of information need to be extracted, namely basic information, educational experience, work experience, training experience, qualification certificates and job-seeking intention, together with dozens of attribute fields. The specific text event information extraction comprises the following steps:
one), text preprocessing
1. Defining event information extraction content
InfoTYPE1 TResume_TRS # basic information
1_00 IgnoreFlag #
1_01 TrueName # job seeker's real name
1_02 Email # e-mail address
1_03 Mobel # mobile phone number
1_04 Phone # telephone number
1_05 Sex_s # sex
InfoTYPE2 TEducation_TRS # educational experience
2_00 E_StartTime_s # education start date
2_01 E_StartTime_d #
2_02 E_EndTime_s # education end date
2_03 E_EndTime_d #
2_04 E_SchoolName # school name
InfoTYPE3 TWorkExperience_TRS # work experience
3_00 InductionDate_s # work start date
3_01 InductionDate_d #
3_02 DimissionDate_s # job end date
3_03 DimissionDate_d #
3_04 CompanyName # company name
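Purely as an illustration, the extraction-content definition above could be held in memory as follows; the field names are copied from the list, while the dictionary layout and the helper function are assumptions.

```python
# Illustrative in-memory form of the extraction-content definition:
# each InfoTYPE maps field codes to (field name, description).
EXTRACTION_CONTENT = {
    "TResume_TRS": {                      # InfoTYPE1: basic information
        "1_01": ("TrueName", "job seeker's real name"),
        "1_02": ("Email", "e-mail address"),
        "1_03": ("Mobel", "mobile phone number"),
        "1_05": ("Sex_s", "sex"),
    },
    "TWorkExperience_TRS": {              # InfoTYPE3: work experience
        "3_00": ("InductionDate_s", "work start date"),
        "3_02": ("DimissionDate_s", "job end date"),
    },
}

def empty_template(info_type):
    """Blank result template for one information type, keyed by field name."""
    return {name: None for name, _desc in EXTRACTION_CONTENT[info_type].values()}

print(empty_template("TResume_TRS"))
```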
2. Defining a generic dictionary
To make rule writing easier, some keywords can be picked out by manual observation. These words may already have been segmented as a single word or split into several words. The entries are saved in a GB-encoded dictionary file, and a tab (\t) and the category name are appended to each entry, for example:
name kr1# kr1_01
E-mail kr1# kr1_02
Mobile phone number kr1# kr1_03
Telephone kr1# kr1_04
Gender kr1# kr1_05
Marry no kr1# kr1_07
3. Text segmentation and entity recognition
The text is segmented and entity recognition is performed, and each segmentation result is automatically assigned a category. The built-in categories include: name, loc, org, id, digit, and so on.
Two), text block classification
The resume text needs to be divided into six blocks according to the defined information extraction content. Taking basic information partitioning as an example, the writing mode rule is as follows:
1_00_0_S_1_1:kr1&kru-(bdfhm|bdfhk)
1_00_1_E_n_1:(SYS_TRUE-kru)|(kru&(bdfhm|bdfhk)-kr9)
1_01_0_S_1_1:kr1&kru&bdfhm
1_01_1_E_n_1:(SYS_TRUE-kru)|(kru-bdfhm-kr9)
1_02_0_S_1_1:kr1&kru&bdfhk
1_02_1_E_n_1:(SYS_TRUE-kru)|(kru-bdfhk-kr9)
1_03_0_S_1_1:kr1&(bdfhm|kru)
1_03_1_E_n_1:SYS_TRUE-(bdfhm|kru)
1_04_0_S_1_1:english-(kr0|kr1|kr2|kr3|kr4|kr5|kr6|kr7|kr8|kr9)
1_04_1_E_n_1:kr1-kru
three), information extraction
Taking basic information as an example, the information extraction rules are written as follows:
1) direct identification according to genus
^1,class='name';=>1,1,1_01,n
This means that a node whose category is "name" is identified as the resume name (1_01).
2) From contextual keyword identification, for example:
^1,class='kr1_01';
1,class='bdfhm';
n,~&class!='bdfh'&class!='kr';
1,class='row'|class='kr';
=>3,3,1_01,n
The above expression means: starting from a kr1_01 keyword followed by a colon, and continuing until a line break or another keyword is met, the nodes in between are identified as the resume name.
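A simplified rendering of the "keyword + colon ... until a line break or another keyword" pattern expressed by this rule is sketched below; the category labels follow the example above, and the sample tokens are invented.

```python
def extract_by_context(tokens, start_category="kr1_01", colon_category="bdfhm",
                       stop_prefixes=("row", "kr")):
    """tokens: (term, category) pairs for one line group.  After a start keyword
    followed by a colon, collect the terms until a line break ('row') or another
    keyword category ('kr...') is met -- the pattern described by the rule above."""
    collected = []
    i = 0
    while i < len(tokens) - 1:
        if tokens[i][1] == start_category and tokens[i + 1][1] == colon_category:
            j = i + 2
            while j < len(tokens) and not tokens[j][1].startswith(stop_prefixes):
                collected.append(tokens[j][0])
                j += 1
            break
        i += 1
    return "".join(collected)

tokens = [("姓名", "kr1_01"), ("：", "bdfhm"), ("张", "name"), ("三", "name"), ("\n", "row")]
print(extract_by_context(tokens))   # -> 张三
```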
As shown in fig. 6, the present embodiment further provides an electronic device, including:
a memory 210;
a processor 220; and
a computer program;
wherein a computer program is stored in the memory 210 and configured to be executed by the processor 220 to implement any of the text event information extraction methods provided above.
In addition, the present embodiment also provides a non-volatile storage medium, on which a computer program is stored, which when executed implements any of the steps of the text event information extraction method using statistics in combination with rules as provided above.
Those of ordinary skill in the art will understand that the above-described method according to the embodiments of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network and stored in a local recording medium, so that the method described herein can be processed by software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC, an FPGA or an SoC. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processing shown herein, execution of that code transforms the general-purpose computer into a special-purpose computer for performing the processing shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
Finally, it should be understood that the above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Those skilled in the art can make many changes and simple substitutions to the technical solution of the present invention without departing from the technical solution of the present invention, and the technical solution of the present invention is protected by the following claims.

Claims (7)

1. A text event information extraction method is characterized by comprising the following steps:
preprocessing a text, wherein the preprocessing comprises word segmentation of the text, vector conversion of word segmentation to obtain word vectors, and input of the word vectors into a deep neural network model for entity recognition to obtain entity tag sequences corresponding to entities; wherein the deep neural network model comprises a bidirectional RNN layer and a CRF layer; the word vectors are sequentially sent to a bidirectional RNN layer to obtain the probability distribution of word segmentation labels, and the probability distribution of the word segmentation labels is sent to a CRF layer to obtain entity label sequences corresponding to entities;
the method comprises the steps of carrying out blocking processing on a text to obtain text blocks, and carrying out classification and extraction processing on the text blocks, wherein the classification and extraction processing on the text blocks comprises the following steps: based on the information type defined by the text format characteristics, arranging the participles and the entities in the text block into a structured text block according to the corresponding mode rule defined by the grammar; the configuration information extraction model of the mode rule can be configured according to different application scenes; judging whether the continuous lines accord with a specific mode or not by adopting a corresponding mode rule sequence defined by a grammar, and storing a mode rule result matched with each line into a multi-dimensional array of a character string type after the text accords with the corresponding specific mode;
performing event information extraction processing on the text block after the text is structured, wherein the event information extraction processing comprises the steps of extracting keywords by using a corresponding mode rule defined by the grammar, and outputting the keywords to a result template corresponding to the event information extraction; the configuration information extraction model of the mode rule in the keyword extraction adopts a rule description language NPRDL for expression writing, and the NPRDL adopts a BNF paradigm; and the rule description language is a means for describing the syntactic semantic information of the vocabulary based on a complex feature set, and is also described in dynamic analysis by using a dynamic attribute table described based on the complex feature set.
2. The method of claim 1, wherein when the text is a personal resume, the personal resume is processed by blocks and comprises a first text block corresponding to basic information, a second text block corresponding to an education experience, a third text block corresponding to a work experience, a fourth text block corresponding to a training experience, a fifth text block corresponding to a qualification certificate, and a sixth text block corresponding to a job-seeking intention; the format characteristics respectively comprise information characteristics corresponding to basic information, education experiences, work experiences, training experiences, qualification certificates and job seeking intentions; the corresponding mode rule defined by the grammar comprises a judgment rule defined according to lexical analysis, syntactic analysis and semantic analysis in the compiling principle.
3. The method of claim 1, wherein the word segmentation of the text comprises: organizing the core dictionary with a double-array trie; resolving intersection-type (overlapping) word segmentation ambiguity with a method combining rules and statistics; and recognizing unknown words with a recognition method based on conditional random fields.
4. The method according to claim 1, wherein the pattern rules used in keyword extraction are provided with generic information and further include in-text context rule analysis; the in-text context rule analysis takes the word segmentation and entity recognition results of the text, corrects the word segmentation results with a predetermined rule correction method, and re-identifies ambiguous categories.
5. The method of claim 1, wherein the pattern rules used in keyword extraction include simplifying and merging the rules, and wherein complex long rules are placed in front and simple short rules are placed behind.
6. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-5.
7. A non-volatile storage medium having a computer program stored thereon, characterized in that the computer program, when executed, implements the steps of the method according to any one of claims 1-5.
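
The entity-recognition step recited in claim 1 (word vectors fed through a bidirectional RNN layer and then a CRF layer) can be illustrated with a minimal sketch. This is not the patented implementation: the tag set, network sizes and the self-contained Viterbi decode standing in for the CRF layer are all illustrative assumptions, and only PyTorch is assumed as a dependency.

```python
import torch
import torch.nn as nn

TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]   # illustrative entity label set

class BiRNNTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_tags=len(TAGS)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)                 # per-token label scores
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))  # CRF-style transition scores

    def emissions(self, token_ids):
        """(1, seq_len) word ids -> (seq_len, num_tags) label scores from the BiRNN."""
        h, _ = self.rnn(self.emb(token_ids))
        return self.proj(h).squeeze(0)

    def viterbi_decode(self, emissions):
        """Pick the best label sequence under emission + transition scores."""
        score, back = emissions[0], []
        for em in emissions[1:]:
            total = score.unsqueeze(1) + self.trans + em.unsqueeze(0)
            score, idx = total.max(dim=0)
            back.append(idx)
        best = [int(score.argmax())]
        for idx in reversed(back):
            best.append(int(idx[best[-1]]))
        return [TAGS[t] for t in reversed(best)]

# Toy usage: the ids stand for six segmented words of one sentence (untrained weights).
model = BiRNNTagger(vocab_size=1000)
tokens = torch.tensor([[3, 17, 52, 9, 401, 7]])
print(model.viterbi_decode(model.emissions(tokens)))
```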
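
The blocking and classification step of claim 1 judges whether consecutive lines of a text block conform to a specific pattern and, if they do, stores each line's matched result in a multi-dimensional array of strings. The sketch below approximates that behaviour with plain regular expressions standing in for the patent's grammar-defined pattern rules; the rule sequence and the sample block are hypothetical.

```python
import re

# One pattern rule per expected line of an "education experience" block (illustrative).
RULE_SEQUENCE = [
    re.compile(r"^(?P<start>\d{4})-(?P<end>\d{4})\s+(?P<school>\S+)$"),
    re.compile(r"^(?P<degree>学士|硕士|博士)\s+(?P<major>\S+)$"),
]

def match_block(lines):
    """Return a 2-D string array of matched fields, or None if the block does not fit."""
    if len(lines) != len(RULE_SEQUENCE):
        return None
    rows = []
    for line, rule in zip(lines, RULE_SEQUENCE):
        m = rule.match(line.strip())
        if not m:
            return None                   # consecutive lines must all conform to the pattern
        rows.append(list(m.groups()))     # one row of strings per matched line
    return rows

block = ["2012-2016 北京大学", "学士 计算机科学"]
print(match_block(block))                 # [['2012', '2016', '北京大学'], ['学士', '计算机科学']]
```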
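
For the keyword-extraction step of claim 1 and the rule ordering of claim 5, the patent writes its rules in its own NPRDL language (a BNF-style grammar) whose concrete syntax is not reproduced here. The hedged sketch below instead represents each rule as a regular expression paired with a result-template field and tries complex long rules before simple short ones; all rule patterns and field names are assumptions made for illustration.

```python
import re

# (rule pattern, result-template field) pairs -- illustrative only, not NPRDL syntax.
RULES = [
    (re.compile(r"任职于(?P<value>[\u4e00-\u9fa5]+有限公司)"), "employer"),
    (re.compile(r"任职于(?P<value>[\u4e00-\u9fa5]+)"), "employer"),
    (re.compile(r"(?P<value>\d{4}年\d{1,2}月)"), "start_date"),
]
# Complex long rules are tried before simple short ones (claim 5).
RULES.sort(key=lambda r: len(r[0].pattern), reverse=True)

def extract_to_template(text):
    """Fill the event result template with the first keyword each field's rules capture."""
    template = {"employer": None, "start_date": None}
    for rule, field in RULES:
        if template[field] is None:
            m = rule.search(text)
            if m:
                template[field] = m.group("value")
    return template

print(extract_to_template("2015年3月起任职于阿里巴巴网络技术有限公司"))
# {'employer': '阿里巴巴网络技术有限公司', 'start_date': '2015年3月'}
```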
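
Claim 3 organizes the core dictionary as a double-array trie, resolves overlapping segmentation ambiguity with rules plus statistics, and recognizes unknown words with a conditional random field. The sketch below covers only the dictionary-lookup part and substitutes an ordinary Python set with forward maximum matching for the double-array trie; the dictionary contents are illustrative.

```python
# Illustrative core dictionary; the patent stores this in a double-array trie.
CORE_DICT = {"文本", "事件", "信息", "抽取", "方法", "事件信息"}
MAX_WORD_LEN = max(len(w) for w in CORE_DICT)

def segment(sentence):
    """Forward maximum matching against the core dictionary (unknown chars fall back to singles)."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            cand = sentence[i:i + size]
            if size == 1 or cand in CORE_DICT:
                words.append(cand)
                i += size
                break
    return words

print(segment("文本事件信息抽取方法"))   # ['文本', '事件信息', '抽取', '方法']
```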
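
Claim 4's context-rule analysis corrects the word-segmentation results with a predetermined rule correction method and re-identifies ambiguous categories. The following sketch shows one possible shape of such a correction pass; the merge-rule table, tokens and tags are all hypothetical.

```python
# Hypothetical correction rules: (left token, right token) -> (merged word, corrected category).
MERGE_RULES = {("北京", "大学"): ("北京大学", "ORG")}

def correct(tokens, tags):
    """Re-join wrongly split tokens and re-identify the category of the merged word."""
    out_tokens, out_tags, i = [], [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair in MERGE_RULES:
            word, cat = MERGE_RULES[pair]
            out_tokens.append(word)
            out_tags.append(cat)
            i += 2
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags

print(correct(["毕业", "于", "北京", "大学"], ["O", "O", "LOC", "ORG"]))
# (['毕业', '于', '北京大学'], ['O', 'O', 'ORG'])
```
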
CN201910548427.8A 2019-06-24 2019-06-24 Text event information extraction method, electronic device and nonvolatile storage medium Active CN110321432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910548427.8A CN110321432B (en) 2019-06-24 2019-06-24 Text event information extraction method, electronic device and nonvolatile storage medium

Publications (2)

Publication Number Publication Date
CN110321432A (en) 2019-10-11
CN110321432B (en) 2021-11-23

Family

ID=68120149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910548427.8A Active CN110321432B (en) 2019-06-24 2019-06-24 Text event information extraction method, electronic device and nonvolatile storage medium

Country Status (1)

Country Link
CN (1) CN110321432B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738054B (en) * 2019-10-14 2023-07-07 携程计算机技术(上海)有限公司 Method, system, electronic equipment and storage medium for identifying hotel information in mail
CN110866393B (en) * 2019-11-19 2023-06-23 北京网聘咨询有限公司 Resume information extraction method and system based on domain knowledge base
CN112948471A (en) * 2019-11-26 2021-06-11 广州知汇云科技有限公司 Clinical medical text post-structured processing platform and method
CN113010628A (en) * 2019-12-20 2021-06-22 北京宸瑞科技股份有限公司 Information mining system and method combining mail content and text feature extraction
CN111191459B (en) * 2019-12-25 2023-12-12 医渡云(北京)技术有限公司 Text processing method and device, readable medium and electronic equipment
CN113111170A (en) * 2020-02-13 2021-07-13 北京明亿科技有限公司 Method and device for extracting alarm receiving and processing text track ground information based on deep learning model
CN111477320B (en) * 2020-03-11 2023-05-30 北京大学第三医院(北京大学第三临床医学院) Treatment effect prediction model construction system, treatment effect prediction system and terminal
CN111581954B (en) * 2020-05-15 2023-06-09 中国人民解放军国防科技大学 Text event extraction method and device based on grammar dependency information
CN111859968A (en) * 2020-06-15 2020-10-30 深圳航天科创实业有限公司 Text structuring method, text structuring device and terminal equipment
CN113761906B (en) * 2020-07-16 2024-06-18 北京沃东天骏信息技术有限公司 Method, apparatus, device and computer readable medium for parsing document
CN111930869B (en) * 2020-08-11 2024-02-06 上海寻梦信息技术有限公司 Address correction method, address correction device, electronic equipment and storage medium
CN112487138A (en) * 2020-11-19 2021-03-12 华为技术有限公司 Information extraction method and device for formatted text
CN112464927B (en) * 2020-11-25 2023-10-31 苏宁金融科技(南京)有限公司 Information extraction method, device and system
CN112445784B (en) * 2020-12-16 2023-02-21 上海芯翌智能科技有限公司 Text structuring method, equipment and system
CN112597308A (en) * 2020-12-24 2021-04-02 北京金堤科技有限公司 Text data processing method and device, electronic equipment and storage medium
CN112651236B (en) * 2020-12-28 2021-10-01 中电金信软件有限公司 Method and device for extracting text information, computer equipment and storage medium
CN112764762B (en) * 2021-02-09 2021-09-17 清华大学 Method and system for automatically converting standard text into computable logic rule
CN113051926B (en) * 2021-03-01 2023-06-23 北京百度网讯科技有限公司 Text extraction method, apparatus and storage medium
CN113435212B (en) * 2021-08-26 2021-11-16 山东大学 Text inference method and device based on rule embedding
CN114330354B (en) * 2022-03-02 2022-12-23 杭州海康威视数字技术股份有限公司 Event extraction method and device based on vocabulary enhancement and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11216491B2 (en) * 2016-03-31 2022-01-04 Splunk Inc. Field extraction rules from clustered data samples

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN106959944A (en) * 2017-02-14 2017-07-18 中国电子科技集团公司第二十八研究所 A kind of Event Distillation method and system based on Chinese syntax rule
CN109408806A (en) * 2018-09-11 2019-03-01 中国电子科技集团公司第二十八研究所 A kind of Event Distillation method based on English grammar rule

Also Published As

Publication number Publication date
CN110321432A (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN110321432B (en) Text event information extraction method, electronic device and nonvolatile storage medium
US11475209B2 (en) Device, system, and method for extracting named entities from sectioned documents
CN110399457B (en) Intelligent question answering method and system
CN109145153B (en) Intention category identification method and device
CN112232058B (en) False news identification method and system based on deep learning three-layer semantic extraction framework
KR20190085098A (en) Keyword extraction method, computer device, and storage medium
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN111191275A (en) Sensitive data identification method, system and device
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN111401058B (en) Attribute value extraction method and device based on named entity recognition tool
CN108399157B (en) Dynamic extraction method of entity and attribute relationship, server and readable storage medium
CN112883165B (en) Intelligent full-text retrieval method and system based on semantic understanding
CN116628229B (en) Method and device for generating text corpus by using knowledge graph
US20200311345A1 (en) System and method for language-independent contextual embedding
CN115481635A (en) Address element analysis method and system
CN111178080A (en) Named entity identification method and system based on structured information
CN113988057A (en) Title generation method, device, equipment and medium based on concept extraction
CN112632956A (en) Text matching method, device, terminal and storage medium
CA3166556A1 (en) Method and device for generating target advertorial based on deep learning
CN115906835A (en) Chinese question text representation learning method based on clustering and contrast learning
CN115563278A (en) Question classification processing method and device for sentence text
CN110717029A (en) Information processing method and system
CN110533035A (en) Students&#39; work page number recognition methods based on text matches
Fachrurrozi et al. Identification of Ambiguous Sentence Pattern in Indonesian Using Shift-Reduce Parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant