CN111460789A - LSTM sentence segmentation method, system and medium based on character embedding - Google Patents

LSTM sentence segmentation method, system and medium based on character embedding

Info

Publication number
CN111460789A
Authority
CN
China
Prior art keywords
character
current
lstm
sentence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010412860.1A
Other languages
Chinese (zh)
Other versions
CN111460789B (en)
Inventor
赵强利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202010412860.1A priority Critical patent/CN111460789B/en
Publication of CN111460789A publication Critical patent/CN111460789A/en
Application granted granted Critical
Publication of CN111460789B publication Critical patent/CN111460789B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an LSTM sentence segmentation method, system and medium based on character embedding. The character-embedding-based LSTM sentence segmentation method traverses the document to obtain the current candidate character, obtains the character strings before and after the current candidate character, uses them as the two inputs of a trained LSTM sentence segmentation model M to obtain a prediction result for the current candidate character, and applies different sentence segmentation processing according to whether the prediction result indicates a sentence end character.

Description

LSTM sentence segmentation method, system and medium based on character embedding
Technical Field
The invention relates to text mining technology, in particular to an LSTM sentence segmentation method, system and medium based on character embedding, which is particularly suitable for sentence segmentation in text mining of biomedical documents.
Background
The PubMed document library currently provides about 30 million abstracts and 5 million full texts, which are important data sources for text mining in the biomedical field. Mining biomedical documents to automatically acquire named entities such as genes, variants, diseases and drugs is an important way to construct basic databases in this field.
Sentence segmentation is an important basic step in text mining and named entity acquisition, and its accuracy directly influences the mining results. In natural language understanding, English sentence segmentation is relatively simple and is usually handled by rule matching: several characters are defined as sentence end symbols, and the document is split at those symbols. Sentence segmentation of biomedical documents, however, has its own particularities: author name abbreviations, abbreviations of domain terminology, and disease and variant entities occur frequently, and these special words contain many special characters such as parentheses, square brackets, periods, quotation marks and decimal numbers. Conventional rule matching easily mistakes such special characters for sentence end marks, produces segmentation errors, and seriously affects the results of Named Entity Recognition (NER).
Disclosure of Invention
The technical problem to be solved by the invention is to provide an LSTM sentence segmentation method, system and medium based on character embedding. The character-embedding-based LSTM sentence segmentation method of the invention takes the character information before and after a candidate end character as the input of an LSTM; this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
In order to solve the technical problems, the invention adopts the technical scheme that:
An LSTM sentence segmentation method based on character embedding, comprising the following implementation steps:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
Optionally, when the candidate end character nearest to the current position current_site is obtained in step 2), the candidate end character is any element of the candidate sentence end character set {., ?, ), ], ", !}, which contains six candidate end characters in total, separated by English commas.
Optionally, the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judge whether m spaces exist before the current candidate character Y; if m spaces exist, take the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, take the character string from the start of the document to the character immediately before Y as StringA.
Optionally, the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judge whether n spaces exist after the current candidate character Y; if n spaces exist, take the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, take the character string starting at the character immediately after Y and ending at the end of the document as StringB.
Optionally, the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; StringA and StringB obtained in step 3) are the two inputs of the model M and are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
Optionally, step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotate sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determine the candidate sentence end character set;
S3) set the training parameters of the LSTM sentence segmentation model M, and randomly generate a character embedding vector for each character;
S4) train the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judge whether the preset training end condition has been reached; if it has, output the trained LSTM sentence segmentation model M, finish and exit; otherwise, jump to step S4).
Optionally, the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) set the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scan the sample document from current_site and find the first candidate end character as the current candidate character Y;
S4.3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treat StringA as a character sequence and take the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treat StringB as a character sequence and take the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenate VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtain the prediction result from the output layer; according to the difference between the output of the output layer and the manual annotation, update the network weights and the character embedding vectors backward by gradient descent;
S4.5) update the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finish and exit; otherwise, jump to step S4.2).
In addition, the invention also provides an LSTM sentence segmentation system based on character embedding, comprising:
an initialization program unit, configured to perform initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
Furthermore, the present invention also provides a character-embedding-based LSTM sentence segmentation system comprising a computer device programmed or configured to perform the steps of the character-embedding-based LSTM sentence segmentation method, or whose storage medium stores a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the character-embedding-based LSTM sentence segmentation method.
Compared with the prior art, the invention has the following advantage: the character-embedding-based LSTM sentence segmentation method takes the character information before and after a candidate end character as the input of an LSTM, and this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the LSTM sentence segmentation model M in an embodiment of the present invention.
Detailed Description
As shown in Fig. 1, the implementation steps of the character-embedding-based LSTM sentence segmentation method of this embodiment include:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
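The control flow of steps 1)-8) can be summarized by the following minimal Python sketch. The names split_sentences, get_string_a, get_string_b and the predict_is_end wrapper around the trained model M are illustrative assumptions for this description, not names given in the patent.

END_CANDIDATES = {'.', '?', '!', ')', ']', '"'}  # candidate sentence end characters

def split_sentences(document, predict_is_end, get_string_a, get_string_b):
    sentences = []
    sentence_begin = 0              # step 1): first printable character position
    current_site = sentence_begin
    while True:
        # step 2): nearest candidate end character at or after current_site
        pos = next((i for i in range(current_site, len(document))
                    if document[i] in END_CANDIDATES), None)
        if pos is None:
            # step 8): no end character before the end of the document
            tail = document[sentence_begin:].strip()
            if tail:
                sentences.append(tail)
            return sentences
        # steps 3)-5): context strings and model prediction
        string_a = get_string_a(document, pos)
        string_b = get_string_b(document, pos)
        if predict_is_end(string_a, document[pos], string_b):
            # step 6): emit the sentence and restart after the end character
            sentences.append(document[sentence_begin:pos + 1].strip())
            sentence_begin = current_site = pos + 1
        else:
            # step 7): not a sentence end, keep scanning
            current_site = pos + 1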
In this embodiment, when the candidate end character nearest to the current position current_site is obtained in step 2), the candidate end character is any element of the candidate sentence end character set {., ?, ), ], ", !}, which contains six candidate end characters in total, separated by English commas.
In this embodiment, the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judge whether m spaces exist before the current candidate character Y; if m spaces exist, take the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, take the character string from the start of the document to the character immediately before Y as StringA.
In this embodiment, the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judge whether n spaces exist after the current candidate character Y; if n spaces exist, take the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, take the character string starting at the character immediately after Y and ending at the end of the document as StringB.
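A small Python sketch of this context extraction is given below. It assumes the default window sizes m=3 and n=2 described later in this embodiment, and the function names are illustrative rather than taken from the patent.

def get_string_a(document, pos, m=3):
    # Characters from the m-th space before the candidate character up to
    # (but not including) the candidate character; falls back to the
    # document start if fewer than m spaces precede it.
    spaces_seen, start = 0, 0
    for i in range(pos - 1, -1, -1):
        if document[i] == ' ':
            spaces_seen += 1
            if spaces_seen == m:
                start = i
                break
    return document[start:pos]

def get_string_b(document, pos, n=2):
    # Characters from just after the candidate character up to and including
    # the n-th following space; falls back to the document end if fewer
    # than n spaces follow.
    spaces_seen, end = 0, len(document)
    for i in range(pos + 1, len(document)):
        if document[i] == ' ':
            spaces_seen += 1
            if spaces_seen == n:
                end = i + 1
                break
    return document[pos + 1:end]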
As shown in Fig. 2, the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; StringA and StringB obtained in step 3) are the two inputs of the model M and are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
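This architecture can be sketched in PyTorch roughly as follows; the class name CharClauseLSTM and the constructor arguments are assumptions for illustration, with the default dimensions (character embedding size 32, LSTM output size 128) taken from the embodiment.

import torch
import torch.nn as nn

class CharClauseLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, lstm_dim=128):
        super().__init__()
        # Character-embedding table; vectors start random and are trained.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two character-level LSTMs: left context (StringA) and right context (StringB).
        self.lstm_left = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        self.lstm_right = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        # Fully connected layer over [VEC1 ; embedding of Y ; VEC2] with a binary output.
        self.fc = nn.Linear(2 * lstm_dim + embed_dim, 2)

    def forward(self, left_ids, right_ids, cand_id):
        # left_ids / right_ids: (batch, seq_len) character indices; cand_id: (batch,)
        _, (h_left, _) = self.lstm_left(self.embed(left_ids))     # VEC1
        _, (h_right, _) = self.lstm_right(self.embed(right_ids))  # VEC2
        cand_vec = self.embed(cand_id)                            # embedding of Y
        features = torch.cat([h_left[-1], cand_vec, h_right[-1]], dim=-1)
        return self.fc(features)  # logits for "sentence end" vs. "not sentence end"

Because the character embeddings live inside the module as parameters, they are updated together with the network weights during training, as step S4.4) below requires.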
In this embodiment, step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotate sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determine the candidate sentence end character set;
S3) set the training parameters of the LSTM sentence segmentation model M, and randomly generate a character embedding vector for each character;
S4) train the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judge whether the preset training end condition has been reached; if it has, output the trained LSTM sentence segmentation model M, finish and exit; otherwise, jump to step S4).
In this embodiment, the training sample set obtained in step S1) is derived from PubMed abstracts: a certain number of PubMed abstracts are manually annotated with sentence boundaries, and the number of manually annotated abstracts needs to exceed 1000.
In step S2) of this embodiment, when determining the candidate sentence end character set, a set is constructed from all characters that may end a sentence; the set currently used is {., ?, ), ], ", !}, containing six candidate end characters.
In step S3) of this embodiment, the training parameters of the LSTM sentence segmentation model M are set and a character embedding vector is randomly generated for each character. The parameters of the LSTM sentence segmentation model include the dimension k of the character embedding vectors, the dimension of the LSTM output vector, the number of words m before the candidate end character and the number of words n after it. By default, the character embedding dimension k is 32, the input layer of each LSTM has the dimension of the character embedding vectors, the output vector length is 128, m is 3 and n is 2. The initial character embedding vectors are also set in this step: since no character has a corresponding embedding vector at the beginning, a randomly generated vector of length k (default 32) is used as the initial value of each character's embedding vector.
In this embodiment, the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) set the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scan the sample document from current_site and find the first candidate end character as the current candidate character Y;
S4.3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treat StringA as a character sequence and take the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treat StringB as a character sequence and take the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenate VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtain the prediction result from the softmax output layer; according to the difference between the output of the output layer and the manual annotation, update the network weights and the character embedding vectors backward by gradient descent;
S4.5) update the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finish and exit; otherwise, jump to step S4.2).
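Under the same assumptions as the architecture sketch above (a CharClauseLSTM model and batches of character indices for StringA, the candidate character Y, StringB and the manually annotated label), the update in step S4.4) corresponds roughly to the following sketch, with plain SGD standing in for the gradient descent named in the patent.

import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer):
    criterion = nn.CrossEntropyLoss()
    model.train()
    for left_ids, cand_id, right_ids, label in loader:
        optimizer.zero_grad()
        logits = model(left_ids, right_ids, cand_id)
        # The difference between the prediction and the manual annotation drives
        # gradient descent, updating both the network weights and the
        # character-embedding vectors (they are model parameters).
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()

An optimizer such as torch.optim.SGD(model.parameters(), lr=0.1) would be passed in, so the embedding table and both LSTMs are updated together.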
After the training stage is completed, the input document D can be automatically segmented into sentences (the inference stage) using the trained LSTM sentence segmentation model M; details are shown in Fig. 1 and the foregoing steps 1)-8). Combining the training stage and the inference stage, it can be seen that in this embodiment the character-embedding-based LSTM sentence segmentation method takes the character information before and after a candidate end character as the input of an LSTM, and this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
In addition, this embodiment also provides an LSTM sentence segmentation system based on character embedding, comprising:
an initialization program unit, configured to set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
In addition, this embodiment also provides a character-embedding-based LSTM sentence segmentation system comprising a computer device programmed or configured to execute the steps of the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment, or whose storage medium stores a computer program programmed or configured to execute the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment.
Furthermore, this embodiment also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions falling within the idea of the present invention fall within its protection scope. It should be noted that modifications and refinements that those skilled in the art may make without departing from the principles of the present invention are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. An LSTM sentence segmentation method based on character embedding, characterized by comprising the following implementation steps:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
2. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the candidate end character nearest to the current position current_site in step 2) is any element of the candidate sentence end character set {., ?, ), ], ", !}.
3. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judging whether m spaces exist before the current candidate character Y; if m spaces exist, taking the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, taking the character string from the start of the document to the character immediately before Y as StringA.
4. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judging whether n spaces exist after the current candidate character Y; if n spaces exist, taking the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, taking the character string starting at the character immediately after Y and ending at the end of the document as StringB.
5. The character-embedding-based LSTM sentence segmentation method according to any one of claims 1-4, wherein the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; the two inputs StringA and StringB of the model M are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
6. The character-embedding-based LSTM sentence segmentation method according to claim 5, wherein step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotating sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determining the candidate sentence end character set;
S3) setting the training parameters of the LSTM sentence segmentation model M, and randomly generating a character embedding vector for each character;
S4) training the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judging whether the preset training end condition has been reached; if it has, outputting the trained LSTM sentence segmentation model M, finishing and exiting; otherwise, jumping to step S4).
7. The character-embedding-based LSTM sentence segmentation method according to claim 6, wherein the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) setting the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scanning the sample document from current_site and finding the first candidate end character as the current candidate character Y;
S4.3) obtaining the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treating StringA as a character sequence and taking the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treating StringB as a character sequence and taking the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenating VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtaining the prediction result from the output layer; according to the difference between the output of the output layer and the manual annotation, updating the network weights and the character embedding vectors backward by gradient descent;
S4.5) updating the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finishing and exiting; otherwise, jumping to step S4.2).
8. An LSTM sentence segmentation system based on character embedding, characterized by comprising:
an initialization program unit, configured to perform initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
9. An LSTM sentence segmentation system based on character embedding, comprising a computer device, characterized in that the computer device is programmed or configured to perform the steps of the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7, or a storage medium of the computer device has stored thereon a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7.
CN202010412860.1A 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding Active CN111460789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010412860.1A CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010412860.1A CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Publications (2)

Publication Number Publication Date
CN111460789A true CN111460789A (en) 2020-07-28
CN111460789B CN111460789B (en) 2023-07-07

Family

ID=71681981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010412860.1A Active CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Country Status (1)

Country Link
CN (1) CN111460789B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113238664A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Character determination method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006277674A (en) * 2005-03-30 2006-10-12 Advanced Telecommunication Research Institute International Sentence division computer program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006277674A (en) * 2005-03-30 2006-10-12 Advanced Telecommunication Research Institute International Sentence division computer program

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YAQIN YANG, NIANWEN XUE: "Chinese Comma Disambiguation for Discourse Analysis" *
李艳翠; 冯文贺; 周国栋; 朱坤华: "Research on Chinese Clause Identification Based on Commas" *
马伟珍; 完么扎西; 尼玛扎西: "A Sentence Boundary Identification Method for Tibetan" *
黄成哲 et al.: "Automatic Identification of English Sentence Boundaries" *
黄河燕, 陈肇雄: "A Translation Algorithm for Complex Long Sentences Based on Multi-Strategy Analysis" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113238664A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Character determination method and device and electronic equipment
CN113238664B (en) * 2021-05-14 2023-07-25 北京百度网讯科技有限公司 Character determining method and device and electronic equipment
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113204956B (en) * 2021-07-06 2021-10-08 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device

Also Published As

Publication number Publication date
CN111460789B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN107305768B (en) Error-prone character calibration method in voice interaction
KR101425182B1 (en) Typing candidate generating method for enhancing typing efficiency
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN109753661B (en) Machine reading understanding method, device, equipment and storage medium
CN110750977B (en) Text similarity calculation method and system
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN111460789A (en) LSTM sentence segmentation method, system and medium based on character embedding
CN110705253A (en) Burma language dependency syntax analysis method and device based on transfer learning
CN110750984A (en) Command line character string processing method, terminal, device and readable storage medium
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN114218926A (en) Chinese spelling error correction method and system based on word segmentation and knowledge graph
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN108664464B (en) Method and device for determining semantic relevance
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN113988063A (en) Text error correction method, device and equipment and computer readable storage medium
CN113935317A (en) Text error correction method and device, electronic equipment and storage medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
JP7102710B2 (en) Information generation program, word extraction program, information processing device, information generation method and word extraction method
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
WO2010026804A1 (en) Approximate collation device, approximate collation method, program, and recording medium
Ma et al. Bootstrapping structured page segmentation
Kaur et al. Improving the accuracy of tesseract OCR engine for machine printed Hindi documents
CN111078898B (en) Multi-tone word annotation method, device and computer readable storage medium
JP2006053866A (en) Detection method of notation variability of katakana character string

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant