CN111460789A - LSTM sentence segmentation method, system and medium based on character embedding - Google Patents

LSTM sentence segmentation method, system and medium based on character embedding

Info

Publication number
CN111460789A
Authority
CN
China
Prior art keywords
character
current
lstm
sentence
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010412860.1A
Other languages
Chinese (zh)
Other versions
CN111460789B (en)
Inventor
赵强利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology filed Critical Hunan University of Technology
Priority to CN202010412860.1A priority Critical patent/CN111460789B/en
Publication of CN111460789A publication Critical patent/CN111460789A/en
Application granted granted Critical
Publication of CN111460789B publication Critical patent/CN111460789B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an LSTM sentence segmentation method, system and medium based on character embedding. The character-embedding-based LSTM sentence segmentation method traverses the document to obtain the current candidate character, obtains the character strings before and after the current candidate character, uses them as the two inputs of a trained LSTM sentence segmentation model M to obtain a prediction result for the current candidate character, and applies different sentence segmentation processing according to whether the prediction result indicates a sentence end character.

Description

LSTM sentence segmentation method, system and medium based on character embedding
Technical Field
The invention relates to text mining technology, in particular to an LSTM sentence segmentation method, system and medium based on character embedding, which is particularly suitable for sentence segmentation in text mining of biomedical documents.
Background
The PubMed document library currently provides about 30 million abstracts and 5 million full texts, which are important data sources for text mining in the biomedical field. Mining biomedical documents to automatically acquire named entities such as genes, variants, diseases and drugs is an important way to construct basic databases in this field.
Sentence segmentation is an important basic step in text mining and named entity acquisition, and its accuracy directly influences the mining results. In natural language understanding, English sentence segmentation is relatively simple and is usually handled by rule matching: several characters are defined as sentence end symbols, and the document is split at those symbols. Sentence segmentation of biomedical documents, however, has its own particularities: author name abbreviations, abbreviations of domain terminology, and disease and variant entities occur frequently, and these special words contain many special characters such as parentheses, square brackets, periods, quotation marks and decimal numbers. Conventional rule matching easily mistakes such special characters for sentence end marks, produces segmentation errors, and seriously affects the results of Named Entity Recognition (NER).
Disclosure of Invention
The technical problem to be solved by the invention is to provide an LSTM sentence segmentation method, system and medium based on character embedding. The character-embedding-based LSTM sentence segmentation method of the invention takes the character information before and after a candidate end character as the input of an LSTM; this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
In order to solve the technical problems, the invention adopts the technical scheme that:
An LSTM sentence segmentation method based on character embedding, comprising the following implementation steps:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
Optionally, when the candidate end character nearest to the current position current_site is obtained in step 2), the candidate end character is any element of the candidate sentence end character set {., ?, ), ], ", !}, which contains six candidate end characters in total, separated by English commas.
Optionally, the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judge whether m spaces exist before the current candidate character Y; if m spaces exist, take the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, take the character string from the start of the document to the character immediately before Y as StringA.
Optionally, the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judge whether n spaces exist after the current candidate character Y; if n spaces exist, take the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, take the character string starting at the character immediately after Y and ending at the end of the document as StringB.
Optionally, the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; StringA and StringB obtained in step 3) are the two inputs of the model M and are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
Optionally, step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotate sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determine the candidate sentence end character set;
S3) set the training parameters of the LSTM sentence segmentation model M, and randomly generate a character embedding vector for each character;
S4) train the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judge whether the preset training end condition has been reached; if it has, output the trained LSTM sentence segmentation model M, finish and exit; otherwise, jump to step S4).
Optionally, the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) set the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scan the sample document from current_site and find the first candidate end character as the current candidate character Y;
S4.3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treat StringA as a character sequence and take the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treat StringB as a character sequence and take the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenate VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtain the prediction result from the output layer; according to the difference between the output of the output layer and the manual annotation, update the network weights and the character embedding vectors backward by gradient descent;
S4.5) update the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finish and exit; otherwise, jump to step S4.2).
In addition, the invention also provides an LSTM sentence segmentation system based on character embedding, comprising:
an initialization program unit, configured to perform initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
Furthermore, the present invention also provides a character-embedding-based LSTM sentence segmentation system comprising a computer device programmed or configured to perform the steps of the character-embedding-based LSTM sentence segmentation method, or whose storage medium stores a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method.
Furthermore, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the character-embedding-based LSTM sentence segmentation method.
Compared with the prior art, the invention has the following advantage: the character-embedding-based LSTM sentence segmentation method takes the character information before and after a candidate end character as the input of an LSTM, and this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the LSTM sentence segmentation model M in an embodiment of the present invention.
Detailed Description
As shown in Fig. 1, the implementation steps of the character-embedding-based LSTM sentence segmentation method of this embodiment include:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
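The control flow of steps 1)-8) can be summarized by the following minimal Python sketch. The names split_sentences, get_string_a, get_string_b and the predict_is_end wrapper around the trained model M are illustrative assumptions for this description, not names given in the patent.

END_CANDIDATES = {'.', '?', '!', ')', ']', '"'}  # candidate sentence end characters

def split_sentences(document, predict_is_end, get_string_a, get_string_b):
    sentences = []
    sentence_begin = 0              # step 1): first printable character position
    current_site = sentence_begin
    while True:
        # step 2): nearest candidate end character at or after current_site
        pos = next((i for i in range(current_site, len(document))
                    if document[i] in END_CANDIDATES), None)
        if pos is None:
            # step 8): no end character before the end of the document
            tail = document[sentence_begin:].strip()
            if tail:
                sentences.append(tail)
            return sentences
        # steps 3)-5): context strings and model prediction
        string_a = get_string_a(document, pos)
        string_b = get_string_b(document, pos)
        if predict_is_end(string_a, document[pos], string_b):
            # step 6): emit the sentence and restart after the end character
            sentences.append(document[sentence_begin:pos + 1].strip())
            sentence_begin = current_site = pos + 1
        else:
            # step 7): not a sentence end, keep scanning
            current_site = pos + 1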
In this embodiment, when the candidate end character nearest to the current position current_site is obtained in step 2), the candidate end character is any element of the candidate sentence end character set {., ?, ), ], ", !}, which contains six candidate end characters in total, separated by English commas.
In this embodiment, the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judge whether m spaces exist before the current candidate character Y; if m spaces exist, take the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, take the character string from the start of the document to the character immediately before Y as StringA.
In this embodiment, the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judge whether n spaces exist after the current candidate character Y; if n spaces exist, take the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, take the character string starting at the character immediately after Y and ending at the end of the document as StringB.
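A small Python sketch of this context extraction is given below. It assumes the default window sizes m=3 and n=2 described later in this embodiment, and the function names are illustrative rather than taken from the patent.

def get_string_a(document, pos, m=3):
    # Characters from the m-th space before the candidate character up to
    # (but not including) the candidate character; falls back to the
    # document start if fewer than m spaces precede it.
    spaces_seen, start = 0, 0
    for i in range(pos - 1, -1, -1):
        if document[i] == ' ':
            spaces_seen += 1
            if spaces_seen == m:
                start = i
                break
    return document[start:pos]

def get_string_b(document, pos, n=2):
    # Characters from just after the candidate character up to and including
    # the n-th following space; falls back to the document end if fewer
    # than n spaces follow.
    spaces_seen, end = 0, len(document)
    for i in range(pos + 1, len(document)):
        if document[i] == ' ':
            spaces_seen += 1
            if spaces_seen == n:
                end = i + 1
                break
    return document[pos + 1:end]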
As shown in Fig. 2, the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; StringA and StringB obtained in step 3) are the two inputs of the model M and are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
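This architecture can be sketched in PyTorch roughly as follows; the class name CharClauseLSTM and the constructor arguments are assumptions for illustration, with the default dimensions (character embedding size 32, LSTM output size 128) taken from the embodiment.

import torch
import torch.nn as nn

class CharClauseLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, lstm_dim=128):
        super().__init__()
        # Character-embedding table; vectors start random and are trained.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two character-level LSTMs: left context (StringA) and right context (StringB).
        self.lstm_left = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        self.lstm_right = nn.LSTM(embed_dim, lstm_dim, batch_first=True)
        # Fully connected layer over [VEC1 ; embedding of Y ; VEC2] with a binary output.
        self.fc = nn.Linear(2 * lstm_dim + embed_dim, 2)

    def forward(self, left_ids, right_ids, cand_id):
        # left_ids / right_ids: (batch, seq_len) character indices; cand_id: (batch,)
        _, (h_left, _) = self.lstm_left(self.embed(left_ids))     # VEC1
        _, (h_right, _) = self.lstm_right(self.embed(right_ids))  # VEC2
        cand_vec = self.embed(cand_id)                            # embedding of Y
        features = torch.cat([h_left[-1], cand_vec, h_right[-1]], dim=-1)
        return self.fc(features)  # logits for "sentence end" vs. "not sentence end"

Because the character embeddings live inside the module as parameters, they are updated together with the network weights during training, as step S4.4) below requires.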
In this embodiment, step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotate sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determine the candidate sentence end character set;
S3) set the training parameters of the LSTM sentence segmentation model M, and randomly generate a character embedding vector for each character;
S4) train the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judge whether the preset training end condition has been reached; if it has, output the trained LSTM sentence segmentation model M, finish and exit; otherwise, jump to step S4).
In this embodiment, the training sample set obtained in step S1) is derived from PubMed abstracts: a certain number of PubMed abstracts are manually annotated with sentence boundaries, and the number of manually annotated abstracts needs to exceed 1000.
In step S2) of this embodiment, when determining the candidate sentence end character set, a set is constructed from all characters that may end a sentence; the set currently used is {., ?, ), ], ", !}, containing six candidate end characters.
In step S3) of this embodiment, the training parameters of the LSTM sentence segmentation model M are set and a character embedding vector is randomly generated for each character. The parameters of the LSTM sentence segmentation model include the dimension k of the character embedding vectors, the dimension of the LSTM output vector, the number of words m before the candidate end character and the number of words n after it. By default, the character embedding dimension k is 32, the input layer of each LSTM has the dimension of the character embedding vectors, the output vector length is 128, m is 3 and n is 2. The initial character embedding vectors are also set in this step: since no character has a corresponding embedding vector at the beginning, a randomly generated vector of length k (default 32) is used as the initial value of each character's embedding vector.
In this embodiment, the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) set the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scan the sample document from current_site and find the first candidate end character as the current candidate character Y;
S4.3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treat StringA as a character sequence and take the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treat StringB as a character sequence and take the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenate VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtain the prediction result from the softmax output layer; according to the difference between the output of the output layer and the manual annotation, update the network weights and the character embedding vectors backward by gradient descent;
S4.5) update the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finish and exit; otherwise, jump to step S4.2).
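Under the same assumptions as the architecture sketch above (a CharClauseLSTM model and batches of character indices for StringA, the candidate character Y, StringB and the manually annotated label), the update in step S4.4) corresponds roughly to the following sketch, with plain SGD standing in for the gradient descent named in the patent.

import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer):
    criterion = nn.CrossEntropyLoss()
    model.train()
    for left_ids, cand_id, right_ids, label in loader:
        optimizer.zero_grad()
        logits = model(left_ids, right_ids, cand_id)
        # The difference between the prediction and the manual annotation drives
        # gradient descent, updating both the network weights and the
        # character-embedding vectors (they are model parameters).
        loss = criterion(logits, label)
        loss.backward()
        optimizer.step()

An optimizer such as torch.optim.SGD(model.parameters(), lr=0.1) would be passed in, so the embedding table and both LSTMs are updated together.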
After the training stage is completed, the input document D can be automatically segmented into sentences (the inference stage) using the trained LSTM sentence segmentation model M; details are shown in Fig. 1 and the foregoing steps 1)-8). Combining the training stage and the inference stage, it can be seen that in this embodiment the character-embedding-based LSTM sentence segmentation method takes the character information before and after a candidate end character as the input of an LSTM, and this use of the context around the candidate end character makes it possible to accurately distinguish whether a candidate end character in the text is the end of a sentence or a special character within the document.
In addition, this embodiment also provides an LSTM sentence segmentation system based on character embedding, comprising:
an initialization program unit, configured to set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
In addition, this embodiment also provides a character-embedding-based LSTM sentence segmentation system comprising a computer device programmed or configured to execute the steps of the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment, or whose storage medium stores a computer program programmed or configured to execute the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment.
Furthermore, this embodiment also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the aforementioned character-embedding-based LSTM sentence segmentation method of this embodiment.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions falling within the idea of the present invention fall within its protection scope. It should be noted that modifications and refinements that those skilled in the art may make without departing from the principles of the present invention are also regarded as falling within the protection scope of the present invention.

Claims (10)

1. An LSTM sentence segmentation method based on character embedding, characterized by comprising the following implementation steps:
1) initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
2) starting from current_site, scan the input document D toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to step 3); otherwise, jump to step 8);
3) obtain the character string StringA before the current candidate character Y and the character string StringB after it;
4) use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
5) judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to step 6); otherwise, jump to step 7);
6) output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to step 2);
7) judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to step 8); otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to step 2);
8) handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
2. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the candidate end character nearest to the current position current_site in step 2) is any element of the candidate sentence end character set {., ?, ), ], ", !}.
3. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the detailed steps of obtaining the character string StringA before the current candidate character Y in step 3) include: judging whether m spaces exist before the current candidate character Y; if m spaces exist, taking the character string starting at the m-th space before Y and ending at the character immediately before Y as StringA; otherwise, taking the character string from the start of the document to the character immediately before Y as StringA.
4. The character-embedding-based LSTM sentence segmentation method according to claim 1, wherein the detailed steps of obtaining the character string StringB after the current candidate character Y in step 3) include: judging whether n spaces exist after the current candidate character Y; if n spaces exist, taking the character string starting at the character immediately after Y and ending at the n-th following space as StringB; if the end of the document is reached first, taking the character string starting at the character immediately after Y and ending at the end of the document as StringB.
5. The character-embedding-based LSTM sentence segmentation method according to any one of claims 1-4, wherein the trained LSTM sentence segmentation model M in step 4) comprises two character-level LSTMs, a concatenation layer, a fully connected layer and an output layer; the two inputs StringA and StringB of the model M are each fed to one character-level LSTM; the concatenation layer concatenates the outputs of the two character-level LSTMs with the character embedding vector of the candidate end character as the input of the fully connected layer; and the output layer outputs the prediction of whether the candidate is a sentence end character.
6. The character-embedding-based LSTM sentence segmentation method according to claim 5, wherein step 4) is preceded by a step of training the LSTM sentence segmentation model M, the detailed steps of which include:
S1) manually annotating sentence boundaries in a specified number of documents to obtain a training sample set;
S2) determining the candidate sentence end character set;
S3) setting the training parameters of the LSTM sentence segmentation model M, and randomly generating a character embedding vector for each character;
S4) training the LSTM sentence segmentation model M with the manually annotated sample documents and their character embedding vectors;
S5) judging whether the preset training end condition has been reached; if it has, outputting the trained LSTM sentence segmentation model M, finishing and exiting; otherwise, jumping to step S4).
7. The character-embedding-based LSTM sentence segmentation method according to claim 6, wherein the training step in step S4) for each manually annotated sample document and its character embedding vectors includes:
S4.1) setting the current position current_site of the sample document to the first printable character of the sample document;
S4.2) scanning the sample document from current_site and finding the first candidate end character as the current candidate character Y;
S4.3) obtaining the character string StringA before the current candidate character Y and the character string StringB after it;
S4.4) treating StringA as a character sequence and taking the character vector of each character in the sequence as the input of the first LSTM, whose output is VEC1; treating StringB as a character sequence and taking the character vector of each character in the sequence as the input of the second LSTM, whose output is VEC2; then concatenating VEC1, the character embedding vector of the candidate end character Y and VEC2 in order as the input of the fully connected layer, and obtaining the prediction result from the output layer; according to the difference between the output of the output layer and the manual annotation, updating the network weights and the character embedding vectors backward by gradient descent;
S4.5) updating the current position current_site to the character position following the current candidate character Y; if the end of the document has been reached, finishing and exiting; otherwise, jumping to step S4.2).
8. An LSTM sentence segmentation system based on character embedding, characterized by comprising:
an initialization program unit, configured to perform initialization: set the sentence start position sentence_begin to the first printable character position of the input document D, and set the current position current_site to sentence_begin;
a current candidate character search program unit, configured to scan the input document D from current_site toward its end and obtain the candidate end character nearest to current_site as the current candidate character Y; if the acquisition succeeds, jump to the character string acquisition program unit; otherwise, jump to the post-processing program unit;
a character string acquisition program unit, configured to acquire the character string StringA before the current candidate character Y and the character string StringB after it;
a prediction model calling program unit, configured to use the character strings StringA and StringB as the two inputs of the trained LSTM sentence segmentation model M to obtain the prediction result M(D, Position(Y)) for the current candidate character Y;
a prediction result judging program unit, configured to judge whether the prediction result M(D, Position(Y)) of the current candidate character Y indicates a sentence end character; if it does, jump to the end character processing program unit; otherwise, jump to the non-end character processing program unit;
an end character processing program unit, configured to output the character string from the sentence start position sentence_begin up to and including the current candidate character Y as a complete sentence; judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so sentence prediction ends and the procedure exits; otherwise, set both the current position current_site and the sentence start position sentence_begin to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a non-end character processing program unit, configured to judge whether printable characters remain in the input document D; if none remain, the end of the input document D has been reached, so jump to the post-processing program unit; otherwise, set the current position current_site to the position of the next printable character after the current candidate character Y and jump to the current candidate character search program unit;
a post-processing program unit, configured to handle the case of no end character at the end of the document: output the character string from the sentence start position sentence_begin to the last printable character of the input document D as a sentence; sentence prediction ends.
9. An LSTM sentence segmentation system based on character embedding, comprising a computer device, characterized in that the computer device is programmed or configured to perform the steps of the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7, or a storage medium of the computer device has stored thereon a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the character-embedding-based LSTM sentence segmentation method according to any one of claims 1-7.
CN202010412860.1A 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding Active CN111460789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010412860.1A CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010412860.1A CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Publications (2)

Publication Number Publication Date
CN111460789A true CN111460789A (en) 2020-07-28
CN111460789B CN111460789B (en) 2023-07-07

Family

ID=71681981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010412860.1A Active CN111460789B (en) 2020-05-15 2020-05-15 LSTM clause method, system and medium based on character embedding

Country Status (1)

Country Link
CN (1) CN111460789B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113238664A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Character determination method and device and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006277674A (en) * 2005-03-30 2006-10-12 Advanced Telecommunication Research Institute International Sentence division computer program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006277674A (en) * 2005-03-30 2006-10-12 Advanced Telecommunication Research Institute International Sentence division computer program

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
YAQIN YANG, NIANWEN XUE: "Chinese Comma Disambiguation for Discourse Analysis" *
李艳翠; 冯文贺; 周国栋; 朱坤华: "Research on Chinese Clause Identification Based on Commas" *
马伟珍; 完么扎西; 尼玛扎西: "A Sentence Boundary Identification Method for Tibetan" *
黄成哲 et al.: "Automatic Identification of English Sentence Boundaries" *
黄河燕, 陈肇雄: "A Translation Algorithm for Complex Long Sentences Based on Multi-Strategy Analysis" *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113238664A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Character determination method and device and electronic equipment
CN113238664B (en) * 2021-05-14 2023-07-25 北京百度网讯科技有限公司 Character determining method and device and electronic equipment
CN113204956A (en) * 2021-07-06 2021-08-03 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN113204956B (en) * 2021-07-06 2021-10-08 深圳市北科瑞声科技股份有限公司 Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device

Also Published As

Publication number Publication date
CN111460789B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN107305768B (en) Error-prone character calibration method in voice interaction
KR101425182B1 (en) Typing candidate generating method for enhancing typing efficiency
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN109753661B (en) Machine reading understanding method, device, equipment and storage medium
CN110750977B (en) Text similarity calculation method and system
TWI567569B (en) Natural language processing systems, natural language processing methods, and natural language processing programs
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN111460789A (en) LSTM sentence segmentation method, system and medium based on character embedding
CN110705253A (en) Burma language dependency syntax analysis method and device based on transfer learning
CN110750984A (en) Command line character string processing method, terminal, device and readable storage medium
CN110968725A (en) Image content description information generation method, electronic device, and storage medium
CN114218926A (en) Chinese spelling error correction method and system based on word segmentation and knowledge graph
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
CN108664464B (en) Method and device for determining semantic relevance
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN113988063A (en) Text error correction method, device and equipment and computer readable storage medium
CN113935317A (en) Text error correction method and device, electronic equipment and storage medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
JP7102710B2 (en) Information generation program, word extraction program, information processing device, information generation method and word extraction method
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
WO2010026804A1 (en) Approximate collation device, approximate collation method, program, and recording medium
Ma et al. Bootstrapping structured page segmentation
Kaur et al. Improving the accuracy of tesseract OCR engine for machine printed Hindi documents
CN111078898B (en) Multi-tone word annotation method, device and computer readable storage medium
JP2006053866A (en) Detection method of notation variability of katakana character string

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant