CN111079432B - Text detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111079432B
CN111079432B (Application CN201911088731.5A)
Authority
CN
China
Prior art keywords
text
detected
target
identification
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911088731.5A
Other languages
Chinese (zh)
Other versions
CN111079432A (en)
Inventor
陈利琴
刘设伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Taikang Online Property Insurance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd and Taikang Online Property Insurance Co Ltd
Priority to CN201911088731.5A
Publication of CN111079432A
Application granted
Publication of CN111079432B
Legal status: Active


Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00: Pattern recognition
                    • G06F 18/20: Analysing
                        • G06F 18/24: Classification techniques
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/04: Architecture, e.g. interconnection topology
                            • G06N 3/044: Recurrent networks, e.g. Hopfield networks
                            • G06N 3/045: Combinations of networks
                        • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
        • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
                • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Embodiments of the invention provide a text detection method and device, electronic equipment, and a computer-readable storage medium, belonging to the field of computer technology. The text detection method comprises the following steps: parsing a text to be detected to obtain a target text to be detected, and marking the identification text present in the target text to be detected; classifying the target text to be detected with a trained classification model to determine a classification result; if the classification result belongs to a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result; if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison of the text entity with the identification text. Embodiments of the invention can improve the accuracy of text detection.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a text detection method, a text detection device, electronic equipment and a computer readable storage medium.
Background
Documents generally use bold text to provide reminders or emphasis, so recognizing bold text is an important task.
In the related art, bold text is generally detected with a text classification method or a conventional rule-matching method. Text classification can classify a sentence, a paragraph, or a whole document, but cannot directly classify a single word or part of a sentence. When only part of a sentence is bold, the classification method therefore still requires manual checking of whether the text is correct, so the recognition rate and recognition efficiency are low, and the compliance of bold text cannot be detected accurately. The rule-matching method ignores the semantic information of the text, so its accuracy in recognizing bold text is low.
It should be noted that the information disclosed in the background section above is only intended to enhance understanding of the background of the invention and may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
An object of an embodiment of the present invention is to provide a text detection method, a text detection device, an electronic apparatus, and a computer readable storage medium, so as to overcome the problems of low accuracy and recognition efficiency in recognizing bold text at least to some extent.
Other features and advantages of embodiments of the invention will be apparent from the following detailed description, or may be learned by the practice of the invention.
According to an aspect of an embodiment of the present invention, there is provided a text detection method including: analyzing a text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected; classifying the target text to be detected through the trained classification model to determine a classification result; if the classification result is of a preset type, comparing the target text to be detected with the identification text, and carrying out compliance detection on the target text to be detected according to the comparison result; if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and carrying out compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
In an exemplary embodiment of the present invention, before classifying the target text to be detected by the trained classification model, the method further includes: acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence; training the word vector sequence through a long short-term memory network to obtain context features; applying a max-pooling operation to the context features to obtain a feature vector of the word vector sequence; and sequentially inputting the feature vector into a linear layer and a classification layer to obtain the trained classification model.
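The pipeline listed above (embedding layer, long short-term memory network, max pooling, linear layer, classification layer) can be sketched as a single forward pass. This is only an illustrative sketch, not the patented implementation: a plain tanh recurrent layer run in both directions stands in for the Bi-LSTM, and all weights and sizes are made-up assumptions, so the output probabilities carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(1)

def classify(word_vectors, num_classes=2):
    """Forward pass: Bi-RNN context features -> max pooling -> linear -> softmax.

    word_vectors -- (seq_len, d) word vector sequence from the embedding layer.
    A tanh RNN stands in for the Bi-LSTM; weights are random (illustrative).
    """
    seq_len, d = word_vectors.shape
    h = 8                                        # hidden size (assumption)
    Wx, Wh = rng.normal(size=(d, h)), rng.normal(size=(h, h))

    def run(xs):                                 # one direction of the RNN
        state, outs = np.zeros(h), []
        for x in xs:
            state = np.tanh(x @ Wx + state @ Wh)
            outs.append(state)
        return outs

    fwd = run(word_vectors)                      # forward pass
    bwd = run(word_vectors[::-1])[::-1]          # backward pass
    context = np.array([np.concatenate(p) for p in zip(fwd, bwd)])
    feature = context.max(axis=0)                # max pooling over time
    W = rng.normal(size=(2 * h, num_classes))    # linear layer
    logits = feature @ W
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax classification layer
    return probs

probs = classify(rng.normal(size=(6, 4)))        # a 6-word toy sentence
print(probs.shape)                               # (2,): one probability per class
```

The max pooling step collapses the per-word context features into one fixed-size sentence feature vector, which is what lets a variable-length sentence be classified by a single linear layer.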
In an exemplary embodiment of the present invention, acquiring first sample data and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence includes: preprocessing the history detection text, and labeling the preprocessed history detection text as positive samples and negative samples to obtain the first sample data; obtaining word vectors from the preprocessed history detection text; serializing the preprocessed history detection text to obtain a sequence history detection text; and constructing the embedding layer from the word vectors and the sequence history detection text, and inputting the first sample data into the embedding layer to generate the word vector sequence.
In an exemplary embodiment of the present invention, before the sequence labeling is performed on the target text to be detected through the trained named entity recognition model to determine a text entity of the target text to be detected, the method further includes: acquiring second sample data, wherein the second sample data is obtained according to a sequence labeling rule; inputting the second sample data into a long-short-time memory network to obtain the probability that each word in the second sample data is marked with a sequence label respectively; and inputting the transition probability between the probability and the label into a conditional random field layer to carry out sentence-level sequence labeling so as to obtain the trained named entity recognition model.
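Sentence-level sequence labeling with a conditional random field layer, as described above, is typically decoded with the Viterbi algorithm over the per-word label probabilities and the label transition probabilities. The sketch below is a minimal illustration under stated assumptions (BIO-style tags, made-up emission and transition scores), not the patented implementation.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence for a sentence.

    emissions   -- (seq_len, num_tags) per-word tag scores (e.g. Bi-LSTM output)
    transitions -- (num_tags, num_tags) score of moving from tag i to tag j
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                  # best score ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # broadcast: score[i] + transitions[i, j] + emission for every (i, j)
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]

# Toy example: 3 words, BIO tags (O=0, B=1, I=2); all scores are illustrative.
emissions = np.array([[0.1, 2.0, 0.1],
                      [0.2, 0.1, 1.5],
                      [1.0, 0.3, 0.2]])
transitions = np.array([[0.5, 0.5, -9.0],   # O -> I is made implausible
                        [0.2, 0.1, 1.0],
                        [0.2, 0.1, 0.8]])
print(viterbi_decode(emissions, transitions))  # → [1, 2, 0], i.e. B, I, O
```

The transition matrix is what makes the labeling sentence-level: an implausible transition (here O to I) is penalized even when the per-word emission score would prefer it.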
In an exemplary embodiment of the present invention, marking the identification text existing in the target text to be detected includes: if the identification text is the entire target text to be detected, adding a first mark to the identification text and determining the mark position of the identification text; and if the identification text is only part of the target text to be detected, adding a second mark to the identification text and determining the mark position of the identification text.
In an exemplary embodiment of the present invention, comparing the target text to be detected with the identification text and performing compliance detection on the target text to be detected according to the comparison result includes: acquiring the identification text in the target text to be detected according to the mark position; if the target text in the target text to be detected is consistent with the identification text, determining that the target text is compliant; and if the target text is inconsistent with the identification text, determining that the target text is non-compliant.
In an exemplary embodiment of the present invention, performing compliance detection on the target text to be detected according to a comparison result between the text entity and the identification text includes: acquiring a text entity according to a marking position of the text entity; if the text entity is consistent with the identification text, determining that the text entity is compliant; and if the text entity is inconsistent with the identification text, determining that the text entity is not compliant.
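The comparison logic shared by both detection branches can be sketched as follows. The function name, span representation, and example sentence are illustrative assumptions; only the consistent/compliant rule comes from the text above.

```python
def is_compliant(sentence, marked_span, recognized_text):
    """Compare the text at the marked position against the text the model
    recognized (the classified target text, or the NER text entity).

    marked_span -- (start, end) character positions of the identification text
    Returns True when the two are consistent (compliant), False otherwise.
    """
    start, end = marked_span
    identification_text = sentence[start:end + 1]  # text at the mark position
    return identification_text == recognized_text

sentence = "被保险人申请索赔时应提供下列证明和资料"
print(is_compliant(sentence, (0, 4), "被保险人申"))  # True: compliant
print(is_compliant(sentence, (0, 4), "被保险人"))    # False: non-compliant
```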
According to an aspect of an embodiment of the present invention, there is provided a text detection apparatus including: the identification text determining module is used for analyzing the text to be detected to obtain a target text to be detected and marking the identification text in the target text to be detected; the text classification module is used for classifying the target text to be detected through a trained classification model so as to determine a classification result; the first detection module is used for comparing the target text to be detected with the identification text if the classification result belongs to a preset type, and carrying out compliance detection on the target text to be detected according to the comparison result; and the second detection module is used for inputting the target text to be detected into a trained named entity recognition model to determine a text entity if the classification result does not belong to the preset type, and carrying out compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
According to an aspect of an embodiment of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a text detection method as set forth in any one of the above.
According to an aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the text detection methods described above via execution of the executable instructions.
In the text detection method and device, electronic equipment, and computer-readable storage medium provided by the embodiments of the invention, the target text to be detected is classified by the trained classification model to determine whether the classification result belongs to a preset type. If it does, the target text to be detected is compared with the identification text for compliance detection; if it does not, the target text to be detected is sequence-labeled by the trained named entity recognition model to determine its text entity, and compliance detection is performed according to the comparison of the text entity with the identification text. On the one hand, the trained classification model determines whether the target text to be detected belongs to the preset type, and compliance detection is then performed either by comparing the recognized target text with the originally marked identification text (when it belongs to the preset type) or by comparing the text entity obtained from the trained named entity recognition model with the originally marked identification text (otherwise). Because the classification model and the named entity recognition model are fused to recognize the text, its compliance can be determined accurately and the recognition accuracy is improved.
On the other hand, through the trained classification model and named entity recognition model, the function of automatically recognizing the text can be realized, and the compliance detection is carried out on the text to be detected of the target, so that the manual auditing operation is reduced, the efficiency is improved, and the cost is saved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
fig. 1 schematically shows a flow chart of a text detection method according to an embodiment of the present invention;
FIG. 2 schematically illustrates a schematic diagram of a training classification model according to an embodiment of the invention;
FIG. 3 schematically illustrates a diagram of generating a sequence of word vectors according to an embodiment of the present invention;
FIG. 4 schematically shows a flow chart for comparing a target text to be detected with the identification text according to an embodiment of the present invention;
FIG. 5 schematically illustrates a diagram of training a named entity recognition model according to an embodiment of the invention;
FIG. 6 schematically illustrates a diagram of comparing a text entity with the identification text at its marked position according to an embodiment of the present invention;
FIG. 7 schematically illustrates an overall flow diagram of text compliance detection in accordance with an embodiment of the present invention;
fig. 8 schematically shows a block diagram of a text detection device according to an embodiment of the invention;
fig. 9 schematically shows a block diagram of an electronic device for implementing the text detection method described above.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In order to solve the above-mentioned problems, the embodiments of the present invention first provide a text detection method, which can be applied to a processing scenario for performing compliance detection on texts in various documents. The text detection method may be performed by a server, and referring to fig. 1, the text detection method may include step S110, step S120, step S130, and step S140.
Wherein:
In step S110, a text to be detected is parsed to obtain a target text to be detected, and the identification text present in the target text to be detected is marked;
In step S120, the target text to be detected is classified by a trained classification model to determine a classification result;
In step S130, if the classification result belongs to a preset type, the target text to be detected is compared with the identification text, and compliance detection is performed on the target text to be detected according to the comparison result;
In step S140, if the classification result does not belong to the preset type, the target text to be detected is input into a trained named entity recognition model to determine a text entity, and compliance detection is performed on the target text to be detected according to a comparison of the text entity with the identification text.
In the technical scheme provided by this example embodiment, on the one hand, the trained classification model determines whether the target text to be detected belongs to a preset type. If it does, compliance detection is performed by comparing the recognized target text with the originally marked identification text; if it does not, compliance detection is performed by comparing the text entity obtained from the trained named entity recognition model with the originally marked identification text. Because the classification model and the named entity recognition model are fused to recognize the text, its compliance can be determined accurately and the recognition accuracy is improved. On the other hand, the trained classification model and named entity recognition model enable automatic text recognition and compliance detection of the target text to be detected, reducing manual review, improving efficiency, and saving cost.
Next, a text detection method in an embodiment of the present invention will be further explained with reference to the accompanying drawings.
In step S110, the text to be detected is parsed to obtain a target text to be detected, and the identification text existing in the target text to be detected is marked.
In the embodiment of the invention, the text to be detected can be online text or offline stored local text, and specifically can be any whole or partial document expressed in a text form, for example, a contract, a written document, a teaching document and the like, and the text to be detected is not particularly limited herein. The text to be detected may be in the form of a word document. The target text to be detected refers to the parsed text to be detected. The identification text may be text in a target text to be detected, which is different from other display forms, for example, bold text, italic text, enlarged text or text with other special display forms in the target text to be detected.
The target text to be detected refers to the document after parsing, which mainly consists of splitting the text to be detected into sentences and preprocessing it. The whole document can be split into sentences quickly and accurately by a program. After sentence splitting, a preprocessing operation may be performed on the text. Preprocessing here includes, but is not limited to, removing special characters, converting traditional Chinese characters to simplified ones, normalizing English case, and similar operations. Preprocessing makes the resulting target text to be detected more standard and easier to process. Based on this, in the subsequent processing the target text to be detected refers to the text contained in each sentence.
After the target text to be detected is obtained, the identification text in it is marked and the mark position of the identification text is obtained, so that the identification text can be located accurately. Specifically, marking the identification text covers two cases. In the first case, if the identification text is the entire target text to be detected, a first mark is added to the identification text and its mark position is determined. That is, if the whole sentence is identification text, a first mark may be added, which may be, for example, "1" or another number or letter. For sentences with the first mark, the classification model may not be able to judge whether they are identification text, but the named entity recognition model may recognize the whole sentence as one identification text; these sentences are therefore processed as follows: the start and end positions of the identification text are recorded, and its mark position is set to (0, length-1), where length is the length of the sentence.
In the second case, if the identification text is only part of the target text to be detected, a second mark is added to the identification text and its mark position is determined. That is, if only part of the sentence is identification text, a second mark may be added, which may be, for example, "0" or another number or letter different from the first mark. For a sentence with the second mark, the start and end positions of the identification text (the partially bolded text) are recorded, and its mark position is (start_index, end_index); for example, in the sentence "when the insurer requests compensation, the following proof and information should be provided to the insurer.", the bolded part is labeled (0, 5). In addition, for the text in the sentence other than the identification text, the start and end positions are recorded and the mark position is (None, None).
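The two marking cases above can be sketched as a small helper. The function name and the choice of "1"/"0" as the first and second marks follow the text; treating absent identification text as (None, None) is also taken from the text, while everything else is an illustrative assumption.

```python
def mark_identification_text(sentence, id_start, id_end):
    """Record the mark and mark position of the identification (bold) text.

    id_start/id_end -- character positions of the identification text,
                       or None when the sentence contains none.
    Returns (mark, position):
      mark "1": the whole sentence is identification text -> (0, length-1)
      mark "0": only part of the sentence is               -> (start, end)
    """
    if id_start is None:
        return None, (None, None)            # no identification text
    if id_start == 0 and id_end == len(sentence) - 1:
        return "1", (0, len(sentence) - 1)   # first mark: fully marked
    return "0", (id_start, id_end)           # second mark: partially marked

print(mark_identification_text("all bold", 0, 7))              # ('1', (0, 7))
print(mark_identification_text("only PART is bold here", 5, 8))  # ('0', (5, 8))
```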
According to the embodiment of the invention, the position of the identification text can be accurately determined by marking the identification text, so that the target text to be detected can be accurately subjected to compliance detection based on the marking position.
In step S120, the text to be detected of the target is classified by the trained classification model, so as to determine a classification result.
In the embodiment of the invention, the classification model classifies the target text to be detected to obtain a classification result. The classification result is generally one of: fully bold, partially bold, or not bold. Specifically, the target text to be detected can be input into the trained classification model to obtain the type it belongs to. To obtain more accurate classification results, the trained classification model may be obtained before step S120.
A classification model training flowchart is schematically shown in fig. 2, and referring to fig. 2, the process of obtaining a trained classification model may be as follows:
step S210, first sample data are acquired, and the first sample data are input into an enabling layer to generate a corresponding word vector sequence.
In the embodiment of the invention, the first sample data refers to a part of the text obtained from the history detection text and used to train the classification model; that is, the first sample data is text data for which it is already known whether the text is bold. History detection text refers to previously detected documents for which it is known whether they contain bold text.
After the first sample data is obtained, it may be input into the embedding layer to generate a corresponding word vector sequence, as shown in fig. 3, specifically including the following steps S310 to S340:
Step S310, preprocessing the history detection text, and determining the preprocessed history detection text as a positive sample and a negative sample, so as to obtain the first sample data.
In the embodiment of the invention, the history detection text can be processed before the first sample data is acquired. Specifically, the history detection text may be split into sentences and the split text formatted; the formatted history detection text is then segmented by word to obtain the preprocessed history detection text. Concretely: all documents are split into sentences to serve as the corpus for training the models; the corpus is loaded and formatted, where formatting includes removing special characters, converting traditional Chinese characters to simplified ones, normalizing English case, and so on; and all the formatted corpus is segmented by word to obtain the preprocessed history detection text. In this way, the preprocessed history detection text consists of word-segmented sentences.
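The sentence splitting, formatting, and segmentation steps might look like the following sketch. The punctuation set, the regular expressions, and the choice to segment Chinese text by character are simplifying assumptions; traditional-to-simplified conversion is omitted here.

```python
import re

def preprocess(document):
    """Split a document into sentences, format them, and segment by character.

    Formatting in this sketch only removes special characters and lowercases
    English; other steps described in the text are omitted.
    """
    # 1. Sentence splitting on common Chinese/English end punctuation.
    sentences = [s for s in re.split(r"[。！？!?]", document) if s.strip()]
    processed = []
    for s in sentences:
        s = re.sub(r"[^\w\u4e00-\u9fff]", "", s)   # 2. drop special characters
        s = s.lower()                               #    normalize English case
        processed.append(list(s))                   # 3. segment by character
    return processed

doc = "投保人应当如实告知。Please READ the terms!"
for tokens in preprocess(doc):
    print(tokens)
```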
After the preprocessed history detection text is obtained, the first sample data may be taken from it. The first sample data may specifically include positive samples and negative samples: sentences of the history detection text that are entirely bold may be labeled as positive samples, and the other sentences as negative samples.
For example, all clauses in insurance contracts are extracted and split into sentences to serve as the initial corpus for training the models. Each sentence is assigned to one of three types (whole sentence bold, part of the sentence bold, not bold), labeled A_bold, P_bold, and N_bold respectively. The classified history detection text is then formatted, including removing special characters, converting traditional Chinese characters to simplified ones, normalizing English case, and so on. All the formatted corpus is segmented by word to obtain the preprocessed history detection text. Next, all fully bolded sentences (A_bold) in the preprocessed history detection text may be labeled as positive samples, and all other sentences (P_bold, N_bold) as negative samples, to obtain the first sample data.
Step S320, word vectors are obtained from the preprocessed history detection text.
In the embodiment of the invention, word vectors are trained on the preprocessed history detection text with the word2vec word-embedding algorithm. After the training corpus has been preprocessed and labeled, word vectors must be pretrained: they are trained with word2vec on the word-segmented preprocessed history detection text and are used for training the classification model, training the named entity recognition model, and at the application stage. word2vec trains a network to predict a word from the words at adjacent positions; under the bag-of-words assumption in word2vec, the order of the context words is unimportant. After training is completed, the word2vec model maps each word to a vector that can represent word-to-word relationships; this vector is the hidden layer of the neural network.
Step S330, serializing the preprocessed history detection text to obtain a sequence history detection text.
In the embodiment of the invention, serialization refers to the process of converting the state information of an object into a form that can be stored or transmitted. Serialization here means that each word obtained by word segmentation may be represented by a number, so that the sequence history detection text may be obtained.
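The serialization step above can be sketched as a word-to-index mapping (the reserved `<UNK>` index and the helper names are assumptions, not from the patent):

```python
# Minimal sketch of the serialization step: each word produced by the
# word segmentation is mapped to an integer index, so a sentence becomes
# a sequence of numbers. Index 0 is reserved for unknown words here.

def build_vocab(segmented_sentences):
    vocab = {"<UNK>": 0}
    for sentence in segmented_sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab

def serialize(sentence, vocab):
    return [vocab.get(word, vocab["<UNK>"]) for word in sentence]

sentences = [["the", "insured", "shall", "pay"], ["the", "insurer", "shall", "pay"]]
vocab = build_vocab(sentences)
print(serialize(["the", "insured", "shall", "pay"], vocab))  # [1, 2, 3, 4]
```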
And step S340, constructing an embedding layer according to the word vector and the sequence history detection text, and inputting the first sample data into the embedding layer to generate the word vector sequence.
In the embodiment of the invention, after the word vectors and the sequence history detection text are obtained, an embedding layer of the classification model can be constructed from them. The classification model is a neural network model, which specifically consists of an embedding layer, a bi-directional long short-term memory (Bi-LSTM) layer, a max-pooling layer, a linear layer and a softmax classification layer.
The embedding layer maps each input word index to a distributed word vector through a shared matrix; that is, the embedding layer expresses words as vectors. After the embedding layer is constructed, the first sample data composed of the labeled positive and negative samples can be input into the embedding layer to generate the word vector sequences corresponding to the positive samples and the negative samples.
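The embedding lookup can be sketched as a row selection in a shared matrix (sizes and the random initialization are illustrative; in practice the matrix would hold the pre-trained word2vec vectors):

```python
import numpy as np

# Sketch of the embedding layer as a shared lookup matrix: row i holds
# the word vector for vocabulary index i, so mapping an index sequence
# to vectors is a row lookup.

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # pre-trained in practice

def embed(index_sequence):
    return embedding_matrix[index_sequence]  # shape: (seq_len, embed_dim)

word_vector_sequence = embed([1, 2, 3, 4])
print(word_vector_sequence.shape)  # (4, 4)
```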
With continued reference to fig. 2, in step S220, the word vector sequence is trained via a long-short-term memory network to obtain a context feature.
In the embodiment of the invention, the long short-term memory model is a bidirectional long short-term memory model consisting of a forward LSTM and a backward LSTM. The method uses the bidirectional long short-term memory model to extract the contextual features of the word vector sequences (sentences); this process is an encoding process, and the specific steps are as follows:
For the word vector sequence $(x_1, x_2, \ldots, x_n)$, after performing LSTM encoding from left to right and from right to left respectively, the hidden layer state of each time step in the two directions is obtained; the forward hidden layer output is denoted $\overrightarrow{h_t}$ and the backward hidden layer output is denoted $\overleftarrow{h_t}$. The calculation formulas of the LSTM cell include formulas (1) to (5):
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$  Formula (1)

$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$  Formula (2)

$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$  Formula (3)

$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$  Formula (4)

$h_t = o_t \tanh(c_t)$  Formula (5)
Where $\sigma$ is the logistic (sigmoid) activation function, $x_t$ is the word vector at time $t$, $i_t$, $f_t$ and $o_t$ respectively denote the input gate, forget gate and output gate at time $t$, $c_t$ and $c_{t-1}$ are the memory cell states at times $t$ and $t-1$, and $h_t$ denotes the hidden layer vector at time $t$. $b_i$, $b_f$, $b_c$ and $b_o$ are the bias parameters of the input gate, forget gate, memory cell and output gate respectively. The subscripts of the weight matrices $W$ carry specific meaning; for example, $W_{hi}$ denotes the weight matrix connecting the hidden layer to the input gate.
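Formulas (1) to (5) can be transcribed directly into code; the following numpy sketch mirrors them term by term (weight shapes and values are illustrative stand-ins, not learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Peephole-style LSTM cell matching formulas (1)-(5): the input and forget
# gates see c_{t-1}, and the output gate sees the freshly computed c_t.

def lstm_cell(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])   # (1)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])   # (2)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"]) # (3)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] @ c_t + b["o"])      # (4)
    h_t = o_t * np.tanh(c_t)                                                      # (5)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = {k: rng.normal(scale=0.1, size=(d_h, d_in if k[0] == "x" else d_h))
     for k in ["xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co"]}
b = {k: np.zeros(d_h) for k in ["i", "f", "c", "o"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_cell(rng.normal(size=d_in), h, c, W, b)
print(h.shape)  # (3,)
```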
In order to make full use of the context information of the text at each moment, the forward and backward hidden layer outputs are concatenated as the hidden layer output at that moment, expressed as $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$.
in step S230, a max pooling operation is employed on the contextual features to obtain feature vectors for the sequence of word vectors.
In the embodiment of the invention, max-pooling selects the point with the largest value in the local receptive field. The max-pooling operation is applied to the Bi-LSTM layer output to obtain a feature representation of the input word vector sequence; it extracts the most salient features of the word vector sequence, i.e., the feature vector.
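The concatenation of forward and backward states and the max-pooling over time can be sketched as follows (the hidden states here are random stand-ins for real Bi-LSTM outputs):

```python
import numpy as np

# Forward and backward hidden states at each time step are concatenated,
# then max-pooling over the time axis keeps, per dimension, the largest
# value, yielding one fixed-size feature vector for the whole sentence.

rng = np.random.default_rng(0)
seq_len, d_h = 5, 3
forward_h = rng.normal(size=(seq_len, d_h))
backward_h = rng.normal(size=(seq_len, d_h))

h = np.concatenate([forward_h, backward_h], axis=1)  # (seq_len, 2*d_h)
feature_vector = h.max(axis=0)                       # (2*d_h,)
print(feature_vector.shape)  # (6,)
```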
In step S240, the feature vectors are sequentially input into a linear layer and a classification layer to obtain the trained classification model.
In the embodiment of the invention, the feature vector corresponding to the obtained word vector sequence is first input into the linear layer and then into the classification layer, and the weight parameters of each layer of the neural network are adjusted until the classification result of the first sample data is consistent with the result of manual classification, so that a trained classification model can be obtained.
With continued reference to fig. 1, in step S130, if the classification result is of a preset type, comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to the comparison result.
In the embodiment of the present invention, based on step S120, it may be first determined whether the obtained classification result belongs to a preset type, and if the classification result is the preset type, the classification result may be marked with "1". The preset type may specifically represent a bold type, for example, all texts in the target text to be detected belong to bold texts.
If the classification result of the target text to be detected belongs to the preset type, the target text to be detected may be compared with the identification text marked in step S110, so as to perform compliance detection on the target text to be detected. Compliance detection herein refers to determining whether the target text to be detected is bold text and whether the bold text in the target text to be detected is correct. A flow chart schematically showing comparison of the target text to be detected and the identification text is shown in fig. 4, and referring to fig. 4, the flow chart mainly includes steps S410 to S430, wherein:
Step S410, the identification text is obtained according to the mark position, and whether the target text in the target text to be detected is consistent with the identification text is judged.
In this step, the target text refers to bold text identified from the target text to be detected through the trained classification model; that is, the target text is the text marked "1" in the classification result obtained from the classification model. The previously marked identification text can be accurately retrieved from its mark position, for example from the start-position and end-position marks (0, length-1). The acquired target text is then compared with the predetermined identification text to check whether the two are identical.
And step S420, if the target text is consistent with the identification text, determining that the target text is compliant.
In this step, if the text marked "1" in the obtained target text to be detected is "insurance applicant", and the bold text between the pre-marked start position and end position (0, 5) is also "insurance applicant", the two are considered identical, and the target text can be determined to be compliant. When compliance is determined, the label "compliant" may be used to denote it.
Step S430, if the target text is inconsistent with the identification text, determining that the target text is not compliant.
In this step, if the text marked "1" in the obtained target text to be detected does not match the bold text between the pre-marked start position and end position (0, 5), the two are considered different, and the target text can be determined to be non-compliant. When non-compliance is determined, the label "non-compliant" may be used to denote it.
It should be noted that, the marking position of the target text and the marking position of the identification text may be compared, so as to determine whether the content is the same when the marking positions are the same, which is not described in detail herein.
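Steps S410 to S430 can be sketched as follows (the function name and example strings are illustrative, not from the patent):

```python
# The identification text is retrieved from its pre-marked (start, end)
# position in the source sentence, and the target text found by the
# classification model is checked against it for compliance.

def check_compliance(detected_target, source_text, mark_start, mark_end):
    identification_text = source_text[mark_start:mark_end + 1]
    return "compliant" if detected_target == identification_text else "non-compliant"

source = "insurance applicant must submit proof"
# Pre-marked bold span (0, 18) covers "insurance applicant".
print(check_compliance("insurance applicant", source, 0, 18))  # compliant
print(check_compliance("applicant", source, 0, 18))            # non-compliant
```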
With continued reference to fig. 1, in step S140, if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
In the embodiment of the invention, if the target text to be detected does not belong to the preset type, compliance detection of the target text continues. Specifically, the target text to be detected can be input into the trained named entity recognition model for sequence labeling, to judge whether a text entity is obtained. If a text entity is identified, this indicates that part of the text in the target text to be detected needs to be set to the preset type. For example, if a bold text entity is identified, this means that some text in the sentence needs to be bolded. Compliance detection is then performed on the target text to be detected according to the comparison result of the text entity and the identification text. By fusing the classification model and the named entity recognition model, the target text to be detected can be accurately recognized, the bold text therein accurately identified, and the accuracy of compliance detection improved.
A named entity recognition model identifies entities of specific categories, such as person names, place names, organization names and proper nouns, from the target text to be detected. Named entity recognition is generally abstracted as a sequence labeling problem, i.e., the problem of assigning a specific label to each symbol in a sequence, which is essentially classifying each element in the sequence according to its context.
In order to obtain a more accurate classification result, a trained named entity recognition model may be obtained before step S140, so that the text entities present in the target text to be detected can be determined through the model. The text entity is determined by the specific recognition target: for example, if the recognition target is bold text, the text entity is a bold text entity; if the recognition target is italic text, the text entity is an italic text entity. In the embodiment of the invention, a bold text entity is taken as the example text entity for explanation.
A schematic diagram of training the named entity recognition model is shown in fig. 5; referring to fig. 5, the process mainly includes steps S510 to S530, where:
Step S510, obtaining second sample data, wherein the second sample data is obtained according to a sequence labeling rule.
In this step, the second sample data is used for training the named entity recognition model and is obtained through sequence labeling. Specifically, sentences in the preprocessed history detection text that are partially bolded (partially bold text) are labeled according to the BIO labeling scheme for named entity recognition data. The labels take the forms B-BOLD, I-BOLD and O, representing the beginning character of a bold span, a character inside a bold span, and a non-bold character, respectively. For example, the sentence "The following certificates and data should be provided to the insurer when the insurance applicant claims reimbursement." is labeled as: B-BOLD, I-BOLD, I-BOLD, I-BOLD, I-BOLD, I-BOLD, O, O, O, O, O, O, O, O, O, O, O, O, O.
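The BIO labeling above can be sketched as follows (the token list and span positions are illustrative):

```python
# Given a sentence (as a list of tokens) and the bold span's token
# positions, emit B-BOLD for the first bold token, I-BOLD for the rest
# of the span, and O everywhere else.

def bio_tags(tokens, bold_start, bold_end):
    tags = []
    for i, _ in enumerate(tokens):
        if i == bold_start:
            tags.append("B-BOLD")
        elif bold_start < i <= bold_end:
            tags.append("I-BOLD")
        else:
            tags.append("O")
    return tags

tokens = ["when", "claiming", "the", "insured", "shall", "provide", "documents"]
print(bio_tags(tokens, 2, 4))
# ['O', 'O', 'B-BOLD', 'I-BOLD', 'I-BOLD', 'O', 'O']
```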
Step S520, inputting the second sample data into a long-short-time memory network, so as to obtain the probability that each word in the second sample data is marked with a sequence label.
In this step, the named entity recognition model consists of an embedding layer, a bi-directional long short-term memory (Bi-LSTM) layer and a conditional random field layer. In a traditional neural network, inputs and outputs are independent of each other, but in sequence labeling the later outputs are related to the earlier content, i.e., the output labels are strongly interdependent; therefore the same embedding layer and Bi-LSTM network structure as in the classification model are adopted. The second sample data obtained in step S510 may be input into the Bi-LSTM network structure, whose output value indicates the probability that each word in the second sample data is labeled with each sequence tag (B-BOLD, I-BOLD, O).
The hidden layer output $h_t$ is passed directly into a linear layer for linear transformation, transforming each hidden layer state into a vector of dimension $1 \times k$, where $k$ is the number of tags. If a softmax function were applied directly, each word would be labeled independently; however, sequence labeling cannot be treated as a simple per-word classification problem, because the labels influence one another, and treating it as independent classification loses information, namely the dependency between tags: for example, I-BOLD cannot be immediately followed by the beginning tag of another entity. Therefore a conditional random field layer is introduced to model the output over the whole sentence.
And step S530, inputting the transition probability between the probability and the label into a conditional random field layer for sentence-level sequence labeling so as to obtain the trained named entity recognition model.
In this step, the transition probability between tags refers to the probability that a certain tag is converted to another tag, for example, the probability that tag B-BOLD is converted to tag I-BOLD. Accessing a conditional random field layer, and marking a sentence-level sequence by fusing an output value of a Bi-LSTM network structure and transition probability between labels; the specific process comprises the following steps:
The score of labeling a sentence $X$ with the tag sequence $l$ is defined as:

$s(X, l) = \sum_{i=1}^{n} (A_{l_{i-1}, l_i} + f_{i, l_i})$

wherein $A$ is the tag transition probability matrix of the conditional random field layer and $f$ is the output value of the Bi-LSTM network structure, both being learned parameters. Normalizing this score through a softmax layer gives the probability:

$p(l \mid X) = \dfrac{e^{s(X, l)}}{\sum_{l'} e^{s(X, l')}}$

Therefore, when training the model, the optimization objective is to minimize the negative log-likelihood, as follows:

$-\log p(l \mid X) = \log \sum_{l'} e^{s(X, l')} - s(X, l)$

When the model makes predictions, the dynamic-programming Viterbi algorithm is used to solve for the optimal path, with the formula:

$l^{*} = \arg\max_{l'} s(X, l')$
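A minimal numpy sketch of this Viterbi decoding, assuming three tags (0=B-BOLD, 1=I-BOLD, 2=O), an added start score, and illustrative emission/transition values:

```python
import numpy as np

# Viterbi over the CRF score s(X, l) = sum_i (A[l_{i-1}, l_i] + f[i, l_i]):
# dynamic programming keeps, for each position and tag, the best score of
# any path ending there, then backtracks from the best final tag.

def viterbi(emissions, transitions, start):
    n, k = emissions.shape
    score = start + emissions[0]          # best score ending at each tag, step 0
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # total[p, q]: best path ending in tag p at i-1, then tag q at i
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

NEG = -1e4  # effectively forbids a transition
transitions = np.array([
    [NEG, 0.0, 0.0],   # from B-BOLD: B->B forbidden
    [NEG, 0.0, 0.0],   # from I-BOLD
    [0.0, NEG, 0.0],   # from O: O->I forbidden
])
start = np.array([0.0, NEG, 0.0])  # a sentence cannot start with I-BOLD
emissions = np.array([
    [0.2, 1.0, 0.1],   # Bi-LSTM slightly prefers I here, but it is invalid at start
    [0.1, 0.9, 0.3],
    [0.0, 0.1, 1.2],
])
print(viterbi(emissions, transitions, start))  # [0, 1, 2] -> B-BOLD, I-BOLD, O
```

The transition constraints do real work here: without them, the first position would greedily take the invalid I-BOLD tag.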
the conditional random field layer carries out sentence-level sequence labeling by combining the output of the Bi-LSTM network structure and the transition probability between labels until the labeling result is consistent with the sequence labeling of the historical detection text manually determined in advance, so as to obtain a trained named entity recognition model with better performance.
Fig. 6 schematically illustrates performing compliance detection on the target text to be detected according to the comparison result of the text entity and the identification text, which includes steps S610 to S630, where:
step S610, acquiring a text entity according to the marking position of the text entity;
step S620, if the text entity is consistent with the identification text, determining that the text entity is compliant;
Step S630, if the text entity is inconsistent with the identification text, determining that the text entity is not compliant.
In the embodiment of the invention, the trained named entity recognition model is used for carrying out bold text recognition and outputting a label sequence; the tag sequence is further converted to a text entity, and a tag location of the text entity is obtained, which may include a start location and an end location. If there is no text entity, the start and end positions are marked (None ).
On this basis, the text entity can be rapidly extracted according to the marking position of the text entity, and compared with the identification text obtained in step S110, if the text entity is completely consistent with the identification text, the compliance of the text entity is determined, and a label of "compliance" is returned. If the text entity is inconsistent with the identified text, determining that the text entity is not compliant, and returning a "non-compliant" tag.
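Turning the predicted tag sequence into a text entity and its mark position can be sketched as follows (names and spans are illustrative; (None, None) is returned when no entity exists, as described above):

```python
# A B-BOLD tag opens an entity span, following I-BOLD tags extend it,
# and anything else closes it; the entity text is recovered from the
# token list using the resulting (start, end) mark position.

def extract_entity(tokens, tags):
    start = end = None
    for i, tag in enumerate(tags):
        if tag == "B-BOLD":
            start = end = i
        elif tag == "I-BOLD" and start is not None:
            end = i
        elif start is not None:
            break
    if start is None:
        return (None, None), ""
    return (start, end), "".join(tokens[start:end + 1])

tokens = list("the insured shall pay")
# hypothetical prediction: characters 4-10 ("insured") labeled bold
tags = ["O"] * 4 + ["B-BOLD"] + ["I-BOLD"] * 6 + ["O"] * 10
position, entity = extract_entity(tokens, tags)
print(position, entity)  # (4, 10) insured
```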
In the embodiment of the invention, the trained classification model and the trained named entity recognition model recognize the target text to be detected, and the comparison result between the recognized text and the identification text makes it possible to accurately identify the bold text in the target text to be detected and accurately determine whether that bold text is compliant. In addition, the classification model detects whether a sentence should be entirely bolded, while the named entity recognition method detects text that is partially bolded within a sentence. By fusing named entity recognition with text classification and applying them to bold text detection in text compliance detection, the labor cost and time cost of text auditing are greatly reduced and operational efficiency is improved.
An overall flow chart of compliance detection of text is schematically shown in fig. 7, and with reference to fig. 7, mainly comprises the following steps:
and step 701, analyzing the text to be detected to obtain a target text to be detected.
In step S702, the target text to be detected is input into a trained classification model to determine whether the target text is of a preset type, where the preset type may be all text belonging to bold text.
In step S703, if the target text to be detected is determined to be bold text, it is compared with the pre-marked identification text to determine whether the target text to be detected is compliant.
Step S704, if the target text to be detected is judged to be a non-bold text, the target text to be detected is input into a trained named entity recognition model.
Step S705, determining the text entity through the trained named entity recognition model. The text entities are bold text entities.
Step S706, the text entity is compared with the identification text labeled in advance to determine whether the text entity is compliant.
In the technical scheme of fig. 7, whether the target text to be detected belongs to a preset type can be determined through the trained classification model. If it does, compliance detection is performed by comparing the identified target text with the originally marked identification text; if it does not, compliance detection is performed by comparing the text entity obtained from the trained named entity recognition model with the originally marked identification text. Because the classification model and the named entity recognition model are fused to recognize the text, whether the text to be recognized is compliant can be accurately determined, improving recognition accuracy.
In an embodiment of the present invention, a text detection device is further provided, and referring to fig. 8, the device 800 mainly includes:
the identification text determining module 801 may be configured to parse a text to be detected to obtain a target text to be detected, and mark an identification text existing in the target text to be detected;
the text classification module 802 may be configured to classify the target text to be detected by using a trained classification model, so as to determine a classification result;
the first detection module 803 may be configured to compare the target text to be detected with the identification text if the classification result is of a preset type, and perform compliance detection on the target text to be detected according to the comparison result;
the second detection module 804 may be configured to input the target text to be detected into a trained named entity recognition model to determine a text entity if the classification result does not belong to the preset type, and perform compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text.
It should be noted that, each functional module of the text detection device in the embodiment of the present invention is the same as the steps of the above-mentioned exemplary embodiment of the text detection method, so that a detailed description thereof is omitted herein.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the invention. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods of the present invention are depicted in the accompanying drawings in a particular order, this is not required to either imply that the steps must be performed in that particular order, or that all of the illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
In an exemplary embodiment of the present invention, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "system."
An electronic device 900 according to such an embodiment of the invention is described below with reference to fig. 9. The electronic device 900 shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, and a bus 930 connecting the different system components (including the storage unit 920 and the processing unit 910).
Wherein the storage unit stores program code that is executable by the processing unit 910 such that the processing unit 910 performs steps according to various exemplary embodiments of the present invention described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 910 may perform the steps as shown in fig. 1.
The storage unit 920 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 9201 and/or cache memory 9202, and may further include Read Only Memory (ROM) 9203.
The storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 930 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 900 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 950. Also, electronic device 900 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 960. As shown, the network adapter 960 communicates with other modules of the electronic device 900 over the bus 930. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 900, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiment of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the text detection method according to the embodiment of the present invention.
In an exemplary embodiment of the present invention, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
A program product for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims (10)

1. A text detection method, comprising:
analyzing a text to be detected to obtain a target text to be detected, and marking an identification text existing in the target text to be detected;
classifying the target text to be detected through the trained classification model to determine a classification result;
If the classification result is of a preset type, comparing the target text to be detected with the identification text, and carrying out compliance detection on the target text to be detected according to the comparison result; the preset type is bold text;
if the classification result does not belong to the preset type, inputting the target text to be detected into a trained named entity recognition model to determine a text entity, and carrying out compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text; and the compliance detection is used for judging whether the target text to be detected is a bold text or not and whether the bold text in the target text to be detected is correct or not.
2. The text detection method of claim 1, wherein prior to classifying the target text to be detected by the trained classification model, the method further comprises:
acquiring first sample data, and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence;
training the word vector sequence through a long short-term memory network to obtain context features;
obtaining a feature vector of the word vector sequence by applying a max-pooling operation to the context features;
and sequentially inputting the feature vector into a linear layer and a classification layer to obtain the trained classification model.
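The head of the classification model described in claim 2 (max pooling over the per-word context features, then a linear layer and a classification layer) can be sketched in pure Python. All parameter names and shapes here are illustrative assumptions; in practice the weights would come from training.

```python
import math

def classify(context_features, weights, bias):
    """Sketch of claim 2's classification head (hypothetical parameters).

    context_features: list of per-word feature vectors, standing in for the
    long short-term memory network's outputs; weights: n_classes rows of
    length hidden; bias: n_classes values. Returns class probabilities.
    """
    hidden = len(context_features[0])
    # Max pooling over time: keep the largest value of each feature dimension.
    pooled = [max(step[d] for step in context_features) for d in range(hidden)]
    # Linear layer: one logit per class.
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    # Classification layer: numerically stable softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With identity weights and zero bias, the class whose pooled feature is largest receives the highest probability, which is the behavior the pooled-feature-plus-linear-layer pipeline is meant to produce.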
3. The text detection method of claim 2, wherein acquiring first sample data and inputting the first sample data into an embedding layer to generate a corresponding word vector sequence comprises:
preprocessing historical detection text, and designating the preprocessed historical detection text as positive samples and negative samples to obtain the first sample data;
obtaining word vectors according to the preprocessed historical detection text;
serializing the preprocessed historical detection text to obtain serialized historical detection text;
and constructing the embedding layer according to the word vectors and the serialized historical detection text, and inputting the first sample data into the embedding layer to generate the word vector sequence.
4. The text detection method of claim 1, wherein prior to sequence labeling the target text to be detected by a trained named entity recognition model to determine a text entity of the target text to be detected, the method further comprises:
acquiring second sample data, wherein the second sample data is obtained according to a sequence labeling rule;
inputting the second sample data into a long short-term memory network to obtain, for each word in the second sample data, the probability of each sequence label;
and inputting the probabilities, together with the transition probabilities between labels, into a conditional random field layer to perform sentence-level sequence labeling, so as to obtain the trained named entity recognition model.
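Combining per-word label probabilities with label-to-label transition probabilities for sentence-level decoding, as the conditional random field layer in claim 4 does, is commonly realized with Viterbi decoding. A minimal pure-Python sketch follows; the data format ({label: score} dicts and a transition-score dict) and all names are illustrative, not the patent's implementation.

```python
def viterbi(emissions, transitions, labels):
    """Sketch of sentence-level sequence decoding for a CRF-style layer.

    emissions: per-word label scores (list of {label: score} dicts),
    standing in for the long short-term memory network's output;
    transitions: {(prev_label, next_label): score}.
    Returns the highest-scoring label sequence.
    """
    # Initialise path scores with the first word's emission scores.
    scores = {lab: emissions[0][lab] for lab in labels}
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = {}, {}
        for lab in labels:
            # Pick the best previous label, accounting for the transition score.
            prev = max(labels, key=lambda p: scores[p] + transitions[(p, lab)])
            new_scores[lab] = scores[prev] + transitions[(prev, lab)] + emit[lab]
            pointers[lab] = prev
        backpointers.append(pointers)
        scores = new_scores
    # Backtrack from the best final label to recover the full label sequence.
    best = max(labels, key=scores.get)
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The transition scores are what make the labeling "sentence-level": even with flat emission scores, a strong transition score can steer the decoder toward a particular label path.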
5. The text detection method according to claim 1, wherein marking the identification text present in the target text to be detected includes:
if the identification text is all of the text in the target text to be detected, adding a first mark to the identification text, and determining the mark position of the identification text;
and if the identification text is a part of the text in the target text to be detected, adding a second mark to the identification text, and determining the mark position of the identification text.
6. The text detection method according to claim 5, wherein comparing the target text to be detected with the identification text, and performing compliance detection on the target text to be detected according to a comparison result comprises:
acquiring an identification text in the target text to be detected according to the marking position;
if the target text in the target text to be detected is consistent with the identification text, determining that the target text is compliant;
and if the target text is inconsistent with the identification text, determining that the target text is not compliant.
7. The text detection method according to claim 1, wherein performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text includes:
acquiring a text entity according to a marking position of the text entity;
if the text entity is consistent with the identification text, determining that the text entity is compliant;
and if the text entity is inconsistent with the identification text, determining that the text entity is not compliant.
8. A text detection device, comprising:
the identification text determining module is used for analyzing the text to be detected to obtain a target text to be detected and marking the identification text in the target text to be detected;
the text classification module is used for classifying the target text to be detected through a trained classification model so as to determine a classification result;
the first detection module is used for comparing the target text to be detected with the identification text if the classification result belongs to a preset type, and carrying out compliance detection on the target text to be detected according to the comparison result; the preset type is bold text;
the second detection module is used for inputting the target text to be detected into a trained named entity recognition model to determine a text entity if the classification result does not belong to the preset type, and performing compliance detection on the target text to be detected according to a comparison result of the text entity and the identification text; the compliance detection is used to determine whether the target text to be detected is bold text and whether the bold text in the target text to be detected is correct.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text detection method of any one of claims 1-7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text detection method of any of claims 1-7 via execution of the executable instructions.
CN201911088731.5A 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium Active CN111079432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088731.5A CN111079432B (en) 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111079432A CN111079432A (en) 2020-04-28
CN111079432B true CN111079432B (en) 2023-07-18

Family

ID=70310743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088731.5A Active CN111079432B (en) 2019-11-08 2019-11-08 Text detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111079432B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784594B (en) * 2020-06-05 2023-05-26 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium
CN111930939A (en) * 2020-07-08 2020-11-13 泰康保险集团股份有限公司 Text detection method and device
CN112464660B (en) * 2020-11-25 2023-02-07 深圳平安医疗健康科技服务有限公司 Text classification model construction method and text data processing method
CN112597299A (en) * 2020-12-07 2021-04-02 深圳价值在线信息科技股份有限公司 Text entity classification method and device, terminal equipment and storage medium
CN113705194A (en) * 2021-04-12 2021-11-26 腾讯科技(深圳)有限公司 Extraction method and electronic equipment for short

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680579A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Text regularization model training method and device, text regularization method and device
CN110134961A (en) * 2019-05-17 2019-08-16 北京邮电大学 Processing method, device and the storage medium of text

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7469251B2 (en) * 2005-06-07 2008-12-23 Microsoft Corporation Extraction of information from documents
US10628525B2 (en) * 2017-05-17 2020-04-21 International Business Machines Corporation Natural language processing of formatted documents


Also Published As

Publication number Publication date
CN111079432A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN107908635B (en) Method and device for establishing text classification model and text classification
CN111079432B (en) Text detection method and device, electronic equipment and storage medium
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US10372821B2 (en) Identification of reading order text segments with a probabilistic language model
US11113175B1 (en) System for discovering semantic relationships in computer programs
JP5901001B1 (en) Method and device for acoustic language model training
CN111611810B (en) Multi-tone word pronunciation disambiguation device and method
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN111309915A (en) Method, system, device and storage medium for training natural language of joint learning
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
CN105468585A (en) Machine translation apparatus and machine translation method
CN114580382A (en) Text error correction method and device
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN112507695A (en) Text error correction model establishing method, device, medium and electronic equipment
CN112784581A (en) Text error correction method, device, medium and electronic equipment
CN112347241A (en) Abstract extraction method, device, equipment and storage medium
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN111274829A (en) Sequence labeling method using cross-language information
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN112036186A (en) Corpus labeling method and device, computer storage medium and electronic equipment
CN113705207A (en) Grammar error recognition method and device
CN111339760A (en) Method and device for training lexical analysis model, electronic equipment and storage medium
CN115587184A (en) Method and device for training key information extraction model and storage medium thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant