CN111695054A - Text processing method and device, information extraction method and system, and medium - Google Patents


Info

Publication number
CN111695054A
Authority
CN
China
Prior art keywords
text
entity
reference resolution
processing
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010537652.4A
Other languages
Chinese (zh)
Inventor
沈大框
张莹
陈成才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN202010537652.4A
Publication of CN111695054A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/251: Fusion techniques of input or preprocessed data
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The text processing method comprises the following steps: identifying the reference relationships present in an initial text; performing reference resolution on the initial text based on the identified reference relationships to obtain processed reference resolution texts; performing entity recognition on each reference resolution text; acquiring the reference resolution sentences of the same order in each reference resolution text and screening them based on their entity recognition results; and merging the screened reference resolution sentences to obtain a semantically processed text. With this method, clearer semantic information can be obtained and the accuracy of the information extraction result is improved.

Description

Text processing method and device, information extraction method and system, and medium
Technical Field
The embodiment of the specification relates to the technical field of information processing, in particular to a text processing method and device, an information extraction method and system, and a medium.
Background
In the era of exploding internet information, reasonable screening of internet information is required in order to quickly acquire the needed information from the mass of information on the internet, and Information Extraction (IE) technology arose to meet this need. Information extraction structures unstructured text so that information about entities (Entity), relationships (Relation), events (Event), and so on can be extracted from it.
During information extraction, the semantic relationships between events and entities are often scattered across different positions in a text, and an entity may have several different forms of expression. As a result, the semantic information in the text is unclear, and omissions or extraction errors may occur when information is extracted from the text.
Disclosure of Invention
In view of the above, embodiments of the present specification provide, in one aspect, a text processing method, apparatus, and medium that can obtain clearer semantic information during text information extraction.
In another aspect, embodiments of the present specification further provide an information extraction method, system, and medium that can improve the accuracy of information extraction results.
An embodiment of the present specification provides a text processing method, comprising:
identifying the reference relationships present in an initial text;
performing reference resolution on the initial text based on the identified reference relationships to obtain processed reference resolution texts;
performing entity recognition on each reference resolution text;
acquiring the reference resolution sentences of the same order in each reference resolution text, and screening those same-order sentences based on their entity recognition results;
and merging the screened reference resolution sentences to obtain a semantically processed text.
Optionally, performing reference resolution on the initial text based on the identified reference relationship to obtain a processed reference resolution text includes:
acquiring the components of the reference relationship from the initial text, and selecting the main component of the reference relationship from among them;
and acquiring, from the initial text, some or all of the components that share the reference relationship with the main component, and replacing them with the main component to obtain a processed reference resolution text.
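As a minimal sketch of this replacement step (the function name, the character-span representation, and the example sentence are all illustrative; the patent does not prescribe an implementation), the components sharing a reference relationship can be rewritten to the main component by character-offset replacement:

```python
def resolve_references(text, mention_spans, main_text):
    """Replace every mention span in `mention_spans` (a list of
    (start, end) character offsets into `text`, all belonging to one
    reference relationship) with the main component `main_text`.
    Spans are applied right-to-left so earlier offsets stay valid."""
    for start, end in sorted(mention_spans, reverse=True):
        text = text[:start] + main_text + text[end:]
    return text

# Toy example (illustrative sentence): the pronoun refers to "Alice".
sentence = "Alice lost her keys, so she retraced her steps."
spans = [(24, 27)]  # the character span of "she"
print(resolve_references(sentence, spans, "Alice"))
# Alice lost her keys, so Alice retraced her steps.
```

Replacing from the rightmost span first is the design choice that keeps this simple: substituting a longer or shorter string would otherwise shift every offset to its right.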
Optionally, performing entity recognition on each reference resolution text includes:
inputting each reference resolution text into a preset entity recognition model to obtain an entity prediction probability matrix for each reference resolution text;
determining the distribution positions in each entity prediction probability matrix that meet a preset first condition, and taking the components at the corresponding positions in each reference resolution text as entities, to obtain an entity recognition result.
Optionally, before performing entity recognition on each reference resolution text, the method further includes:
inputting a preset training corpus and its entity true probability matrix into the entity recognition model for training, to obtain an entity prediction probability matrix of the training corpus;
performing an error calculation based on the entity prediction probability matrix and the entity true probability matrix of the training corpus, to obtain a result error value;
and, if the result error value meets a preset training completion condition, finishing the training of the entity recognition model; otherwise, adjusting the parameters of the entity recognition model and inputting the training corpus and its entity true probability matrix into the adjusted model for training, until training is complete.
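The train, evaluate, adjust loop above can be sketched as follows. Everything here is a toy stand-in: a one-feature logistic model plays the role of the entity recognition model, binary cross-entropy plays the role of the error calculation, and an error threshold plays the role of the training completion condition:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_true, y_pred):
    # Mean binary cross-entropy between true and predicted probabilities.
    eps = 1e-9
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy corpus: one feature per token; label 1 where a token starts an entity.
xs = [0.2, 0.9, 0.1, 0.8]
y_true = [0, 1, 0, 1]

w, b, lr = 0.0, 0.0, 1.0
error_threshold = 0.2          # stands in for the training completion condition
for step in range(10000):
    y_pred = [sigmoid(w * x + b) for x in xs]
    err = bce(y_true, y_pred)
    if err < error_threshold:  # condition met: training is finished
        break
    # Otherwise adjust the model parameters and train again.
    grad_w = sum((p - t) * x for p, t, x in zip(y_pred, y_true, xs)) / len(xs)
    grad_b = sum(p - t for p, t in zip(y_pred, y_true)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"stopped at step {step} with error {err:.3f}")
```

The structure mirrors the claim: predict, compute the result error value, test it against the completion condition, and only adjust parameters when the condition is not met.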
Optionally, the entity prediction probability matrix of each reference resolution text includes: a first prediction probability vector representing the predicted entity start positions in the reference resolution text, and a second prediction probability vector representing the predicted entity end positions;
and determining the distribution positions in the entity prediction probability matrix of each reference resolution text that meet the preset first condition, and taking the components at the corresponding positions as entities, includes:
comparing the first prediction probability vector with a preset first threshold, and determining the positions whose probability values exceed the first threshold, to obtain the predicted entity start position distribution of each reference resolution text;
comparing the second prediction probability vector with a preset second threshold, and determining the positions whose probability values exceed the second threshold, to obtain the predicted entity end position distribution of each reference resolution text;
and obtaining the entity distribution position intervals of each reference resolution text from the predicted start and end position distributions, and taking the components within the corresponding intervals in each reference resolution text as entities.
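A sketch of this interval decoding (the threshold values and the pairing of each start with the nearest qualifying end are assumptions; the patent only requires that positions exceeding the thresholds define the entity intervals):

```python
def extract_entities(text, start_probs, end_probs,
                     start_threshold=0.5, end_threshold=0.5):
    """Decode entity spans from per-character start/end probability
    vectors: each start position above `start_threshold` is paired with
    the nearest end position at or after it above `end_threshold`."""
    starts = [i for i, p in enumerate(start_probs) if p > start_threshold]
    ends = [i for i, p in enumerate(end_probs) if p > end_threshold]
    entities = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            e = min(candidates)          # nearest qualifying end position
            entities.append(text[s:e + 1])
    return entities

# Toy example with two entities in one text.
sample = "Paris and Berlin"
start_probs = [0.0] * len(sample)
start_probs[0], start_probs[10] = 0.9, 0.8   # "P" of Paris, "B" of Berlin
end_probs = [0.0] * len(sample)
end_probs[4], end_probs[15] = 0.9, 0.7       # "s" of Paris, "n" of Berlin
print(extract_entities(sample, start_probs, end_probs))
# ['Paris', 'Berlin']
```

Because an end may coincide with its start (e >= s), single-character entities fall out of the same decoding, which matches the advantage claimed later in the description.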
Optionally, the entity prediction probability matrix of the training corpus includes: a third prediction probability vector representing the predicted entity start positions in the training corpus, and a fourth prediction probability vector representing the predicted entity end positions; and the entity true probability matrix of the training corpus includes: a first real probability vector representing the true entity start positions in the training corpus, and a second real probability vector representing the true entity end positions;
and performing the error calculation based on the entity prediction probability matrix and the entity true probability matrix of the training corpus to obtain a result error value includes:
error calculation is performed by using the following loss function to obtain a result error value:
Figure BDA0002537556940000031
wherein, ysiThe ith probability value in the first real probability vector; y iseiThe ith probability value in the second real probability vector;
Figure BDA0002537556940000032
is the ith probability value in the third prediction probability vector;
Figure BDA0002537556940000033
is the ith probability value in the fourth prediction probability vector; i is a natural number.
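A plausible implementation of a loss over the four probability vectors defined above is the summed binary cross-entropy (an assumption, since the patent's exact formula is not legible in this text; the function name is illustrative):

```python
import math

def result_error(y_s, y_e, y_s_hat, y_e_hat, eps=1e-9):
    """Sum of binary cross-entropies between the real and predicted
    start-position vectors (y_s vs y_s_hat) and the real and predicted
    end-position vectors (y_e vs y_e_hat)."""
    def bce(true_vec, pred_vec):
        return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                    for t, p in zip(true_vec, pred_vec))
    return bce(y_s, y_s_hat) + bce(y_e, y_e_hat)

# Near-perfect predictions give a near-zero error; poor ones a large one.
print(result_error([1, 0], [0, 1], [0.99, 0.01], [0.01, 0.99]))  # ≈ 0.04
print(result_error([1, 0], [0, 1], [0.1, 0.9], [0.9, 0.1]))      # ≈ 9.2
```

The `eps` guard simply avoids log(0) when a predicted probability saturates at exactly 0 or 1.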
An embodiment of the present specification further provides an information extraction method, including:
obtaining a semantically processed text using the text processing method of any one of the above embodiments;
and performing information recognition processing on the semantically processed text, and extracting the corresponding components from it based on the information recognition result.
The embodiment of the specification also provides a text processing device comprising a first memory and a first processor, wherein the first memory is adapted to store one or more computer instructions which, when executed by the first processor, perform the steps of the text processing method of any one of the above embodiments.
The embodiment of the present specification further provides an information extraction system comprising a second memory and a second processor, wherein the second memory is adapted to store one or more computer instructions which, when executed by the second processor, perform the following step:
performing information recognition processing on the semantically processed text obtained by the text processing device, and acquiring the corresponding components from it based on the information recognition result.
The present specification further provides a computer-readable storage medium storing computer instructions which, when executed, perform the steps of the text processing method or the information extraction method of any one of the foregoing embodiments.
With the text processing scheme of the embodiments of this specification, after reference resolution is performed on the initial text, entity recognition can be performed on each reference resolution text; the reference resolution sentences of the same order in each reference resolution text are acquired and screened based on their entity recognition results, and the screened sentences are then merged to obtain a semantically processed text. By identifying the reference relationships in the text, the components that share a reference relationship can be associated with one another, which facilitates the subsequent reference resolution, normalizes the forms of expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the sentences that contain useful information, which effectively reduces the number of reference resolution sentences, makes document-level texts easier to process, and improves the quality of the text information. At the same time, the variety of reference resolution sentences contained in the resulting semantically processed text is increased, enriching the semantic information in the text, so that clearer semantic information can be obtained during text information extraction.
Further, after the components of a reference relationship are acquired from the initial text, the main component of the reference relationship can be selected, and some or all of the components sharing that reference relationship can be acquired from the initial text and replaced with the main component, yielding a processed reference resolution text. By selecting a main component and substituting it for the other components of the same reference relationship, the forms of expression of those components are unified throughout the text, which facilitates the subsequent entity recognition.
Further, an entity prediction probability matrix for each sentence can be obtained from a preset entity recognition model; by comparing the first prediction probability vector with a preset first threshold and the second prediction probability vector with a preset second threshold, the entity distribution position intervals of each sentence can be obtained, and the components within the corresponding intervals are taken as entities. Because the start and end positions of an entity are judged separately, single-character entities can be recognized; and because thresholds are used as the judgment condition, each entity can be recovered from its position interval even in the complex case where a text contains several entities, which increases the accuracy of entity acquisition.
With the information extraction scheme of the embodiments of this specification, after the initial text has been turned into a semantically processed text by the text processing scheme, information recognition can be performed on the semantically processed text and the corresponding components acquired from it. Because the semantic information of the text used for extraction has been optimized, clearer semantic information can be obtained from the semantically processed text, improving the reliability of the information extraction task and therefore the accuracy of the information extraction result.
Drawings
To illustrate the technical solutions of the embodiments of this specification more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below are obviously only some embodiments of this specification, and a person skilled in the art could obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a text processing method in an embodiment of the present specification;
FIG. 2 is a flowchart of a method for obtaining a reference resolution text entity in an embodiment of the present specification;
FIG. 3 is a flow chart of a method for training an entity recognition model in an embodiment of the present disclosure;
fig. 4 is a flowchart of an information extraction method in an embodiment of the present specification.
Detailed Description
In the era of exploding internet information, reasonable screening of internet information is required in order to quickly acquire the needed information from the mass of information on the internet, and Information Extraction (IE) technology arose to meet this need. Information extraction structures unstructured text so that specific information, such as entities (Entity), relationships (Relation), and events (Event), can be extracted from it.
During information extraction, the semantic relationships between events and entities are often scattered across different positions in a text, and an entity may have several different forms of expression.
For example, since a text may consist of several paragraphs, an entity in a semantic relationship may appear in the text in referring form: each paragraph may contain references (anaphora) that point to the entity, and some entities may even be omitted for the sake of contextual continuity and fluency. As a result, not all of the semantic information in the text can be recognized during information extraction.
In view of these problems, embodiments of this specification provide a text processing scheme: after reference resolution is performed on the initial text, entity recognition can be performed on each reference resolution text; the reference resolution sentences of the same order in each reference resolution text are then acquired and screened based on their entity recognition results, and the screened sentences are merged to obtain a semantically processed text with clear semantics.
To make the embodiments of this specification easier to understand and implement, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.
It should be understood that the embodiments described here are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art without creative effort on the basis of these embodiments fall within the scope of protection of this specification.
Referring to a flowchart of a text processing method shown in fig. 1, in an embodiment of this specification, the method may specifically include the following steps:
s11, identifying the reference relationship existing in the initial text.
In a specific implementation, the initial text may be a sentence-level text or a document-level text (also called a chapter-level text). By performing reference resolution on the initial text with an existing coreference resolution tool or algorithm and identifying the components that share a reference relationship, the reference relationships present in the initial text can be determined.
The coreference resolution tool may be a natural language processing tool such as Stanford CoreNLP (a natural language processing toolkit developed by Stanford University), which can perform coreference resolution on the original text through suitable code. The identified components sharing a reference relationship may include at least one of: nouns, pronouns, and zero pronouns.
S12, performing reference resolution processing on the initial text based on the identified reference relationships to obtain corresponding reference resolution texts.
The reference resolution processing may include anaphora resolution and cataphora resolution. Through reference resolution, the nouns, pronouns, and zero pronouns that share a reference relationship in the initial text can be resolved, so that components with the same reference relationship are expressed in a uniform way.
S13, performing entity recognition processing on each reference resolution text.
In a specific implementation, entities may be classified into different types according to the actual situation. For example, under grammatical rules an entity may be a subject, predicate, object, and so on, while under part-of-speech rules an entity may be a noun, verb, preposition, and so on. Before entity recognition is performed, the type of entity to be recognized may be set; for example, the model may be set to recognize subject-type entities, or to recognize noun-type entities.
S14, acquiring the reference resolution sentences of the same order in each reference resolution text, and screening those same-order sentences based on their entity recognition results.
In a specific implementation, reference resolution may produce several reference resolution texts, multiplying the number of reference resolution sentences; in particular, after a document-level initial text undergoes reference resolution, the number of reference resolution sentences grows sharply. The sentences of the same order in the various reference resolution texts can then be acquired by vertical alignment, and the same-order reference resolution sentences screened.
For example, the initial text may contain three initial sentences. After reference resolution, the first reference resolution text A also contains three reference resolution sentences a1, a2, a3, i.e. A = {a1, a2, a3}, and the second reference resolution text B contains three reference resolution sentences b1, b2, b3, i.e. B = {b1, b2, b3}. Vertically aligning the reference resolution sentences of the two texts, that is, aligning a1 with b1, a2 with b2, and a3 with b3, establishes the association between same-order sentences across the different reference resolution texts. When, say, the first-order reference resolution sentences need to be deleted, a1 and b1 of the first order can be acquired together, and so on, so that the same-order sentences in the reference resolution texts can be acquired conveniently and quickly.
The sentences that do not meet the condition can then be screened out based on the entity recognition results of the same-order reference resolution sentences, reducing the amount of data.
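The vertical alignment, screening, and merging of steps S14 and S15 can be sketched with zip() (the screening predicate used here, keeping only sentences whose entity recognition result is non-empty, is one plausible choice; the patent leaves the exact condition open, and all names are illustrative):

```python
def screen_and_merge(resolution_texts, entity_results, keep):
    """`resolution_texts`: list of reference resolution texts, each a
    list of sentences in the same order. `entity_results`: a parallel
    structure holding the entities recognized in each sentence.
    zip() aligns the same-order sentences vertically; the predicate
    `keep` decides, per aligned sentence, whether it survives."""
    merged = []
    for sent_group, ent_group in zip(zip(*resolution_texts),
                                     zip(*entity_results)):
        for sentence, entities in zip(sent_group, ent_group):
            if keep(sentence, entities):
                merged.append(sentence)
    return " ".join(merged)

# Two resolution texts A and B, each with two same-order sentences.
A = ["a1 mentions Alice.", "a2 has no entity."]
B = ["b1 mentions Bob.", "b2 has no entity."]
ents = [[["Alice"], []], [["Bob"], []]]
semantic_text = screen_and_merge([A, B], ents, keep=lambda s, e: len(e) > 0)
print(semantic_text)  # a1 mentions Alice. b1 mentions Bob.
```

Only the first-order sentences survive the screening here, and the merge concatenates them into the semantically processed text.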
S15, merging the screened reference resolution sentences to obtain a semantically processed text.
The number of screened reference resolution sentences is not less than the number of sentences in the initial text.
In this scheme, by identifying the reference relationships in the text, the components that share a reference relationship can be associated with one another, which facilitates the subsequent reference resolution, normalizes the forms of expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the sentences that contain useful information, which effectively reduces the number of reference resolution sentences, makes document-level texts easier to process, and improves the quality of the text information. At the same time, the variety of reference resolution sentences contained in the resulting semantically processed text is increased, enriching the semantic information in the text, so that clearer semantic information can be obtained during text information extraction.
In a specific implementation, if several reference relationships are identified, or a reference relationship has several corresponding components, reference resolution can be performed selectively on the initial text, yielding at least one reference resolution text.
For example, when several reference relationships are identified, the components of some of the relationships may be acquired from the initial text and resolved to obtain one reference resolution text, or the components of all of the relationships may be acquired and resolved to obtain another reference resolution text.
For another example, when a reference relationship has several corresponding components, the components may be acquired from the initial text, the main component selected from among them, and some or all of the components sharing the reference relationship with the main component acquired from the initial text and replaced with the main component, yielding a corresponding reference resolution text.
According to a preset selection rule, at least one of the acquired components of a reference relationship can be selected as the main component, and a reference resolution text obtained for each main component. If the initial text contains at least two components sharing the reference relationship with the main component, at least one of them can be acquired from the initial text and replaced with the main component to obtain the corresponding reference resolution text.
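The longest-component selection rule described here can be sketched as follows (the function name is illustrative; a different rule, such as preferring a proper noun, would swap out the key function):

```python
def choose_main_component(mentions, rule="longest"):
    """Pick the main component of one reference relationship.
    `mentions` is the list of component strings sharing the relation."""
    if rule == "longest":
        return max(mentions, key=len)  # longest component wins
    raise ValueError(f"unknown selection rule: {rule}")

# The chain {"Xiaoming", "he", "he"}: "Xiaoming" is the longest component,
# so it becomes the main component that replaces the two pronouns.
print(choose_main_component(["Xiaoming", "he", "he"]))  # Xiaoming
```

Keeping the rule behind a single parameter is what lets the scheme change the selection rule "according to the actual situation", as the description notes below.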
By adopting the scheme, the main components are selected and replaced for other components with the same reference relationship, so that the expression modes of the components with the same reference relationship in the text can be unified, and the subsequent entity identification processing is facilitated.
In one embodiment of the present specification, the initial text may be: {Xiaoming's schoolbag is a birthday gift that his dad gave to him.} Here the braces "{ }" merely delimit the content of the example and are not part of the initial text; a person skilled in the art may use other symbols that are not easily confused to delimit the content, and the braces below are used in the same way.
Through reference resolution processing, the components "Xiaoming", "his", and "him" in the initial text are found to share the same reference relationship, so one reference relationship exists in the initial text. On that basis, if the preset selection rule is to select the longest component of the relationship as the main component, "Xiaoming" is taken as the main component, and at least one of "his" and "him" in the initial text can be selected for reference resolution processing, yielding a corresponding reference resolution text.
As one alternative example, replacing "his" in the initial text yields the first reference resolution text: {Xiaoming's schoolbag is a birthday gift that Xiaoming's dad gave to him.}
As another alternative example, replacing "him" yields the second reference resolution text: {Xiaoming's schoolbag is a birthday gift that his dad gave to Xiaoming.}
As yet another alternative example, replacing both "his" and "him" yields the third reference resolution text: {Xiaoming's schoolbag is a birthday gift that Xiaoming's dad gave to Xiaoming.}
In another embodiment of the present specification, the initial text may be: {Xiaoming is reading The Weakness of Human Nature; this book is too profound for him to understand.}
Through reference resolution processing, it can be found that the components "Xiaoming" and "him" in the initial text share the same reference relationship, reference relationship 1, and that the components "The Weakness of Human Nature" and "this book" share the same reference relationship, reference relationship 2.
At this point, the components of reference relationship 1 alone may be resolved, giving the reference resolution text of relationship 1: {Xiaoming is reading The Weakness of Human Nature; this book is too profound for Xiaoming to understand.} Or the components of reference relationship 2 alone may be resolved, giving the reference resolution text of relationship 2: {Xiaoming is reading The Weakness of Human Nature; The Weakness of Human Nature is too profound for him to understand.} Or the components of both relationships may be resolved, giving a reference resolution text common to relationships 1 and 2: {Xiaoming is reading The Weakness of Human Nature; The Weakness of Human Nature is too profound for Xiaoming to understand.}
It should be noted that the above embodiments are for illustration only; in practical applications, the various alternative examples can be combined and cross-referenced where they do not conflict, extending the possible embodiments, all of which can be considered embodiments disclosed in the present specification. For example, after the first reference resolution text is obtained by replacing the first "he" in the initial text with "Xiaoming", a second reference resolution text can be obtained by replacing the second "he" in the initial text with "Xiaoming", resulting in two reference resolution texts.
It can be understood that, in practical applications, a text may contain more complex sentence structures, so there may be multiple reference relationships, each with several corresponding components; selecting among the multiple reference relationships and their components yields many permutation-and-combination selection schemes. Moreover, the preset selection rule can change with the actual situation. For example, if the above component "The Weakness of Human Nature" is changed to the shorter book title "Piao" (Gone with the Wind), the component "Piao" is shorter than the component "this book": if the preset selection rule selects the longest component within a reference relationship as the main component, "this book" becomes the main component; if the rule instead selects the proper noun within a reference relationship, "Piao" becomes the main component. The embodiments of the present specification do not limit the text content or the selection rule for main components.
In a specific implementation, to facilitate locating the positions of components that share a reference relationship in a text, the initial text may be preprocessed: according to a preset set of ending symbols, the initial text is split into at least one initial sentence, so that the preprocessed initial text can be regarded as a set containing at least one initial sentence. By analogy, a reference resolution text obtained through reference resolution processing can be regarded as a set containing at least one reference resolution sentence.
The preset ending symbol set may be set according to the language type of the initial text. For example, if the language type of the initial text is Chinese, the ending symbol set may include the Chinese period, semicolon, question mark, exclamation mark, and the like; if the language type of the initial text is English, the ending symbol set may include the English period, semicolon, question mark, exclamation mark, and the like.
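As an illustrative sketch (not part of the patent), splitting an initial text into initial sentences by a preset ending symbol set could look like the following; the symbol set and the helper name are assumptions:

```python
import re

# Hypothetical preset ending symbol set covering Chinese and English terminators.
ENDING_SYMBOLS = "。；？！.;?!"

def split_sentences(text: str) -> list[str]:
    """Split an initial text into initial sentences, keeping each ending
    symbol attached to its sentence. Text after the last ending symbol
    is ignored in this sketch."""
    cls = re.escape(ENDING_SYMBOLS)
    pattern = f"[^{cls}]*[{cls}]"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

# The preprocessed initial text can then be treated as a set of sentences.
sentences = split_sentences("张三周一提名了李四。他选择她；因为这位同事经验丰富！")  # three initial sentences
```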
In a specific implementation, each reference resolution text may be input into a preset entity recognition model to obtain its entity prediction probability matrix; the distribution positions in each entity prediction probability matrix that meet a preset first condition are then determined, and the components at the corresponding distribution positions in each reference resolution text are taken as entities, thereby obtaining the entity recognition result.
In actual implementation, a reference resolution text can be regarded as a set containing at least one reference resolution sentence. After the reference resolution processing, the dependency between the reference resolution sentences is reduced, so each reference resolution sentence can be input into the preset entity recognition model separately for entity recognition, improving entity recognition efficiency.
In an embodiment of the present specification, the entity recognition model may include an input layer, an encoding layer, a fully connected layer, a decoding layer, and an output layer, wherein:
1) The input layer may receive an input reference resolution sentence C = {c1, c2, …, cm}, the reference resolution sentence C containing m tokens, where a token may be a punctuation mark or a word segmentation unit; word segmentation units and punctuation marks are the minimum sentence composition units of the corresponding language type.
Here m may be a hyper-parameter of the preset entity recognition model that limits the input sentence length. For example, if the length of a reference resolution sentence is not greater than 128, it may be input into the entity recognition model directly; if its length is greater than 128, it may be divided into several reference resolution sentence segments each of length not greater than 128, which are then input into the entity recognition model. Optionally, reference resolution sentences shorter than 128 may be padded so that the input lengths stay consistent.
In addition, the input layer may further map each token in the reference resolution sentence C to a value the entity recognition model can process according to a preset dictionary, obtaining the dictionary-mapped reference resolution sentence CID = {cid1, cid2, …, cidm}, where cid1, cid2, …, cidm are the index values of c1, c2, …, cm in the dictionary, respectively. The dictionary-mapped reference resolution sentence CID is then transmitted to the encoding layer.
2) The encoding layer encodes the dictionary-mapped reference resolution sentence CID through a preset language submodel to obtain a coding feature matrix [CE] = [CE1, CE2, …, CEm], where CE1, CE2, …, CEm are the coding feature vectors of cid1, cid2, …, cidm, respectively, and the dimension of each coding feature vector is determined by the parameters of the language submodel. The coding feature matrix [CE] is transmitted to the fully connected layer.
The encoding process may use an embedding algorithm to vectorize the index values cid1, cid2, …, cidm in the dictionary-mapped reference resolution sentence CID.
3) The fully connected layer may perform dimension reduction on the coding feature matrix [CE] to obtain a reduced coding feature matrix [CE'], which is transmitted to the decoding layer. The fully connected layer may implement the dimension reduction using an MLP (Multi-Layer Perceptron).
4) The decoding layer may perform nonlinear mapping on the reduced coding feature matrix [CE'] to obtain an entity prediction probability matrix [CY]; the nonlinear mapping may use an activation function such as Sigmoid. The decoding layer then determines, according to a preset first condition, the distribution positions in the entity prediction probability matrix of each reference resolution text that meet the first condition, and takes the components at the corresponding distribution positions in each reference resolution text as entities.
5) The output layer outputs the entity recognition result.
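To make the five-layer flow concrete, here is a toy, pure-Python sketch (not the patent's implementation): random stand-in weights replace the pre-trained language submodel, and the vocabulary size and dimensions are invented for illustration:

```python
import math
import random

random.seed(0)

# Invented toy dimensions: vocabulary of 50 tokens, 8-dim coding features,
# reduced by the fully connected layer to 2 logits per token (start / end).
VOCAB, HIDDEN = 50, 8

# Lookup table stands in for the language submodel's encoding layer.
embedding = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
# Fully connected (dimension reduction) weights.
w_fc = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(HIDDEN)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(cid):
    """Input layer CID -> encoding layer [CE] -> FC layer [CE'] -> decoding layer [CY]."""
    y_start, y_end = [], []
    for idx in cid:
        ce = embedding[idx]                 # coding feature vector CE_i
        logits = [sum(c * w for c, w in zip(ce, col)) for col in zip(*w_fc)]
        y_start.append(sigmoid(logits[0]))  # start-position probability
        y_end.append(sigmoid(logits[1]))    # end-position probability
    return y_start, y_end

y_s, y_e = forward([3, 17, 42, 7])          # a dictionary-mapped sentence CID
```

The output is one start probability and one end probability per token, matching the first/second prediction probability vectors described below.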
In a specific implementation, the entity prediction probability matrix of each reference resolution text may include: a first prediction probability vector representing the entity prediction start positions in that reference resolution text, and a second prediction probability vector representing the entity prediction end positions in that reference resolution text.
As shown in fig. 2, determining the distribution positions that meet the preset first condition in the entity prediction probability matrix of each reference resolution text, and taking the components at the corresponding distribution positions in each reference resolution text as entities, may include:
S21, comparing the first prediction probability vector with a preset first threshold, and determining the distribution positions where the probability values in the first prediction probability vector are greater than the first threshold, to obtain entity prediction start position distribution information of each reference resolution text.
S22, comparing the second prediction probability vector with a preset second threshold, and determining the distribution positions where the probability values in the second prediction probability vector are greater than the second threshold, to obtain entity prediction end position distribution information of each reference resolution text.
S23, obtaining the entity distribution position intervals of each reference resolution text based on its entity prediction start position distribution information and entity prediction end position distribution information, and taking the components within the corresponding distribution position intervals in each reference resolution text as entities.
The first threshold and the second threshold may be set according to an actual situation, and the first threshold and the second threshold may be the same or different.
For example, let the entity prediction probability matrix be [CY] = [ŷs; ŷe], where ŷs = (ŷs1, ŷs2, …, ŷsm) is the first prediction probability vector and ŷe = (ŷe1, ŷe2, …, ŷem) is the second prediction probability vector. With the first threshold us and the second threshold ue, the inequalities ŷsi > us and ŷei > ue are solved to find the distribution positions i whose probability values satisfy them, yielding the entity prediction start position distribution information and the entity prediction end position distribution information, from which the entity distribution position intervals are finally determined.
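Steps S21–S23 can be sketched as follows; note that the pairing policy (each start position paired with the nearest end position at or after it) is an assumption for illustration, since the text only specifies the thresholding:

```python
def extract_spans(y_start, y_end, u_s=0.5, u_e=0.5):
    """S21/S22: threshold the start/end probability vectors; S23: pair each
    start with the nearest end at or after it to form entity position intervals."""
    starts = [i for i, p in enumerate(y_start) if p > u_s]   # start distribution info
    ends = [i for i, p in enumerate(y_end) if p > u_e]       # end distribution info
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, candidates[0]))  # s == e gives a single-character entity
    return spans

# Two entities: positions 0-1, and a single-character entity at position 3.
y_s = [0.9, 0.1, 0.2, 0.8, 0.1]
y_e = [0.2, 0.9, 0.1, 0.7, 0.2]
print(extract_spans(y_s, y_e))  # → [(0, 1), (3, 3)]
```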
With this scheme, the start position and the end position of an entity are judged separately, so entities consisting of a single character can also be obtained; and since thresholds are used as the judgment condition, each entity can be obtained through its distribution position interval even in the complex case where the text contains multiple entities, increasing the accuracy of entity extraction.
In a specific implementation, as shown in fig. 3, before the entity recognition processing is performed on each reference resolution text, the entity recognition model may be trained, which specifically includes:
S31, inputting a preset training corpus and the entity true probability matrix of the training corpus into the entity recognition model for training, to obtain the entity prediction probability matrix of the training corpus.
S32, performing error calculation based on the entity prediction probability matrix of the training corpus and the entity true probability matrix of the training corpus, to obtain a result error value.
S33, if the result error value meets a preset training completion condition, the entity recognition model completes training; otherwise, the parameters of the entity recognition model are adjusted, and the training corpus and its entity true probability matrix are input into the parameter-adjusted entity recognition model for training, until the entity recognition model completes training.
In a specific implementation, the result error value may be calculated by a loss function. When the result error value is greater than the result error threshold, the preset training completion condition is not met, the entity recognition model has not finished training, and its parameters may be adjusted; when the result error value is smaller than the result error threshold, the training completion condition is met and the entity recognition model completes training.
Optionally, an error-agreement count condition may be added to avoid judging that the entity recognition model has finished training from a single erroneous result error value. For example, when the result error value is less than the result error threshold, the error-agreement count is incremented by one and compared with an error count threshold; if the count is greater than or equal to the error count threshold, the entity recognition model completes training, otherwise its parameters may still be adjusted.
According to the loss function, the parameters of the entity recognition model can be adjusted by gradient descent or back propagation, and the training data and its entity true probability matrix are input into the adjusted entity recognition model again, until an entity recognition model meeting the training completion condition is obtained.
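Steps S31–S33 can be sketched as a toy training loop; the binary cross-entropy error function, the one-dimensional features, and the hyper-parameter values are all assumptions for illustration, not the patent's settings:

```python
import math
import random

random.seed(1)
sigmoid = lambda z: 1 / (1 + math.exp(-z))

def bce(y_true, y_pred, eps=1e-9):
    """Binary cross-entropy between a true and a predicted probability vector."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy corpus: one feature per token; the "true probability" is 1 when the
# feature is positive (a stand-in for a real annotated corpus).
xs = [random.uniform(-1, 1) for _ in range(32)]
ys = [1.0 if x > 0 else 0.0 for x in xs]

w, b, LR, ERROR_THRESHOLD = 0.0, 0.0, 0.5, 0.3
loss = float("inf")
for step in range(2000):
    preds = [sigmoid(w * x + b) for x in xs]   # S31: forward pass
    loss = bce(ys, preds)                      # S32: result error value
    if loss < ERROR_THRESHOLD:                 # S33: training completion condition
        break
    grad_w = sum((p - t) * x for p, t, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - t for p, t in zip(preds, ys)) / len(xs)
    w -= LR * grad_w                           # gradient descent parameter update
    b -= LR * grad_b
```

When the result error value drops below the threshold, the loop exits, mirroring the completion condition of S33.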
It can be understood that the entity types the trained entity recognition model can recognize are determined by the training corpus and the model parameters. For example, if a corpus containing subject information is input and the model parameters are adjusted according to the recognition results and the actual subject data of the corpus, the finally trained entity recognition model can recognize entities of the subject type. Different training corpora can be selected and different model parameters set according to actual requirements, which is not limited by the embodiments of the present specification.
In a specific implementation, the entity prediction probability matrix of the training corpus may include: a third prediction probability vector representing the entity prediction start positions in the training corpus, and a fourth prediction probability vector representing the entity prediction end positions in the training corpus;
the entity true probability matrix of the training corpus may include: a first true probability vector representing the true entity start positions in the training corpus, and a second true probability vector representing the true entity end positions in the training corpus;
thus, performing the error calculation based on the entity prediction probability matrix of the training corpus and the entity true probability matrix of the training corpus to obtain a result error value may include:
error calculation is performed by using the following loss function to obtain a result error value:
loss = −Σi [ ysi·log(ŷsi) + (1 − ysi)·log(1 − ŷsi) ] − Σi [ yei·log(ŷei) + (1 − yei)·log(1 − ŷei) ]
where ysi is the ith probability value of the first true probability vector ys; yei is the ith probability value of the second true probability vector ye; ŷsi is the ith probability value of the third prediction probability vector ŷs; ŷei is the ith probability value of the fourth prediction probability vector ŷe; and i is a natural number.
The loss function loss can be understood as a formula for calculating the distance between the entity true probability matrix and the entity prediction probability matrix: if the entity recognition model outputs ŷs, ŷe are very close to the preset ys, ye, the result error value calculated from loss is very small, and conversely it is very large.
In a specific implementation, the language submodel may be pre-trained such that the pre-trained language submodel is able to capture context information in depth.
In a specific implementation, screening the reference resolution sentences in the same order based on the entity recognition results of those sentences may include at least one of the following:
1) Acquiring the reference resolution sentences with the same entity recognition result from the reference resolution sentences in the same order, and deleting among them based on a preset second condition.
In particular implementations, the second condition may be set based on the entity prediction probability matrix.
For example, suppose the reference resolution sentences that are in the same order and have the same entity recognition result across the reference resolution texts are D11, D12 and D13, where D11 is the first-order reference resolution sentence in the first reference resolution text, D12 is the first-order reference resolution sentence in the second reference resolution text, and D13 is the first-order reference resolution sentence in the third reference resolution text.
From the entity prediction probability matrices corresponding to D11, D12 and D13, the entity prediction probability values of D11, D12 and D13 can be obtained and taken as their confidence; the higher the confidence, the more reliable the reference resolution sentence. Therefore, the reference resolution sentence with the highest entity prediction probability value can be retained and the others deleted. If the entity prediction probability values are the same, one of the reference resolution sentences can be kept, in order or at random, and the others deleted.
The entity prediction probability value can be calculated from the probability values in the entity prediction probability matrix that meet the first and second thresholds, or from all probability values in the entity prediction probability matrix; the operation may be a summation, an average, a weighted average, or the like, which is not limited by the present specification.
Since these reference resolution sentences are in the same order and have the same entity recognition result, keeping only one of them ensures that the number of entities does not change while reducing the amount of data.
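A small sketch of this screening rule, assuming the entity prediction probability value is the average of the probability values exceeding the thresholds (one of the operations the text permits); the sentence names are hypothetical:

```python
def sentence_confidence(y_start, y_end, u_s=0.5, u_e=0.5):
    """Entity prediction probability value of a reference resolution sentence:
    here, the average of the probability values exceeding the two thresholds."""
    hits = [p for p in y_start if p > u_s] + [p for p in y_end if p > u_e]
    return sum(hits) / len(hits) if hits else 0.0

# D11 and D12 have the same entity recognition result; keep the sentence
# with the higher confidence and delete the other.
d11 = sentence_confidence([0.9, 0.1], [0.1, 0.8])  # (0.9 + 0.8) / 2
d12 = sentence_confidence([0.7, 0.1], [0.1, 0.6])  # (0.7 + 0.6) / 2
kept = "D11" if d11 >= d12 else "D12"
```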
2) Acquiring the reference resolution sentences with different entity recognition results from the reference resolution sentences in the same order, permuting and combining them, and deleting among the permuted-and-combined reference resolution sentences based on a preset third condition.
In a specific implementation, because the positions of the replaced main components differ between texts, reference resolution sentences in the same order may not yield the same entity recognition result. In that case, to delete redundant sentence combinations and reduce the data amount while ensuring that the number of entities does not change, the combination of reference resolution sentences that covers the total number of entities with the smallest number of sentences needs to be kept.
For example, suppose the reference resolution sentences that are in the same order but have different entity recognition results across the reference resolution texts are F11, F12 and F13, where F11, the first-order reference resolution sentence in the first reference resolution text, identifies the entity entity1; F12, the first-order reference resolution sentence in the second reference resolution text, identifies the two entities entity1 and entity2; and F13, the first-order reference resolution sentence in the third reference resolution text, identifies the entity entity3. Permuting and combining F11, F12 and F13 gives the subset of permuted-and-combined reference resolution sentences {F11, F12, F13, F11+F12, F12+F13, F11+F13, F11+F12+F13}, with the total number of entities being three: entity1, entity2 and entity3.
According to the entity recognition results, it can be determined that the permuted-and-combined reference resolution sentences F11, F12, F13, F11+F12 and F11+F13 do not each include all three entities, while F12+F13 and F11+F12+F13 each include all three entities. Comparing the number of sentences, F12+F13 contains fewer sentences than F11+F12+F13, so the permuted-and-combined reference resolution sentence F12+F13 can be retained and the other combinations deleted.
With this scheme, sentences with the same entity recognition result can be deduplicated, or sentences with different entity recognition results can be permuted and combined and the redundant combinations deleted, ensuring that the remaining sentence set retains complete entity information.
In a specific implementation, after acquiring the reference resolution sentences with different entity recognition results from the reference resolution sentences in the same order, and before permuting and combining them, it may be determined whether any single one of them already identifies a number of entities equal to the total number of entities; if so, that reference resolution sentence may be retained and the others deleted, reducing the processing flow and processing amount and improving sentence screening efficiency.
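The screening of permuted-and-combined sentences can be sketched as a smallest-cover search; the data structure (sentence name mapped to its set of recognized entities) is an assumption for illustration:

```python
from itertools import combinations

def select_min_combination(sentence_entities):
    """Among all permutations/combinations of sentences with differing entity
    recognition results, keep the smallest combination covering the total
    entity set (the screening rule of the preset third condition)."""
    total = set().union(*sentence_entities.values())
    names = list(sentence_entities)
    for size in range(1, len(names) + 1):        # smallest combinations first
        for combo in combinations(names, size):
            covered = set().union(*(sentence_entities[n] for n in combo))
            if covered == total:
                return combo
    return tuple(names)

# F11 identifies entity1; F12 identifies entity1+entity2; F13 identifies entity3.
result = select_min_combination({
    "F11": {"entity1"},
    "F12": {"entity1", "entity2"},
    "F13": {"entity3"},
})
print(result)  # → ('F12', 'F13')
```

A single sentence that already covers all entities is returned at size 1, matching the shortcut described above.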
In practical applications, the semantic processing text obtained by the text processing method according to any of the above embodiments can be used in the information extraction field, and the following detailed description is given by using embodiments and accompanying drawings.
Referring to a flowchart of an information extraction method shown in fig. 4, in an embodiment of this specification, the method may specifically include the following steps:
s41, identifying the reference relationship existing in the initial text.
And S42, performing reference resolution processing on the initial text based on the identified reference relationship to obtain a corresponding reference resolution text.
And S43, respectively carrying out entity recognition processing on each reference resolution text.
S44, acquiring the reference resolution sentences in the same order in each reference resolution text, and screening the reference resolution sentences in the same order based on the entity recognition results of the reference resolution sentences in the same order.
And S45, merging the filtered reference resolution sentences to obtain a semantic processing text.
And S46, performing information identification processing on the semantic processing text, and extracting corresponding components in the semantic processing text based on the information identification result.
Wherein the information identification process may include: label classification processing, sequence labeling processing and the like.
Therefore, with the above text processing method, identifying the reference relationships in a text establishes associations between the components that share a reference relationship, which facilitates the subsequent reference resolution processing, normalizes the expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the reference resolution sentences that contain useful information, effectively reducing the number of reference resolution sentences and improving the quality of the text information. The resulting semantic processing text contains a variety of reference resolution sentences, which increases the diversity of text sentences and enriches the semantic information in the text, so that clearer semantic information is available when extracting text information, improving the accuracy of the information extraction result.
In a specific implementation, the text processing method can process document-level text and normalize the expression in the text, reducing the dependency between sentences in document-level text, increasing the diversity of text sentences, and enriching the semantic information in the text, thereby alleviating the problem that entities cannot be aligned during information extraction and enabling document-level information extraction tasks.
The relation extraction task is one specific application of the information extraction task, and currently there are two main relation extraction approaches. The first is end-to-end joint entity recognition and relation recognition: a preset relation extraction model performs a single computation and identifies all SPO (Subject, Predicate, Object) triple relations in the text. A relation extraction model implementing this approach has a very high output dimensionality and extremely sparse data, and requires a large amount of data to support training; yet the training process is difficult to converge, so a well-trained relation extraction model is difficult to obtain.
The second approach is step-by-step extraction: first find all entities in the text, then randomly select two entities to obtain an entity pair (also called a relation element pair in the relation extraction task), and judge the relation between the two entities from the entity pair. Although this avoids the problems of the first approach, extracting a large number of meaningless entity pairs produces much wrong relation information or misses some relation information. Moreover, depending on the semantic relations in the sentences, an entity may be part of the subject in some sentences and part of the object in others; the second approach cannot identify relation information whose subject and object are the same or partially the same entity, so the accuracy of the information extraction result is low and difficult to improve.
With the information extraction method adopted in the embodiments of the present specification, the text processing method described in any of the above embodiments can solve the problems of entity alignment and complex relations in text, and optimize the semantic information used for information extraction, so that clearer semantic information can be obtained from the semantic processing text. This improves the accuracy of the relation recognition result and of the entity-pair extraction result, allows correct components to be obtained from the semantic processing text, and improves the reliability and accuracy of the relation extraction task.
As an alternative example, to solve the problem that relation information whose subject and object are the same or partially the same entity cannot be recognized, the following information recognition processing may be performed in the embodiments of the present specification on the semantic processing text obtained by the text processing method described in any of the above embodiments:
performing relation recognition processing on the semantic processing text to obtain the relation labels corresponding to the semantic processing text; performing relation element recognition processing on the semantic processing text based on each relation label to obtain the relation element pair corresponding to that relation label; and performing analysis processing based on the relation labels and their corresponding relation element pairs to obtain a set of relation triples.
With this scheme, relation recognition processing is performed first, so one or more pieces of relation information existing in the semantic processing text can be identified and a corresponding number of relation labels obtained; relation element recognition is then performed for each relation label to obtain the entity pair corresponding to it. Entity pairs for the specified relation information are thus searched purposefully rather than sampled randomly from the text, which improves the accuracy of the obtained entity pairs and solves the problem that relation information whose subject and object are the same or partially the same entity cannot be recognized. In addition, the entity pairs of the various relation information in the semantic processing text can be accurately identified through the relation labels, avoiding misjudging or missing relation information when judging it from entity pairs, and improving the accuracy of the relation extraction result.
Further optionally, relation recognition processing (one specific application of label classification processing) may be performed on the semantic processing text through a preset relation recognition model, and relation element recognition processing (one specific application of sequence labeling processing) may be performed on the semantic processing text through a preset relation element extraction model. The whole relation extraction process thus comprises a relation recognition model and a relation element extraction model; either can be optimized independently according to the actual situation without adjusting the parameters of the whole relation extraction task, improving parameter tuning efficiency, and either can be replaced according to the actual situation, making the relation extraction task flexible and more widely applicable.
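The two-stage pipeline (relation labels first, then relation element pairs per label) can be sketched generically; the stand-in model functions below are hypothetical placeholders for the two preset models:

```python
def extract_triples(text, classify_relations, extract_elements):
    """Stage 1: relation recognition yields the relation labels for the text.
    Stage 2: per label, relation element recognition yields (subject, object)
    pairs; analysis combines them into SPO relation triples."""
    triples = set()
    for label in classify_relations(text):
        for subj, obj in extract_elements(text, label):
            triples.add((subj, label, obj))
    return triples

# Stand-in "models" for illustration only.
classify = lambda text: ["nominated"] if "nominated" in text else []
elements = lambda text, label: [("Zhang San", "Li Si")] if label == "nominated" else []
spo = extract_triples("Zhang San nominated Li Si.", classify, elements)
```

Because the element extractor is called once per recognized label, entity pairs are searched purposefully for each relation rather than sampled at random.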
In one embodiment of the present specification, the document-level initial text S0 is: { On Monday, Zhang San nominated Li Si as the general manager of a certain company. He chose her because this colleague has rich working experience as a pre-sales director. } The relation elements can be extracted from the initial text S0, specifically including:
1) After reference resolution processing by a reference resolution tool (such as Stanford CoreNLP), the reference relationships existing in the initial text S0 can be obtained from the output reference chains (Chains). For example, the output reference chains may include:
Chain 0:
"Zhang San" in Sentence 1: [0, 2);
"he" in Sentence 2: [0, 1);
Chain 1:
"Li Si" in Sentence 1: [6, 8);
"she" in Sentence 2: [3, 4);
"this colleague" in Sentence 2: [7, 10);
Here "Chain 0" and "Chain 1" denote two separate reference chains, each containing the position information in the initial text S0 of every component under the corresponding reference relationship. For example, "in Sentence 1: [0, 2)" indicates that the component "Zhang San" occupies the first and second positions of the first-order initial sentence; by analogy, the position information in the initial text S0 of all components sharing a reference relationship can be obtained.
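A sketch of how such reference chains could drive the replacement step; the word-level sentences, indices, and chain structure below are hypothetical, merely mirroring the output format above:

```python
# Hypothetical sentences and chains, word-indexed like the [start, end) spans above.
sentences = [
    "Zhang San on Monday nominated Li Si as general manager .".split(),
    "He chose her because this colleague has rich experience .".split(),
]

# Each chain: a main component plus non-main mentions, as (sentence, start, end).
chains = {
    "Chain 0": {"main": (0, 0, 2), "others": [(1, 0, 1)]},
    "Chain 1": {"main": (0, 5, 7), "others": [(1, 2, 3), (1, 4, 6)]},
}

def resolve(sents, chain, which):
    """Produce one reference resolution text by substituting the main
    component's words for a single non-main mention."""
    mi, ms, me = chain["main"]
    main_words = sents[mi][ms:me]
    i, s, e = chain["others"][which]
    out = [list(sent) for sent in sents]   # copy, leaving the original intact
    out[i][s:e] = main_words
    return [" ".join(sent) for sent in out]

resolved = resolve(sentences, chains["Chain 0"], 0)
# resolved[1] now begins with the main component "Zhang San" instead of "He".
```

Replacing different subsets of non-main mentions produces the different reference resolution texts discussed below.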
2) According to the preset terminator set {period, semicolon, question mark, exclamation mark}, the initial text S0 is preprocessed and split into at least one initial sentence, yielding the preprocessed initial text S0 = {s1, s2}, where s1 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s2 = {He chose her because this colleague has rich working experience as a pre-sales director.}.
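The sentence-splitting step may be sketched as follows; the function name and the exact terminator handling are assumptions for illustration only:

```python
import re

# Minimal sketch: split an initial text into initial sentences at the
# preset terminator set {period, semicolon, question mark, exclamation mark}.
TERMINATORS = "。；？！.;?!"  # Chinese and ASCII variants (assumed)

def split_sentences(text):
    # Split *after* each terminator so it stays attached to its sentence,
    # then drop empty fragments.
    parts = re.split(r"(?<=[" + re.escape(TERMINATORS) + r"])", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("First sentence. Second sentence? Third!")
# sentences holds three initial sentences with terminators attached.
```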
3) One main component is selected from each of "Chain 0" and "Chain 1"; for example, the proper nouns "Zhang San" and "Li Si" are selected as the main components of "Chain 0" and "Chain 1" respectively, the other components in each chain being non-main components. According to a preset replacement principle, reference resolution processing is then performed on the initial text S0, replacing at least one non-main component of the initial text S0 to obtain the reference resolution texts. For example, the following three reference resolution texts S1, S2 and S3 can be obtained, in which the non-main components have been replaced by the corresponding main components:
S1 = {s11, s12}, where s11 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s12 = {Zhang San chose Li Si because this colleague has rich working experience as a pre-sales director.};
S2 = {s21, s22}, where s21 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s22 = {Zhang San chose her because Li Si has rich working experience as a pre-sales director.};
S3 = {s31, s32}, where s31 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s32 = {Zhang San chose Li Si because Li Si has rich working experience as a pre-sales director.}.
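The replacement of a non-main component by the main component of its chain may be sketched as follows; the function and variable names are illustrative assumptions:

```python
# Hypothetical sketch of the replacement principle: splice the main
# component of a reference chain over a non-main component's
# half-open [start, end) span in a sentence.
def resolve(sentence, span, main_component):
    start, end = span
    return sentence[:start] + main_component + sentence[end:]

s2 = "He chose her because this colleague has rich experience."
# Replace the non-main component "He" (span [0, 2)) with the main
# component "Zhang San" of its chain.
resolved = resolve(s2, (0, 2), "Zhang San")
# resolved == "Zhang San chose her because this colleague has rich experience."
```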
4) Entity recognition processing is performed on the reference resolution texts S1, S2 and S3 respectively; the same-order reference resolution sentences in the reference resolution texts are acquired, and the same-order reference resolution sentences are screened based on their entity recognition results to obtain the screened reference resolution sentences.
For example, the first-order reference resolution sentences of the texts S1, S2 and S3 are s11, s21 and s31 respectively, and the entity recognition results of s11, s21 and s31 are the same. In this case, the entity prediction probability values of s11, s21 and s31 can be obtained from the entity prediction probability matrices corresponding to these reference resolution sentences, and the sentences are screened on that basis; assume the retained reference resolution sentence is s11.
As another example, the second-order reference resolution sentences of the texts S1, S2 and S3 are s12, s22 and s32, where the entity recognition result of s12 is contained in that of s22, and the entity recognition results of s22 and s32 are the same. s22 and s32 are therefore screened according to their entity prediction probability values; assume the retained reference resolution sentence is s22. It can then be judged whether there is an identified reference resolution sentence whose number of entities equals the total number of entities; if the number of entities identified in s22 equals the total number of entities, s22 can be retained while s12 and s32 are deleted.
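The screening logic of step 4) may be sketched as follows; the data layout, function name and scores are assumptions, and in the described method the scores would come from the entity prediction probability matrices:

```python
# Hypothetical screening of same-order reference resolution sentences.
# Each candidate is (sentence id, recognized entity set, aggregate
# entity prediction probability value).
def screen(candidates):
    # Drop a candidate whose entity set is strictly contained in
    # another candidate's entity recognition result.
    kept = [c for c in candidates
            if not any(c[1] < other[1] for other in candidates)]
    # Among the remaining candidates, retain the highest-scoring one.
    return max(kept, key=lambda c: c[2])[0]

best = screen([
    ("s12", {"Zhang San"}, 0.90),           # contained in s22's result
    ("s22", {"Zhang San", "Li Si"}, 0.85),
    ("s32", {"Zhang San", "Li Si"}, 0.80),  # same result, lower score
])
# best == "s22"
```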
5) The screened reference resolution sentences are merged to obtain the semantic processing text S0′ = {s11, s22}.
6) Relation recognition processing is performed on the sentences s11 and s22 in the semantic processing text S0′ respectively to obtain the relation labels corresponding to s11 and s22. For example, the relation labels of sentence s11 may include {company, job, colleague}, and the relation labels of sentence s22 may include {job}.
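Relation recognition as label classification may be sketched as follows; the label set, threshold and scores are illustrative assumptions, since the preset relation recognition model itself is not specified here:

```python
# Hypothetical sketch: multi-label relation recognition as thresholded
# per-label probabilities for one sentence of the semantic processing text.
LABELS = ["company", "job", "colleague"]

def labels_from_scores(scores, threshold=0.5):
    """scores: per-label probabilities emitted by a relation
    recognition model (assumed) for one sentence."""
    return [label for label, p in zip(LABELS, scores) if p > threshold]

tags = labels_from_scores([0.91, 0.72, 0.66])
# tags == ["company", "job", "colleague"]
```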
7) According to the sentence s11 and its corresponding relation labels, and the sentence s22 and its corresponding relation labels in the semantic processing text S0′, relation element recognition processing is performed respectively to obtain the relation element pairs corresponding to the relation labels of s11 and s22.
For example, the relation element pairs of sentence s11 may include: the pair {Zhang San, a certain company} corresponding to the relation label "company", the pair {Li Si, general manager} corresponding to the relation label "job", and the pair {Zhang San, Li Si} corresponding to the relation label "colleague"; the relation element pairs of sentence s22 may include: the pair {Li Si, pre-sales director} corresponding to the relation label "job".
8) According to the relation labels and the corresponding relation element pairs, a set of relation triples can be obtained through parsing processing, specifically including:
{ "subject": Zhang III "," predict ": company", "object": certain company "};
{ "subject": Liqu "," predicate ": company", "object": certain company "};
{ "subject": lie four "," predict ": post", "object": total manager "};
{ "subject": lie four "," predict ": post", "object": front sales manager "};
{ "subject": Zhang III "," predict ": colleague", "object": Li IV ".
As an optional example, an equivalence replacement relation label set may be provided. If a relation label of the semantic processing text matches an equivalence replacement label in the set, a subject-object swap may be performed during parsing on the relation element pair corresponding to that equivalence replacement label, so as to obtain a pair of relation triples having an equivalence replacement relationship.
For example, suppose the equivalence replacement relation label set includes the classification label "colleague" representing colleague relationship information, and the semantic processing text is: {Xiaoming and Xiaohong are colleagues.}. Through relation recognition processing, it is obtained that colleague relationship information exists in the semantic processing text and that the related relation element pair is "Xiaoming" and "Xiaohong". During parsing processing, a subject-object swap is performed on "Xiaoming" and "Xiaohong" according to the equivalence replacement relation label set, so as to obtain a pair of relation triples with an equivalence replacement relationship, namely {"subject": "Xiaoming", "predicate": "colleague", "object": "Xiaohong"} and {"subject": "Xiaohong", "predicate": "colleague", "object": "Xiaoming"}.
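The subject-object swap for equivalence replacement labels may be sketched as follows; the label set and function name are assumptions:

```python
# Hypothetical sketch: for a relation label in the equivalence
# replacement relation label set, emit the triple together with its
# subject-object-swapped counterpart.
EQUIVALENCE_LABELS = {"colleague"}

def expand_equivalent(triple):
    result = [triple]
    if triple["predicate"] in EQUIVALENCE_LABELS:
        result.append({"subject": triple["object"],
                       "predicate": triple["predicate"],
                       "object": triple["subject"]})
    return result

pair = expand_equivalent(
    {"subject": "Xiaoming", "predicate": "colleague", "object": "Xiaohong"})
# pair contains both orderings of the colleague relation.
```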
The present specification further provides a text processing device, which may include a first memory and a first processor, where the first memory stores computer instructions executable on the first processor, and the first processor may execute the steps of the method according to any one of the foregoing embodiments of the present specification when executing the computer instructions.
In a specific implementation, the text processing device may further include a first display interface and a first display accessed through the first display interface. The first display may display a semantically processed text obtained by the first processor executing the text processing method provided in the embodiments of the present specification.
An embodiment of the present specification further provides an information extraction system, including the text processing apparatus and the information acquisition apparatus described in any of the above embodiments, where the text processing apparatus and the information acquisition apparatus establish a communication connection through a communication interface, where:
the information acquisition device comprises a second memory and a second processor; wherein the second memory is adapted to store one or more computer instructions that when executed by the second processor perform the steps of:
and performing information identification processing on the semantic processing text obtained by the text processing equipment, and acquiring corresponding components in the semantic processing text based on an information identification result.
In a specific implementation, the information obtaining device may further include a second display interface and a second display accessed through the second display interface. The second display may display a result of the information identification processing performed by the second processor in the information identification processing step provided in the embodiment of the present specification.
It is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or imply the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of the feature. Moreover, the terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the specification described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
The embodiment of the present invention further provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the method according to any of the above embodiments of the present invention may be executed. The computer readable storage medium may be various suitable readable storage media such as an optical disc, a mechanical hard disc, a solid state hard disc, and the like. The instructions stored in the computer-readable storage medium may be used to execute the method according to any of the embodiments, which may specifically refer to the embodiments described above and will not be described again.
The computer-readable storage medium may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, compact disk read Only memory (CD-ROM), compact disk recordable (CD-R), compact disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), a tape, a cassette, or the like.
The computer instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
Although the embodiments of the present specification are disclosed above, the embodiments of the present specification are not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the embodiments herein described, and it is intended that the scope of the embodiments herein described be limited only by the scope of the appended claims.

Claims (10)

1. A method of text processing, comprising:
identifying a reference relationship existing in the initial text;
performing reference resolution processing on the initial text based on the identified reference relationship to obtain a processed reference resolution text;
respectively carrying out entity recognition processing on each reference resolution text;
acquiring the reference resolution sentences in the same order in each reference resolution text, and screening the reference resolution sentences in the same order based on the entity recognition results of the reference resolution sentences in the same order;
and merging the filtered reference resolution sentences to obtain a semantic processing text.
2. The text processing method according to claim 1, wherein the performing the reference resolution processing on the initial text based on the identified reference relationship to obtain a processed reference resolution text comprises:
acquiring the components of the reference relationship from the initial text, and selecting the main components of the reference relationship from the components;
and acquiring part or all of the components with the same reference relation with the main components from the initial text, replacing the components with the main components, and obtaining a processed reference resolution text.
3. The text processing method according to claim 1 or 2, wherein the performing entity recognition processing on each reference resolution text respectively comprises:
respectively inputting each reference resolution text into a preset entity recognition model to obtain an entity prediction probability matrix of each reference resolution text;
determining distribution positions which accord with a preset first condition in the entity prediction probability matrix of each reference resolution text, and taking components of the corresponding distribution positions in each reference resolution text as entities to obtain an entity recognition result.
4. The text processing method according to claim 3, further comprising, before the performing entity identification processing on each of the reference resolution texts, respectively:
inputting a preset training corpus and an entity true probability matrix of the training corpus into the entity recognition model for training to obtain an entity prediction probability matrix of the training corpus;
performing error calculation based on the entity prediction probability matrix of the training corpus and the entity real probability matrix of the training corpus to obtain a result error value;
and if the result error value meets a preset training completion condition, finishing the training of the entity recognition model, otherwise, adjusting the parameters of the entity recognition model, and inputting the training corpus and the entity true probability matrix of the training corpus into the entity recognition model after the parameters are adjusted to train until the entity recognition model completes the training.
5. The text processing method according to claim 4, wherein the entity prediction probability matrix of each reference resolution text comprises: a first prediction probability vector used for representing the entity prediction starting positions in each reference resolution text, and a second prediction probability vector used for representing the entity prediction ending positions in each reference resolution text;
the determining the distribution positions in the entity prediction probability matrix of each reference resolution text, which meet a preset first condition, and taking the components of the corresponding distribution positions in each reference resolution text as entities, includes:
comparing the first prediction probability vector with a preset first threshold value, determining distribution positions with probability values larger than the first threshold value in the first prediction probability vector, and obtaining entity prediction initial position distribution information of each reference resolution text;
comparing the second prediction probability vector with a preset second threshold value, determining distribution positions of which the probability values in the second prediction probability vector are larger than the second threshold value, and obtaining entity prediction end position distribution information of each reference resolution text;
and obtaining entity distribution position intervals of each reference resolution text based on the entity prediction starting position distribution information and the entity prediction ending position distribution information of each reference resolution text, and obtaining components in the corresponding distribution position intervals in each reference resolution text as entities.
6. The method of claim 5, wherein the entity prediction probability matrix of the corpus comprises: a third prediction probability vector used for representing the entity prediction starting position in the training corpus and a fourth prediction probability vector used for representing the entity prediction ending position in the training corpus; the entity true probability matrix of the training corpus comprises: a first real probability vector used for representing a real starting position of an entity in the training corpus and a second real probability vector used for representing a real ending position of the entity in the training corpus;
the performing error calculation based on the entity prediction probability matrix of the training corpus and the entity real probability matrix of the training corpus to obtain a result error value comprises:
error calculation is performed by using the following loss function to obtain a result error value:
L = −Σi [ y_si · log(ŷ_si) + y_ei · log(ŷ_ei) ]

wherein y_si is the ith probability value in the first real probability vector; y_ei is the ith probability value in the second real probability vector; ŷ_si is the ith probability value in the third prediction probability vector; ŷ_ei is the ith probability value in the fourth prediction probability vector; and i is a natural number.
7. An information extraction method, comprising:
obtaining a semantically processed text by the text processing method of any one of claims 1-6;
and performing information identification processing on the semantic processing text, and extracting corresponding components in the semantic processing text based on an information identification result.
8. A text processing apparatus includes a first memory and a first processor; wherein the first memory is adapted to store one or more computer instructions, wherein the first processor when executing the computer instructions performs the steps of the method of any one of claims 1 to 6.
9. An information extraction system, comprising the text processing device of claim 8 and an information acquisition device, wherein:
the information acquisition device comprises a second memory and a second processor; wherein the second memory is adapted to store one or more computer instructions that when executed by the second processor perform the steps of:
and performing information identification processing on the semantic processing text obtained by the text processing equipment, and acquiring corresponding components in the semantic processing text based on an information identification result.
10. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the method of any one of claims 1 to 7.
CN202010537652.4A 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium Pending CN111695054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537652.4A CN111695054A (en) 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium


Publications (1)

Publication Number Publication Date
CN111695054A true CN111695054A (en) 2020-09-22

Family

ID=72480710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537652.4A Pending CN111695054A (en) 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium

Country Status (1)

Country Link
CN (1) CN111695054A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168947A (en) * 2017-04-19 2017-09-15 成都准星云学科技有限公司 A kind of method and its system of new entity reference resolution
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109325098A (en) * 2018-08-23 2019-02-12 上海互教教育科技有限公司 Reference resolution method for the parsing of mathematical problem semanteme
CN110705206A (en) * 2019-09-23 2020-01-17 腾讯科技(深圳)有限公司 Text information processing method and related device


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463942A (en) * 2020-12-11 2021-03-09 深圳市欢太科技有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN112541346A (en) * 2020-12-24 2021-03-23 北京百度网讯科技有限公司 Abstract generation method and device, electronic equipment and readable storage medium
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination