CN111695054A - Text processing method and device, information extraction method and system, and medium - Google Patents


Info

Publication number
CN111695054A
Authority
CN
China
Prior art keywords
text
entity
reference resolution
processing
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010537652.4A
Other languages
Chinese (zh)
Inventor
沈大框
张莹
陈成才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN202010537652.4A
Publication of CN111695054A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/251: Fusion techniques of input or preprocessed data
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The text processing method comprises the following steps: identifying the reference relationships present in an initial text; performing reference resolution on the initial text based on the identified reference relationships to obtain processed reference resolution texts; performing entity recognition on each reference resolution text; acquiring the reference resolution sentences of the same order in each reference resolution text and screening them based on their entity recognition results; and merging the screened reference resolution sentences to obtain a semantically processed text. With this method, clearer semantic information can be obtained and the accuracy of the information extraction result is improved.

Description

Text processing method and device, information extraction method and system, and medium
Technical Field
The embodiment of the specification relates to the technical field of information processing, in particular to a text processing method and device, an information extraction method and system, and a medium.
Background
In the era of exploding internet information, reasonable screening of internet information is required in order to quickly acquire the needed information from the mass of information on the internet, and Information Extraction (IE) technology arose to meet this need. Information extraction structures unstructured text so that information about entities (Entity), relationships (Relation), events (Event), and so on can be extracted from it.
During information extraction, the semantic relationships between events and entities are often scattered across different positions in a text, and an entity may have several different forms of expression. As a result, the semantic information in the text is unclear, and omissions or extraction errors may occur when information is extracted from the text.
Disclosure of Invention
In view of the above, embodiments of the present specification provide, in one aspect, a text processing method, apparatus, and medium that can obtain clearer semantic information during text information extraction.
In another aspect, embodiments of the present specification further provide an information extraction method, system, and medium that can improve the accuracy of information extraction results.
An embodiment of the present specification provides a text processing method, comprising:
identifying the reference relationships present in an initial text;
performing reference resolution on the initial text based on the identified reference relationships to obtain processed reference resolution texts;
performing entity recognition on each reference resolution text;
acquiring the reference resolution sentences of the same order in each reference resolution text, and screening those same-order sentences based on their entity recognition results;
and merging the screened reference resolution sentences to obtain a semantically processed text.
Optionally, performing reference resolution on the initial text based on the identified reference relationship to obtain a processed reference resolution text includes:
acquiring the components of the reference relationship from the initial text, and selecting the main component of the reference relationship from among them;
and acquiring, from the initial text, some or all of the components that share the reference relationship with the main component, and replacing them with the main component to obtain a processed reference resolution text.
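As a minimal sketch of this replacement step (the function name, the character-span representation, and the example sentence are all illustrative; the patent does not prescribe an implementation), the components sharing a reference relationship can be rewritten to the main component by character-offset replacement:

```python
def resolve_references(text, mention_spans, main_text):
    """Replace every mention span in `mention_spans` (a list of
    (start, end) character offsets into `text`, all belonging to one
    reference relationship) with the main component `main_text`.
    Spans are applied right-to-left so earlier offsets stay valid."""
    for start, end in sorted(mention_spans, reverse=True):
        text = text[:start] + main_text + text[end:]
    return text

# Toy example (illustrative sentence): the pronoun refers to "Alice".
sentence = "Alice lost her keys, so she retraced her steps."
spans = [(24, 27)]  # the character span of "she"
print(resolve_references(sentence, spans, "Alice"))
# Alice lost her keys, so Alice retraced her steps.
```

Replacing from the rightmost span first is the design choice that keeps this simple: substituting a longer or shorter string would otherwise shift every offset to its right.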
Optionally, performing entity recognition on each reference resolution text includes:
inputting each reference resolution text into a preset entity recognition model to obtain an entity prediction probability matrix for each reference resolution text;
determining the distribution positions in each entity prediction probability matrix that meet a preset first condition, and taking the components at the corresponding positions in each reference resolution text as entities, to obtain an entity recognition result.
Optionally, before performing entity recognition on each reference resolution text, the method further includes:
inputting a preset training corpus and its entity true probability matrix into the entity recognition model for training, to obtain an entity prediction probability matrix of the training corpus;
performing an error calculation based on the entity prediction probability matrix and the entity true probability matrix of the training corpus, to obtain a result error value;
and, if the result error value meets a preset training completion condition, finishing the training of the entity recognition model; otherwise, adjusting the parameters of the entity recognition model and inputting the training corpus and its entity true probability matrix into the adjusted model for training, until training is complete.
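The train, evaluate, adjust loop above can be sketched as follows. Everything here is a toy stand-in: a one-feature logistic model plays the role of the entity recognition model, binary cross-entropy plays the role of the error calculation, and an error threshold plays the role of the training completion condition:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_true, y_pred):
    # Mean binary cross-entropy between true and predicted probabilities.
    eps = 1e-9
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy corpus: one feature per token; label 1 where a token starts an entity.
xs = [0.2, 0.9, 0.1, 0.8]
y_true = [0, 1, 0, 1]

w, b, lr = 0.0, 0.0, 1.0
error_threshold = 0.2          # stands in for the training completion condition
for step in range(10000):
    y_pred = [sigmoid(w * x + b) for x in xs]
    err = bce(y_true, y_pred)
    if err < error_threshold:  # condition met: training is finished
        break
    # Otherwise adjust the model parameters and train again.
    grad_w = sum((p - t) * x for p, t, x in zip(y_pred, y_true, xs)) / len(xs)
    grad_b = sum(p - t for p, t in zip(y_pred, y_true)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"stopped at step {step} with error {err:.3f}")
```

The structure mirrors the claim: predict, compute the result error value, test it against the completion condition, and only adjust parameters when the condition is not met.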
Optionally, the entity prediction probability matrix of each reference resolution text includes: a first prediction probability vector representing the predicted entity start positions in the reference resolution text, and a second prediction probability vector representing the predicted entity end positions;
and determining the distribution positions in the entity prediction probability matrix of each reference resolution text that meet the preset first condition, and taking the components at the corresponding positions as entities, includes:
comparing the first prediction probability vector with a preset first threshold, and determining the positions whose probability values exceed the first threshold, to obtain the predicted entity start position distribution of each reference resolution text;
comparing the second prediction probability vector with a preset second threshold, and determining the positions whose probability values exceed the second threshold, to obtain the predicted entity end position distribution of each reference resolution text;
and obtaining the entity distribution position intervals of each reference resolution text from the predicted start and end position distributions, and taking the components within the corresponding intervals in each reference resolution text as entities.
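A sketch of this interval decoding (the threshold values and the pairing of each start with the nearest qualifying end are assumptions; the patent only requires that positions exceeding the thresholds define the entity intervals):

```python
def extract_entities(text, start_probs, end_probs,
                     start_threshold=0.5, end_threshold=0.5):
    """Decode entity spans from per-character start/end probability
    vectors: each start position above `start_threshold` is paired with
    the nearest end position at or after it above `end_threshold`."""
    starts = [i for i, p in enumerate(start_probs) if p > start_threshold]
    ends = [i for i, p in enumerate(end_probs) if p > end_threshold]
    entities = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            e = min(candidates)          # nearest qualifying end position
            entities.append(text[s:e + 1])
    return entities

# Toy example with two entities in one text.
sample = "Paris and Berlin"
start_probs = [0.0] * len(sample)
start_probs[0], start_probs[10] = 0.9, 0.8   # "P" of Paris, "B" of Berlin
end_probs = [0.0] * len(sample)
end_probs[4], end_probs[15] = 0.9, 0.7       # "s" of Paris, "n" of Berlin
print(extract_entities(sample, start_probs, end_probs))
# ['Paris', 'Berlin']
```

Because an end may coincide with its start (e >= s), single-character entities fall out of the same decoding, which matches the advantage claimed later in the description.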
Optionally, the entity prediction probability matrix of the training corpus includes: a third prediction probability vector representing the predicted entity start positions in the training corpus, and a fourth prediction probability vector representing the predicted entity end positions; and the entity true probability matrix of the training corpus includes: a first real probability vector representing the true entity start positions in the training corpus, and a second real probability vector representing the true entity end positions;
and performing the error calculation based on the entity prediction probability matrix and the entity true probability matrix of the training corpus to obtain a result error value includes:
error calculation is performed by using the following loss function to obtain a result error value:
Figure BDA0002537556940000031
wherein, ysiThe ith probability value in the first real probability vector; y iseiThe ith probability value in the second real probability vector;
Figure BDA0002537556940000032
is the ith probability value in the third prediction probability vector;
Figure BDA0002537556940000033
is the ith probability value in the fourth prediction probability vector; i is a natural number.
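A plausible implementation of a loss over the four probability vectors defined above is the summed binary cross-entropy (an assumption, since the patent's exact formula is not legible in this text; the function name is illustrative):

```python
import math

def result_error(y_s, y_e, y_s_hat, y_e_hat, eps=1e-9):
    """Sum of binary cross-entropies between the real and predicted
    start-position vectors (y_s vs y_s_hat) and the real and predicted
    end-position vectors (y_e vs y_e_hat)."""
    def bce(true_vec, pred_vec):
        return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                    for t, p in zip(true_vec, pred_vec))
    return bce(y_s, y_s_hat) + bce(y_e, y_e_hat)

# Near-perfect predictions give a near-zero error; poor ones a large one.
print(result_error([1, 0], [0, 1], [0.99, 0.01], [0.01, 0.99]))  # ≈ 0.04
print(result_error([1, 0], [0, 1], [0.1, 0.9], [0.9, 0.1]))      # ≈ 9.2
```

The `eps` guard simply avoids log(0) when a predicted probability saturates at exactly 0 or 1.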
An embodiment of the present specification further provides an information extraction method, including:
obtaining a semantically processed text using the text processing method of any one of the above embodiments;
and performing information recognition processing on the semantically processed text, and extracting the corresponding components from it based on the information recognition result.
The embodiment of the specification also provides a text processing device comprising a first memory and a first processor, wherein the first memory is adapted to store one or more computer instructions which, when executed by the first processor, perform the steps of the text processing method of any one of the above embodiments.
The embodiment of the present specification further provides an information extraction system comprising a second memory and a second processor, wherein the second memory is adapted to store one or more computer instructions which, when executed by the second processor, perform the following step:
performing information recognition processing on the semantically processed text obtained by the text processing device, and acquiring the corresponding components from it based on the information recognition result.
The present specification further provides a computer-readable storage medium storing computer instructions which, when executed, perform the steps of the text processing method or the information extraction method of any one of the foregoing embodiments.
With the text processing scheme of the embodiments of this specification, after reference resolution is performed on the initial text, entity recognition can be performed on each reference resolution text; the reference resolution sentences of the same order in each reference resolution text are acquired and screened based on their entity recognition results, and the screened sentences are then merged to obtain a semantically processed text. By identifying the reference relationships in the text, the components that share a reference relationship can be associated with one another, which facilitates the subsequent reference resolution, normalizes the forms of expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the sentences that contain useful information, which effectively reduces the number of reference resolution sentences, makes document-level texts easier to process, and improves the quality of the text information. At the same time, the variety of reference resolution sentences contained in the resulting semantically processed text is increased, enriching the semantic information in the text, so that clearer semantic information can be obtained during text information extraction.
Further, after the components of a reference relationship are acquired from the initial text, the main component of the reference relationship can be selected, and some or all of the components sharing that reference relationship can be acquired from the initial text and replaced with the main component, yielding a processed reference resolution text. By selecting a main component and substituting it for the other components of the same reference relationship, the forms of expression of those components are unified throughout the text, which facilitates the subsequent entity recognition.
Further, an entity prediction probability matrix for each sentence can be obtained from a preset entity recognition model; by comparing the first prediction probability vector with a preset first threshold and the second prediction probability vector with a preset second threshold, the entity distribution position intervals of each sentence can be obtained, and the components within the corresponding intervals are taken as entities. Because the start and end positions of an entity are judged separately, single-character entities can be recognized; and because thresholds are used as the judgment condition, each entity can be recovered from its position interval even in the complex case where a text contains several entities, which increases the accuracy of entity acquisition.
With the information extraction scheme of the embodiments of this specification, after the initial text has been turned into a semantically processed text by the text processing scheme, information recognition can be performed on the semantically processed text and the corresponding components acquired from it. Because the semantic information of the text used for extraction has been optimized, clearer semantic information can be obtained from the semantically processed text, improving the reliability of the information extraction task and therefore the accuracy of the information extraction result.
Drawings
To illustrate the technical solutions of the embodiments of this specification more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below are obviously only some embodiments of this specification, and a person skilled in the art could obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a text processing method in an embodiment of the present specification;
FIG. 2 is a flowchart of a method for obtaining a reference resolution text entity in an embodiment of the present specification;
FIG. 3 is a flow chart of a method for training an entity recognition model in an embodiment of the present disclosure;
fig. 4 is a flowchart of an information extraction method in an embodiment of the present specification.
Detailed Description
In the era of exploding internet information, reasonable screening of internet information is required in order to quickly acquire the needed information from the mass of information on the internet, and Information Extraction (IE) technology arose to meet this need. Information extraction structures unstructured text so that specific information, such as entities (Entity), relationships (Relation), and events (Event), can be extracted from it.
During information extraction, the semantic relationships between events and entities are often scattered across different positions in a text, and an entity may have several different forms of expression.
For example, since a text may consist of several paragraphs, an entity in a semantic relationship may appear in the text in referring form: each paragraph may contain references (anaphora) that point to the entity, and some entities may even be omitted for the sake of contextual continuity and fluency. As a result, not all of the semantic information in the text can be recognized during information extraction.
In view of these problems, embodiments of this specification provide a text processing scheme: after reference resolution is performed on the initial text, entity recognition can be performed on each reference resolution text; the reference resolution sentences of the same order in each reference resolution text are then acquired and screened based on their entity recognition results, and the screened sentences are merged to obtain a semantically processed text with clear semantics.
To make the embodiments of this specification easier to understand and implement, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.
It should be understood that the embodiments described here are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art without creative effort on the basis of these embodiments fall within the scope of protection of this specification.
Referring to a flowchart of a text processing method shown in fig. 1, in an embodiment of this specification, the method may specifically include the following steps:
s11, identifying the reference relationship existing in the initial text.
In a specific implementation, the initial text may be a sentence-level text or a document-level text (also called a chapter-level text). By performing reference resolution on the initial text with an existing coreference resolution tool or algorithm and identifying the components that share a reference relationship, the reference relationships present in the initial text can be determined.
The coreference resolution tool may be a natural language processing tool such as Stanford CoreNLP (a natural language processing toolkit developed by Stanford University), which can perform coreference resolution on the original text through suitable code. The identified components sharing a reference relationship may include at least one of: nouns, pronouns, and zero pronouns.
S12, performing reference resolution processing on the initial text based on the identified reference relationships to obtain corresponding reference resolution texts.
The reference resolution processing may include anaphora resolution and cataphora resolution. Through reference resolution, the nouns, pronouns, and zero pronouns that share a reference relationship in the initial text can be resolved, so that components with the same reference relationship are expressed in a uniform way.
S13, performing entity recognition processing on each reference resolution text.
In a specific implementation, entities may be classified into different types according to the actual situation. For example, under grammatical rules an entity may be a subject, predicate, object, and so on, while under part-of-speech rules an entity may be a noun, verb, preposition, and so on. Before entity recognition is performed, the type of entity to be recognized may be set; for example, the model may be set to recognize subject-type entities, or to recognize noun-type entities.
S14, acquiring the reference resolution sentences of the same order in each reference resolution text, and screening those same-order sentences based on their entity recognition results.
In a specific implementation, reference resolution may produce several reference resolution texts, multiplying the number of reference resolution sentences; in particular, after a document-level initial text undergoes reference resolution, the number of reference resolution sentences grows sharply. The sentences of the same order in the various reference resolution texts can then be acquired by vertical alignment, and the same-order reference resolution sentences screened.
For example, the initial text may contain three initial sentences. After reference resolution, the first reference resolution text A also contains three reference resolution sentences a1, a2, a3, i.e. A = {a1, a2, a3}, and the second reference resolution text B contains three reference resolution sentences b1, b2, b3, i.e. B = {b1, b2, b3}. Vertically aligning the reference resolution sentences of the two texts, that is, aligning a1 with b1, a2 with b2, and a3 with b3, establishes the association between same-order sentences across the different reference resolution texts. When, say, the first-order reference resolution sentences need to be deleted, a1 and b1 of the first order can be acquired together, and so on, so that the same-order sentences in the reference resolution texts can be acquired conveniently and quickly.
The sentences that do not meet the condition can then be screened out based on the entity recognition results of the same-order reference resolution sentences, reducing the amount of data.
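The vertical alignment, screening, and merging of steps S14 and S15 can be sketched with zip() (the screening predicate used here, keeping only sentences whose entity recognition result is non-empty, is one plausible choice; the patent leaves the exact condition open, and all names are illustrative):

```python
def screen_and_merge(resolution_texts, entity_results, keep):
    """`resolution_texts`: list of reference resolution texts, each a
    list of sentences in the same order. `entity_results`: a parallel
    structure holding the entities recognized in each sentence.
    zip() aligns the same-order sentences vertically; the predicate
    `keep` decides, per aligned sentence, whether it survives."""
    merged = []
    for sent_group, ent_group in zip(zip(*resolution_texts),
                                     zip(*entity_results)):
        for sentence, entities in zip(sent_group, ent_group):
            if keep(sentence, entities):
                merged.append(sentence)
    return " ".join(merged)

# Two resolution texts A and B, each with two same-order sentences.
A = ["a1 mentions Alice.", "a2 has no entity."]
B = ["b1 mentions Bob.", "b2 has no entity."]
ents = [[["Alice"], []], [["Bob"], []]]
semantic_text = screen_and_merge([A, B], ents, keep=lambda s, e: len(e) > 0)
print(semantic_text)  # a1 mentions Alice. b1 mentions Bob.
```

Only the first-order sentences survive the screening here, and the merge concatenates them into the semantically processed text.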
S15, merging the screened reference resolution sentences to obtain a semantically processed text.
The number of screened reference resolution sentences is not less than the number of sentences in the initial text.
In this scheme, by identifying the reference relationships in the text, the components that share a reference relationship can be associated with one another, which facilitates the subsequent reference resolution, normalizes the forms of expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the sentences that contain useful information, which effectively reduces the number of reference resolution sentences, makes document-level texts easier to process, and improves the quality of the text information. At the same time, the variety of reference resolution sentences contained in the resulting semantically processed text is increased, enriching the semantic information in the text, so that clearer semantic information can be obtained during text information extraction.
In a specific implementation, if several reference relationships are identified, or a reference relationship has several corresponding components, reference resolution can be performed selectively on the initial text, yielding at least one reference resolution text.
For example, when several reference relationships are identified, the components of some of the relationships may be acquired from the initial text and resolved to obtain one reference resolution text, or the components of all of the relationships may be acquired and resolved to obtain another reference resolution text.
For another example, when a reference relationship has several corresponding components, the components may be acquired from the initial text, the main component selected from among them, and some or all of the components sharing the reference relationship with the main component acquired from the initial text and replaced with the main component, yielding a corresponding reference resolution text.
According to a preset selection rule, at least one of the acquired components of a reference relationship can be selected as the main component, and a reference resolution text obtained for each main component. If the initial text contains at least two components sharing the reference relationship with the main component, at least one of them can be acquired from the initial text and replaced with the main component to obtain the corresponding reference resolution text.
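The longest-component selection rule described here can be sketched as follows (the function name is illustrative; a different rule, such as preferring a proper noun, would swap out the key function):

```python
def choose_main_component(mentions, rule="longest"):
    """Pick the main component of one reference relationship.
    `mentions` is the list of component strings sharing the relation."""
    if rule == "longest":
        return max(mentions, key=len)  # longest component wins
    raise ValueError(f"unknown selection rule: {rule}")

# The chain {"Xiaoming", "he", "he"}: "Xiaoming" is the longest component,
# so it becomes the main component that replaces the two pronouns.
print(choose_main_component(["Xiaoming", "he", "he"]))  # Xiaoming
```

Keeping the rule behind a single parameter is what lets the scheme change the selection rule "according to the actual situation", as the description notes below.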
By adopting the scheme, the main components are selected and replaced for other components with the same reference relationship, so that the expression modes of the components with the same reference relationship in the text can be unified, and the subsequent entity identification processing is facilitated.
In one embodiment of the present specification, the initial text may be: {Xiaoming's schoolbag is a birthday gift that his dad gave to him.} Here the braces "{ }" merely delimit the content of the example and are not part of the initial text; a person skilled in the art may use other symbols that are not easily confused to delimit the content, and the braces below are used in the same way.
Through reference resolution processing, the components "Xiaoming", "his", and "him" in the initial text are found to share the same reference relationship, so one reference relationship exists in the initial text. On that basis, if the preset selection rule is to select the longest component of the relationship as the main component, "Xiaoming" is taken as the main component, and at least one of "his" and "him" in the initial text can be selected for reference resolution processing, yielding a corresponding reference resolution text.
As one alternative example, replacing "his" in the initial text yields the first reference resolution text: {Xiaoming's schoolbag is a birthday gift that Xiaoming's dad gave to him.}
As another alternative example, replacing "him" yields the second reference resolution text: {Xiaoming's schoolbag is a birthday gift that his dad gave to Xiaoming.}
As yet another alternative example, replacing both "his" and "him" yields the third reference resolution text: {Xiaoming's schoolbag is a birthday gift that Xiaoming's dad gave to Xiaoming.}
In another embodiment of the present specification, the initial text may be: {Xiaoming is reading The Weakness of Human Nature; this book is too profound for him to understand.}
Through reference resolution processing, it can be found that the components "Xiaoming" and "him" in the initial text share the same reference relationship, reference relationship 1, and that the components "The Weakness of Human Nature" and "this book" share the same reference relationship, reference relationship 2.
At this point, the components of reference relationship 1 alone may be resolved, giving the reference resolution text of relationship 1: {Xiaoming is reading The Weakness of Human Nature; this book is too profound for Xiaoming to understand.} Or the components of reference relationship 2 alone may be resolved, giving the reference resolution text of relationship 2: {Xiaoming is reading The Weakness of Human Nature; The Weakness of Human Nature is too profound for him to understand.} Or the components of both relationships may be resolved, giving a reference resolution text common to relationships 1 and 2: {Xiaoming is reading The Weakness of Human Nature; The Weakness of Human Nature is too profound for Xiaoming to understand.}
It should be noted that the above embodiments are for illustration only; in practical applications, the various alternative examples can be combined and cross-referenced where they do not conflict, extending the possible embodiments, all of which can be considered embodiments disclosed in the present specification. For example, after the first reference resolution text is obtained by replacing the first "he" in the initial text with "Xiaoming", a second reference resolution text can be obtained by replacing the second "he" in the initial text with "Xiaoming", resulting in two reference resolution texts.
It can be understood that, in practical applications, a text may contain more complex sentence structures, so there may be multiple reference relationships, each with several corresponding components; selecting among the multiple reference relationships and their components yields many permutation-and-combination selection schemes. Moreover, the preset selection rule can change with the actual situation. For example, if the above component "The Weakness of Human Nature" is changed to the shorter book title "Piao" (Gone with the Wind), the component "Piao" is shorter than the component "this book": if the preset selection rule selects the longest component within a reference relationship as the main component, "this book" becomes the main component; if the rule instead selects the proper noun within a reference relationship, "Piao" becomes the main component. The embodiments of the present specification do not limit the text content or the selection rule for main components.
In a specific implementation, to facilitate locating the positions of components that share a reference relationship in a text, the initial text may be preprocessed: according to a preset set of ending symbols, the initial text is split into at least one initial sentence, so that the preprocessed initial text can be regarded as a set containing at least one initial sentence. By analogy, a reference resolution text obtained through reference resolution processing can be regarded as a set containing at least one reference resolution sentence.
The preset ending symbol set may be set according to the language type of the initial text. For example, if the language type of the initial text is Chinese, the ending symbol set may include the Chinese period, semicolon, question mark, exclamation mark, and the like; if the language type of the initial text is English, the ending symbol set may include the English period, semicolon, question mark, exclamation mark, and the like.
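As an illustrative sketch (not part of the patent), splitting an initial text into initial sentences by a preset ending symbol set could look like the following; the symbol set and the helper name are assumptions:

```python
import re

# Hypothetical preset ending symbol set covering Chinese and English terminators.
ENDING_SYMBOLS = "。；？！.;?!"

def split_sentences(text: str) -> list[str]:
    """Split an initial text into initial sentences, keeping each ending
    symbol attached to its sentence. Text after the last ending symbol
    is ignored in this sketch."""
    cls = re.escape(ENDING_SYMBOLS)
    pattern = f"[^{cls}]*[{cls}]"
    return [s.strip() for s in re.findall(pattern, text) if s.strip()]

# The preprocessed initial text can then be treated as a set of sentences.
sentences = split_sentences("张三周一提名了李四。他选择她；因为这位同事经验丰富！")  # three initial sentences
```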
In a specific implementation, each reference resolution text may be input into a preset entity recognition model to obtain its entity prediction probability matrix; the distribution positions in each entity prediction probability matrix that meet a preset first condition are then determined, and the components at the corresponding distribution positions in each reference resolution text are taken as entities, thereby obtaining the entity recognition result.
In actual implementation, a reference resolution text can be regarded as a set containing at least one reference resolution sentence. After the reference resolution processing, the dependency between the reference resolution sentences is reduced, so each reference resolution sentence can be input into the preset entity recognition model separately for entity recognition, improving entity recognition efficiency.
In an embodiment of the present specification, the entity recognition model may include an input layer, an encoding layer, a fully connected layer, a decoding layer, and an output layer, wherein:
1) The input layer may receive an input reference resolution sentence C = {c1, c2, …, cm}, the reference resolution sentence C containing m tokens, where a token may be a punctuation mark or a word segmentation unit; word segmentation units and punctuation marks are the minimum sentence composition units of the corresponding language type.
Here m may be a hyper-parameter of the preset entity recognition model that limits the input sentence length. For example, if the length of a reference resolution sentence is not greater than 128, it may be input into the entity recognition model directly; if its length is greater than 128, it may be divided into several reference resolution sentence segments each of length not greater than 128, which are then input into the entity recognition model. Optionally, reference resolution sentences shorter than 128 may be padded so that the input lengths stay consistent.
In addition, the input layer may further map each token in the reference resolution sentence C to a value the entity recognition model can process according to a preset dictionary, obtaining the dictionary-mapped reference resolution sentence CID = {cid1, cid2, …, cidm}, where cid1, cid2, …, cidm are the index values of c1, c2, …, cm in the dictionary, respectively. The dictionary-mapped reference resolution sentence CID is then transmitted to the encoding layer.
2) The encoding layer encodes the dictionary-mapped reference resolution sentence CID through a preset language submodel to obtain a coding feature matrix [CE] = [CE1, CE2, …, CEm], where CE1, CE2, …, CEm are the coding feature vectors of cid1, cid2, …, cidm, respectively, and the dimension of each coding feature vector is determined by the parameters of the language submodel. The coding feature matrix [CE] is transmitted to the fully connected layer.
The encoding process may use an embedding algorithm to vectorize the index values cid1, cid2, …, cidm in the dictionary-mapped reference resolution sentence CID.
3) The fully connected layer may perform dimension reduction on the coding feature matrix [CE] to obtain a reduced coding feature matrix [CE'], which is transmitted to the decoding layer. The fully connected layer may implement the dimension reduction using an MLP (Multi-Layer Perceptron).
4) The decoding layer may perform nonlinear mapping on the reduced coding feature matrix [CE'] to obtain an entity prediction probability matrix [CY]; the nonlinear mapping may use an activation function such as Sigmoid. The decoding layer then determines, according to a preset first condition, the distribution positions in the entity prediction probability matrix of each reference resolution text that meet the first condition, and takes the components at the corresponding distribution positions in each reference resolution text as entities.
5) The output layer outputs the entity recognition result.
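To make the five-layer flow concrete, here is a toy, pure-Python sketch (not the patent's implementation): random stand-in weights replace the pre-trained language submodel, and the vocabulary size and dimensions are invented for illustration:

```python
import math
import random

random.seed(0)

# Invented toy dimensions: vocabulary of 50 tokens, 8-dim coding features,
# reduced by the fully connected layer to 2 logits per token (start / end).
VOCAB, HIDDEN = 50, 8

# Lookup table stands in for the language submodel's encoding layer.
embedding = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
# Fully connected (dimension reduction) weights.
w_fc = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(HIDDEN)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(cid):
    """Input layer CID -> encoding layer [CE] -> FC layer [CE'] -> decoding layer [CY]."""
    y_start, y_end = [], []
    for idx in cid:
        ce = embedding[idx]                 # coding feature vector CE_i
        logits = [sum(c * w for c, w in zip(ce, col)) for col in zip(*w_fc)]
        y_start.append(sigmoid(logits[0]))  # start-position probability
        y_end.append(sigmoid(logits[1]))    # end-position probability
    return y_start, y_end

y_s, y_e = forward([3, 17, 42, 7])          # a dictionary-mapped sentence CID
```

The output is one start probability and one end probability per token, matching the first/second prediction probability vectors described below.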
In a specific implementation, the entity prediction probability matrix of each reference resolution text may include: a first prediction probability vector representing the entity prediction start positions in that reference resolution text, and a second prediction probability vector representing the entity prediction end positions in that reference resolution text.
As shown in fig. 2, determining the distribution positions that meet the preset first condition in the entity prediction probability matrix of each reference resolution text, and taking the components at the corresponding distribution positions in each reference resolution text as entities, may include:
S21, comparing the first prediction probability vector with a preset first threshold, and determining the distribution positions where the probability values in the first prediction probability vector are greater than the first threshold, to obtain entity prediction start position distribution information of each reference resolution text.
S22, comparing the second prediction probability vector with a preset second threshold, and determining the distribution positions where the probability values in the second prediction probability vector are greater than the second threshold, to obtain entity prediction end position distribution information of each reference resolution text.
S23, obtaining the entity distribution position intervals of each reference resolution text based on its entity prediction start position distribution information and entity prediction end position distribution information, and taking the components within the corresponding distribution position intervals in each reference resolution text as entities.
The first threshold and the second threshold may be set according to an actual situation, and the first threshold and the second threshold may be the same or different.
For example, let the entity prediction probability matrix be [CY] = [ŷs; ŷe], where ŷs = (ŷs1, ŷs2, …, ŷsm) is the first prediction probability vector and ŷe = (ŷe1, ŷe2, …, ŷem) is the second prediction probability vector. With the first threshold us and the second threshold ue, the inequalities ŷsi > us and ŷei > ue are solved to find the distribution positions i whose probability values satisfy them, yielding the entity prediction start position distribution information and the entity prediction end position distribution information, from which the entity distribution position intervals are finally determined.
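Steps S21–S23 can be sketched as follows; note that the pairing policy (each start position paired with the nearest end position at or after it) is an assumption for illustration, since the text only specifies the thresholding:

```python
def extract_spans(y_start, y_end, u_s=0.5, u_e=0.5):
    """S21/S22: threshold the start/end probability vectors; S23: pair each
    start with the nearest end at or after it to form entity position intervals."""
    starts = [i for i, p in enumerate(y_start) if p > u_s]   # start distribution info
    ends = [i for i, p in enumerate(y_end) if p > u_e]       # end distribution info
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, candidates[0]))  # s == e gives a single-character entity
    return spans

# Two entities: positions 0-1, and a single-character entity at position 3.
y_s = [0.9, 0.1, 0.2, 0.8, 0.1]
y_e = [0.2, 0.9, 0.1, 0.7, 0.2]
print(extract_spans(y_s, y_e))  # → [(0, 1), (3, 3)]
```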
With this scheme, the start position and the end position of an entity are judged separately, so entities consisting of a single character can also be obtained; and since thresholds are used as the judgment condition, each entity can be obtained through its distribution position interval even in the complex case where the text contains multiple entities, increasing the accuracy of entity extraction.
In a specific implementation, as shown in fig. 3, before the entity recognition processing is performed on each reference resolution text, the entity recognition model may be trained, which specifically includes:
S31, inputting a preset training corpus and the entity true probability matrix of the training corpus into the entity recognition model for training, to obtain the entity prediction probability matrix of the training corpus.
S32, performing error calculation based on the entity prediction probability matrix of the training corpus and the entity true probability matrix of the training corpus, to obtain a result error value.
S33, if the result error value meets a preset training completion condition, the entity recognition model completes training; otherwise, the parameters of the entity recognition model are adjusted, and the training corpus and its entity true probability matrix are input into the parameter-adjusted entity recognition model for training, until the entity recognition model completes training.
In a specific implementation, the result error value may be calculated by a loss function. When the result error value is greater than the result error threshold, the preset training completion condition is not met, the entity recognition model has not finished training, and its parameters may be adjusted; when the result error value is smaller than the result error threshold, the training completion condition is met and the entity recognition model completes training.
Optionally, an error-agreement count condition may be added to avoid judging that the entity recognition model has finished training from a single erroneous result error value. For example, when the result error value is less than the result error threshold, the error-agreement count is incremented by one and compared with an error count threshold; if the count is greater than or equal to the error count threshold, the entity recognition model completes training, otherwise its parameters may still be adjusted.
According to the loss function, the parameters of the entity recognition model can be adjusted by gradient descent or back propagation, and the training data and its entity true probability matrix are input into the adjusted entity recognition model again, until an entity recognition model meeting the training completion condition is obtained.
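Steps S31–S33 can be sketched as a toy training loop; the binary cross-entropy error function, the one-dimensional features, and the hyper-parameter values are all assumptions for illustration, not the patent's settings:

```python
import math
import random

random.seed(1)
sigmoid = lambda z: 1 / (1 + math.exp(-z))

def bce(y_true, y_pred, eps=1e-9):
    """Binary cross-entropy between a true and a predicted probability vector."""
    return -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy corpus: one feature per token; the "true probability" is 1 when the
# feature is positive (a stand-in for a real annotated corpus).
xs = [random.uniform(-1, 1) for _ in range(32)]
ys = [1.0 if x > 0 else 0.0 for x in xs]

w, b, LR, ERROR_THRESHOLD = 0.0, 0.0, 0.5, 0.3
loss = float("inf")
for step in range(2000):
    preds = [sigmoid(w * x + b) for x in xs]   # S31: forward pass
    loss = bce(ys, preds)                      # S32: result error value
    if loss < ERROR_THRESHOLD:                 # S33: training completion condition
        break
    grad_w = sum((p - t) * x for p, t, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - t for p, t in zip(preds, ys)) / len(xs)
    w -= LR * grad_w                           # gradient descent parameter update
    b -= LR * grad_b
```

When the result error value drops below the threshold, the loop exits, mirroring the completion condition of S33.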
It can be understood that the entity types the trained entity recognition model can recognize are determined by the training corpus and the model parameters. For example, if a corpus containing subject information is input and the model parameters are adjusted according to the recognition results and the actual subject data of the corpus, the finally trained entity recognition model can recognize entities of the subject type. Different training corpora can be selected and different model parameters set according to actual requirements, which is not limited by the embodiments of the present specification.
In a specific implementation, the entity prediction probability matrix of the training corpus may include: a third prediction probability vector representing the entity prediction start positions in the training corpus, and a fourth prediction probability vector representing the entity prediction end positions in the training corpus;
the entity true probability matrix of the training corpus may include: a first true probability vector representing the true entity start positions in the training corpus, and a second true probability vector representing the true entity end positions in the training corpus;
thus, performing the error calculation based on the entity prediction probability matrix of the training corpus and the entity true probability matrix of the training corpus to obtain a result error value may include:
error calculation is performed by using the following loss function to obtain a result error value:
loss = −Σi [ ysi·log(ŷsi) + (1 − ysi)·log(1 − ŷsi) ] − Σi [ yei·log(ŷei) + (1 − yei)·log(1 − ŷei) ]
where ysi is the ith probability value of the first true probability vector ys; yei is the ith probability value of the second true probability vector ye; ŷsi is the ith probability value of the third prediction probability vector ŷs; ŷei is the ith probability value of the fourth prediction probability vector ŷe; and i is a natural number.
The loss function loss can be understood as a formula for calculating the distance between the entity true probability matrix and the entity prediction probability matrix: if the entity recognition model outputs ŷs, ŷe are very close to the preset ys, ye, the result error value calculated from loss is very small, and conversely it is very large.
In a specific implementation, the language submodel may be pre-trained such that the pre-trained language submodel is able to capture context information in depth.
In a specific implementation, screening the reference resolution sentences in the same order based on the entity recognition results of those sentences may include at least one of the following:
1) Acquiring the reference resolution sentences with the same entity recognition result from the reference resolution sentences in the same order, and deleting among them based on a preset second condition.
In particular implementations, the second condition may be set based on the entity prediction probability matrix.
For example, suppose the reference resolution sentences that are in the same order and have the same entity recognition result across the reference resolution texts are D11, D12 and D13, where D11 is the first-order reference resolution sentence in the first reference resolution text, D12 is the first-order reference resolution sentence in the second reference resolution text, and D13 is the first-order reference resolution sentence in the third reference resolution text.
From the entity prediction probability matrices corresponding to D11, D12 and D13, the entity prediction probability values of D11, D12 and D13 can be obtained and taken as their confidence; the higher the confidence, the more reliable the reference resolution sentence. Therefore, the reference resolution sentence with the highest entity prediction probability value can be retained and the others deleted. If the entity prediction probability values are the same, one of the reference resolution sentences can be kept, in order or at random, and the others deleted.
The entity prediction probability value can be calculated from the probability values in the entity prediction probability matrix that meet the first and second thresholds, or from all probability values in the entity prediction probability matrix; the operation may be a summation, an average, a weighted average, or the like, which is not limited by the present specification.
Since these reference resolution sentences are in the same order and have the same entity recognition result, keeping only one of them ensures that the number of entities does not change while reducing the amount of data.
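A small sketch of this screening rule, assuming the entity prediction probability value is the average of the probability values exceeding the thresholds (one of the operations the text permits); the sentence names are hypothetical:

```python
def sentence_confidence(y_start, y_end, u_s=0.5, u_e=0.5):
    """Entity prediction probability value of a reference resolution sentence:
    here, the average of the probability values exceeding the two thresholds."""
    hits = [p for p in y_start if p > u_s] + [p for p in y_end if p > u_e]
    return sum(hits) / len(hits) if hits else 0.0

# D11 and D12 have the same entity recognition result; keep the sentence
# with the higher confidence and delete the other.
d11 = sentence_confidence([0.9, 0.1], [0.1, 0.8])  # (0.9 + 0.8) / 2
d12 = sentence_confidence([0.7, 0.1], [0.1, 0.6])  # (0.7 + 0.6) / 2
kept = "D11" if d11 >= d12 else "D12"
```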
2) Acquiring the reference resolution sentences with different entity recognition results from the reference resolution sentences in the same order, permuting and combining them, and deleting among the permuted-and-combined reference resolution sentences based on a preset third condition.
In a specific implementation, because the positions of the replaced main components differ between texts, reference resolution sentences in the same order may not yield the same entity recognition result. In that case, to delete redundant sentence combinations and reduce the data amount while ensuring that the number of entities does not change, the combination of reference resolution sentences that covers the total number of entities with the smallest number of sentences needs to be kept.
For example, suppose the reference resolution sentences that are in the same order but have different entity recognition results across the reference resolution texts are F11, F12 and F13, where F11, the first-order reference resolution sentence in the first reference resolution text, identifies the entity entity1; F12, the first-order reference resolution sentence in the second reference resolution text, identifies the two entities entity1 and entity2; and F13, the first-order reference resolution sentence in the third reference resolution text, identifies the entity entity3. Permuting and combining F11, F12 and F13 gives the subset of permuted-and-combined reference resolution sentences {F11, F12, F13, F11+F12, F12+F13, F11+F13, F11+F12+F13}, with the total number of entities being three: entity1, entity2 and entity3.
According to the entity recognition results, it can be determined that the permuted-and-combined reference resolution sentences F11, F12, F13, F11+F12 and F11+F13 do not each include all three entities, while F12+F13 and F11+F12+F13 each include all three entities. Comparing the number of sentences, F12+F13 contains fewer sentences than F11+F12+F13, so the permuted-and-combined reference resolution sentence F12+F13 can be retained and the other combinations deleted.
With this scheme, sentences with the same entity recognition result can be deduplicated, or sentences with different entity recognition results can be permuted and combined and the redundant combinations deleted, ensuring that the remaining sentence set retains complete entity information.
In a specific implementation, after acquiring the reference resolution sentences with different entity recognition results from the reference resolution sentences in the same order, and before permuting and combining them, it may be determined whether any single one of them already identifies a number of entities equal to the total number of entities; if so, that reference resolution sentence may be retained and the others deleted, reducing the processing flow and processing amount and improving sentence screening efficiency.
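The screening of permuted-and-combined sentences can be sketched as a smallest-cover search; the data structure (sentence name mapped to its set of recognized entities) is an assumption for illustration:

```python
from itertools import combinations

def select_min_combination(sentence_entities):
    """Among all permutations/combinations of sentences with differing entity
    recognition results, keep the smallest combination covering the total
    entity set (the screening rule of the preset third condition)."""
    total = set().union(*sentence_entities.values())
    names = list(sentence_entities)
    for size in range(1, len(names) + 1):        # smallest combinations first
        for combo in combinations(names, size):
            covered = set().union(*(sentence_entities[n] for n in combo))
            if covered == total:
                return combo
    return tuple(names)

# F11 identifies entity1; F12 identifies entity1+entity2; F13 identifies entity3.
result = select_min_combination({
    "F11": {"entity1"},
    "F12": {"entity1", "entity2"},
    "F13": {"entity3"},
})
print(result)  # → ('F12', 'F13')
```

A single sentence that already covers all entities is returned at size 1, matching the shortcut described above.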
In practical applications, the semantic processing text obtained by the text processing method according to any of the above embodiments can be used in the information extraction field, and the following detailed description is given by using embodiments and accompanying drawings.
Referring to a flowchart of an information extraction method shown in fig. 4, in an embodiment of this specification, the method may specifically include the following steps:
s41, identifying the reference relationship existing in the initial text.
And S42, performing reference resolution processing on the initial text based on the identified reference relationship to obtain a corresponding reference resolution text.
And S43, respectively carrying out entity recognition processing on each reference resolution text.
S44, acquiring the reference resolution sentences in the same order in each reference resolution text, and screening the reference resolution sentences in the same order based on the entity recognition results of the reference resolution sentences in the same order.
And S45, merging the filtered reference resolution sentences to obtain a semantic processing text.
And S46, performing information identification processing on the semantic processing text, and extracting corresponding components in the semantic processing text based on the information identification result.
Wherein the information identification process may include: label classification processing, sequence labeling processing and the like.
Therefore, with the above text processing method, identifying the reference relationships in a text establishes associations between the components that share a reference relationship, which facilitates the subsequent reference resolution processing, normalizes the expression in the text, and reduces the dependency between sentences caused by references. Screening the reference resolution sentences of each order according to the entity recognition results selects the reference resolution sentences that contain useful information, effectively reducing the number of reference resolution sentences and improving the quality of the text information. The resulting semantic processing text contains a variety of reference resolution sentences, which increases the diversity of text sentences and enriches the semantic information in the text, so that clearer semantic information is available when extracting text information, improving the accuracy of the information extraction result.
In a specific implementation, the text processing method can process document-level text and normalize the expression in the text, reducing the dependency between sentences in document-level text, increasing the diversity of text sentences, and enriching the semantic information in the text, thereby alleviating the problem that entities cannot be aligned during information extraction and enabling document-level information extraction tasks.
The relation extraction task is one specific application of the information extraction task, and currently there are two main relation extraction approaches. The first is end-to-end joint entity recognition and relation recognition: a preset relation extraction model performs a single computation and identifies all SPO (Subject, Predicate, Object) triple relations in the text. A relation extraction model implementing this approach has a very high output dimensionality and extremely sparse data, and requires a large amount of data to support training; yet the training process is difficult to converge, so a well-trained relation extraction model is difficult to obtain.
The second approach is step-by-step extraction: first find all entities in the text, then randomly select two entities to obtain an entity pair (also called a relation element pair in the relation extraction task), and judge the relation between the two entities from the entity pair. Although this avoids the problems of the first approach, extracting a large number of meaningless entity pairs produces much wrong relation information or misses some relation information. Moreover, depending on the semantic relations in the sentences, an entity may be part of the subject in some sentences and part of the object in others; the second approach cannot identify relation information whose subject and object are the same or partially the same entity, so the accuracy of the information extraction result is low and difficult to improve.
With the information extraction method adopted in the embodiments of the present specification, the text processing method described in any of the above embodiments can solve the problems of entity alignment and complex relations in text, and optimize the semantic information used for information extraction, so that clearer semantic information can be obtained from the semantic processing text. This improves the accuracy of the relation recognition result and of the entity-pair extraction result, allows correct components to be obtained from the semantic processing text, and improves the reliability and accuracy of the relation extraction task.
As an alternative example, to solve the problem that relation information whose subject and object are the same or partially the same entity cannot be recognized, the following information recognition processing may be performed in the embodiments of the present specification on the semantic processing text obtained by the text processing method described in any of the above embodiments:
performing relation recognition processing on the semantic processing text to obtain the relation labels corresponding to the semantic processing text; performing relation element recognition processing on the semantic processing text based on each relation label to obtain the relation element pair corresponding to that relation label; and performing analysis processing based on the relation labels and their corresponding relation element pairs to obtain a set of relation triples.
With this scheme, relation recognition processing is performed first, so one or more pieces of relation information existing in the semantic processing text can be identified and a corresponding number of relation labels obtained; relation element recognition is then performed for each relation label to obtain the entity pair corresponding to it. Entity pairs for the specified relation information are thus searched purposefully rather than sampled randomly from the text, which improves the accuracy of the obtained entity pairs and solves the problem that relation information whose subject and object are the same or partially the same entity cannot be recognized. In addition, the entity pairs of the various relation information in the semantic processing text can be accurately identified through the relation labels, avoiding misjudging or missing relation information when judging it from entity pairs, and improving the accuracy of the relation extraction result.
Further optionally, relation recognition processing (one specific application of label classification processing) may be performed on the semantic processing text through a preset relation recognition model, and relation element recognition processing (one specific application of sequence labeling processing) may be performed on the semantic processing text through a preset relation element extraction model. The whole relation extraction process thus comprises a relation recognition model and a relation element extraction model; either can be optimized independently according to the actual situation without adjusting the parameters of the whole relation extraction task, improving parameter tuning efficiency, and either can be replaced according to the actual situation, making the relation extraction task flexible and more widely applicable.
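The two-stage pipeline (relation labels first, then relation element pairs per label) can be sketched generically; the stand-in model functions below are hypothetical placeholders for the two preset models:

```python
def extract_triples(text, classify_relations, extract_elements):
    """Stage 1: relation recognition yields the relation labels for the text.
    Stage 2: per label, relation element recognition yields (subject, object)
    pairs; analysis combines them into SPO relation triples."""
    triples = set()
    for label in classify_relations(text):
        for subj, obj in extract_elements(text, label):
            triples.add((subj, label, obj))
    return triples

# Stand-in "models" for illustration only.
classify = lambda text: ["nominated"] if "nominated" in text else []
elements = lambda text, label: [("Zhang San", "Li Si")] if label == "nominated" else []
spo = extract_triples("Zhang San nominated Li Si.", classify, elements)
```

Because the element extractor is called once per recognized label, entity pairs are searched purposefully for each relation rather than sampled at random.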
In one embodiment of the present specification, the document-level initial text S0 is: { On Monday, Zhang San nominated Li Si as the general manager of a certain company. He chose her because this colleague has rich working experience as a pre-sales director. } The relation elements can be extracted from the initial text S0, specifically including:
1) After reference resolution processing by a reference resolution tool (such as Stanford CoreNLP), the reference relationships existing in the initial text S0 can be obtained from the output reference chains (Chains). For example, the output reference chains may include:
Chain 0:
"Zhang San" in Sentence 1: [0, 2);
"he" in Sentence 2: [0, 1);
Chain 1:
"Li Si" in Sentence 1: [6, 8);
"she" in Sentence 2: [3, 4);
"this colleague" in Sentence 2: [7, 10);
Here "Chain 0" and "Chain 1" denote two separate reference chains, each containing the position information in the initial text S0 of every component under the corresponding reference relationship. For example, "in Sentence 1: [0, 2)" indicates that the component "Zhang San" occupies the first and second positions of the first-order initial sentence; by analogy, the position information in the initial text S0 of all components sharing a reference relationship can be obtained.
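A sketch of how such reference chains could drive the replacement step; the word-level sentences, indices, and chain structure below are hypothetical, merely mirroring the output format above:

```python
# Hypothetical sentences and chains, word-indexed like the [start, end) spans above.
sentences = [
    "Zhang San on Monday nominated Li Si as general manager .".split(),
    "He chose her because this colleague has rich experience .".split(),
]

# Each chain: a main component plus non-main mentions, as (sentence, start, end).
chains = {
    "Chain 0": {"main": (0, 0, 2), "others": [(1, 0, 1)]},
    "Chain 1": {"main": (0, 5, 7), "others": [(1, 2, 3), (1, 4, 6)]},
}

def resolve(sents, chain, which):
    """Produce one reference resolution text by substituting the main
    component's words for a single non-main mention."""
    mi, ms, me = chain["main"]
    main_words = sents[mi][ms:me]
    i, s, e = chain["others"][which]
    out = [list(sent) for sent in sents]   # copy, leaving the original intact
    out[i][s:e] = main_words
    return [" ".join(sent) for sent in out]

resolved = resolve(sentences, chains["Chain 0"], 0)
# resolved[1] now begins with the main component "Zhang San" instead of "He".
```

Replacing different subsets of non-main mentions produces the different reference resolution texts discussed below.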
2) According to the preset terminator set {period, semicolon, question mark, exclamation mark}, the initial text S0 is preprocessed and split into at least one initial sentence, yielding the preprocessed initial text S0 = {s1, s2}, where s1 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s2 = {He chose her because this colleague has rich working experience as a pre-sales director.}.
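The sentence-splitting step may be sketched as follows; the function name and the exact terminator handling are assumptions for illustration only:

```python
import re

# Minimal sketch: split an initial text into initial sentences at the
# preset terminator set {period, semicolon, question mark, exclamation mark}.
TERMINATORS = "。；？！.;?!"  # Chinese and ASCII variants (assumed)

def split_sentences(text):
    # Split *after* each terminator so it stays attached to its sentence,
    # then drop empty fragments.
    parts = re.split(r"(?<=[" + re.escape(TERMINATORS) + r"])", text)
    return [p.strip() for p in parts if p.strip()]

sentences = split_sentences("First sentence. Second sentence? Third!")
# sentences holds three initial sentences with terminators attached.
```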
3) One main component is selected from each of "Chain 0" and "Chain 1"; for example, the proper nouns "Zhang San" and "Li Si" are selected as the main components of "Chain 0" and "Chain 1" respectively, the other components in each chain being non-main components. According to a preset replacement principle, reference resolution processing is then performed on the initial text S0, replacing at least one non-main component of the initial text S0 to obtain the reference resolution texts. For example, the following three reference resolution texts S1, S2 and S3 can be obtained, in which the non-main components have been replaced by the corresponding main components:
S1 = {s11, s12}, where s11 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s12 = {Zhang San chose Li Si because this colleague has rich working experience as a pre-sales director.};
S2 = {s21, s22}, where s21 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s22 = {Zhang San chose her because Li Si has rich working experience as a pre-sales director.};
S3 = {s31, s32}, where s31 = {Zhang San on Monday appointed Li Si as the general manager of a certain company.} and s32 = {Zhang San chose Li Si because Li Si has rich working experience as a pre-sales director.}.
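The replacement of a non-main component by the main component of its chain may be sketched as follows; the function and variable names are illustrative assumptions:

```python
# Hypothetical sketch of the replacement principle: splice the main
# component of a reference chain over a non-main component's
# half-open [start, end) span in a sentence.
def resolve(sentence, span, main_component):
    start, end = span
    return sentence[:start] + main_component + sentence[end:]

s2 = "He chose her because this colleague has rich experience."
# Replace the non-main component "He" (span [0, 2)) with the main
# component "Zhang San" of its chain.
resolved = resolve(s2, (0, 2), "Zhang San")
# resolved == "Zhang San chose her because this colleague has rich experience."
```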
4) Entity recognition processing is performed on the reference resolution texts S1, S2 and S3 respectively; the same-order reference resolution sentences in the reference resolution texts are acquired, and the same-order reference resolution sentences are screened based on their entity recognition results to obtain the screened reference resolution sentences.
For example, the first-order reference resolution sentences of the texts S1, S2 and S3 are s11, s21 and s31 respectively, and the entity recognition results of s11, s21 and s31 are the same. In this case, the entity prediction probability values of s11, s21 and s31 can be obtained from the entity prediction probability matrices corresponding to these reference resolution sentences, and the sentences are screened on that basis; assume the retained reference resolution sentence is s11.
As another example, the second-order reference resolution sentences of the texts S1, S2 and S3 are s12, s22 and s32, where the entity recognition result of s12 is contained in that of s22, and the entity recognition results of s22 and s32 are the same. s22 and s32 are therefore screened according to their entity prediction probability values; assume the retained reference resolution sentence is s22. It can then be judged whether there is an identified reference resolution sentence whose number of entities equals the total number of entities; if the number of entities identified in s22 equals the total number of entities, s22 can be retained while s12 and s32 are deleted.
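The screening logic of step 4) may be sketched as follows; the data layout, function name and scores are assumptions, and in the described method the scores would come from the entity prediction probability matrices:

```python
# Hypothetical screening of same-order reference resolution sentences.
# Each candidate is (sentence id, recognized entity set, aggregate
# entity prediction probability value).
def screen(candidates):
    # Drop a candidate whose entity set is strictly contained in
    # another candidate's entity recognition result.
    kept = [c for c in candidates
            if not any(c[1] < other[1] for other in candidates)]
    # Among the remaining candidates, retain the highest-scoring one.
    return max(kept, key=lambda c: c[2])[0]

best = screen([
    ("s12", {"Zhang San"}, 0.90),           # contained in s22's result
    ("s22", {"Zhang San", "Li Si"}, 0.85),
    ("s32", {"Zhang San", "Li Si"}, 0.80),  # same result, lower score
])
# best == "s22"
```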
5) The screened reference resolution sentences are merged to obtain the semantic processing text S0′ = {s11, s22}.
6) Relation recognition processing is performed on the sentences s11 and s22 in the semantic processing text S0′ respectively to obtain the relation labels corresponding to s11 and s22. For example, the relation labels of sentence s11 may include {company, job, colleague}, and the relation labels of sentence s22 may include {job}.
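Relation recognition as label classification may be sketched as follows; the label set, threshold and scores are illustrative assumptions, since the preset relation recognition model itself is not specified here:

```python
# Hypothetical sketch: multi-label relation recognition as thresholded
# per-label probabilities for one sentence of the semantic processing text.
LABELS = ["company", "job", "colleague"]

def labels_from_scores(scores, threshold=0.5):
    """scores: per-label probabilities emitted by a relation
    recognition model (assumed) for one sentence."""
    return [label for label, p in zip(LABELS, scores) if p > threshold]

tags = labels_from_scores([0.91, 0.72, 0.66])
# tags == ["company", "job", "colleague"]
```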
7) According to the sentence s11 and its corresponding relation labels, and the sentence s22 and its corresponding relation labels in the semantic processing text S0′, relation element recognition processing is performed respectively to obtain the relation element pairs corresponding to the relation labels of s11 and s22.
For example, the relation element pairs of sentence s11 may include: the pair {Zhang San, a certain company} corresponding to the relation label "company", the pair {Li Si, general manager} corresponding to the relation label "job", and the pair {Zhang San, Li Si} corresponding to the relation label "colleague"; the relation element pairs of sentence s22 may include: the pair {Li Si, pre-sales director} corresponding to the relation label "job".
8) According to the relation labels and the corresponding relation element pairs, a set of relation triples can be obtained through parsing processing, specifically including:
{ "subject": Zhang III "," predict ": company", "object": certain company "};
{ "subject": Liqu "," predicate ": company", "object": certain company "};
{ "subject": lie four "," predict ": post", "object": total manager "};
{ "subject": lie four "," predict ": post", "object": front sales manager "};
{ "subject": Zhang III "," predict ": colleague", "object": Li IV ".
As an optional example, an equivalence replacement relation label set may be provided. If a relation label of the semantic processing text matches an equivalence replacement label in the set, a subject-object swap may be performed during parsing on the relation element pair corresponding to that equivalence replacement label, so as to obtain a pair of relation triples having an equivalence replacement relationship.
For example, suppose the equivalence replacement relation label set includes the classification label "colleague" representing colleague relationship information, and the semantic processing text is: {Xiaoming and Xiaohong are colleagues.}. Through relation recognition processing, it is obtained that colleague relationship information exists in the semantic processing text and that the related relation element pair is "Xiaoming" and "Xiaohong". During parsing processing, a subject-object swap is performed on "Xiaoming" and "Xiaohong" according to the equivalence replacement relation label set, so as to obtain a pair of relation triples with an equivalence replacement relationship, namely {"subject": "Xiaoming", "predicate": "colleague", "object": "Xiaohong"} and {"subject": "Xiaohong", "predicate": "colleague", "object": "Xiaoming"}.
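The subject-object swap for equivalence replacement labels may be sketched as follows; the label set and function name are assumptions:

```python
# Hypothetical sketch: for a relation label in the equivalence
# replacement relation label set, emit the triple together with its
# subject-object-swapped counterpart.
EQUIVALENCE_LABELS = {"colleague"}

def expand_equivalent(triple):
    result = [triple]
    if triple["predicate"] in EQUIVALENCE_LABELS:
        result.append({"subject": triple["object"],
                       "predicate": triple["predicate"],
                       "object": triple["subject"]})
    return result

pair = expand_equivalent(
    {"subject": "Xiaoming", "predicate": "colleague", "object": "Xiaohong"})
# pair contains both orderings of the colleague relation.
```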
The present specification further provides a text processing device, which may include a first memory and a first processor, where the first memory stores computer instructions executable on the first processor, and the first processor may execute the steps of the method according to any one of the foregoing embodiments of the present specification when executing the computer instructions.
In a specific implementation, the text processing device may further include a first display interface and a first display accessed through the first display interface. The first display may display a semantically processed text obtained by the first processor executing the text processing method provided in the embodiments of the present specification.
An embodiment of the present specification further provides an information extraction system, including the text processing apparatus and the information acquisition apparatus described in any of the above embodiments, where the text processing apparatus and the information acquisition apparatus establish a communication connection through a communication interface, where:
the information acquisition device comprises a second memory and a second processor; wherein the second memory is adapted to store one or more computer instructions that when executed by the second processor perform the steps of:
and performing information identification processing on the semantic processing text obtained by the text processing equipment, and acquiring corresponding components in the semantic processing text based on an information identification result.
In a specific implementation, the information obtaining device may further include a second display interface and a second display accessed through the second display interface. The second display may display a result of the information identification processing performed by the second processor in the information identification processing step provided in the embodiment of the present specification.
It is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or imply the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of the feature. Moreover, the terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the specification described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
The embodiment of the present invention further provides a computer-readable storage medium, on which computer instructions are stored, and when the computer instructions are executed, the steps of the method according to any of the above embodiments of the present invention may be executed. The computer readable storage medium may be various suitable readable storage media such as an optical disc, a mechanical hard disc, a solid state hard disc, and the like. The instructions stored in the computer-readable storage medium may be used to execute the method according to any of the embodiments, which may specifically refer to the embodiments described above and will not be described again.
The computer-readable storage medium may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit, for example, memory, removable or non-removable media, erasable or non-erasable media, writeable or re-writeable media, digital or analog media, hard disk, floppy disk, compact disk read Only memory (CD-ROM), compact disk recordable (CD-R), compact disk Rewriteable (CD-RW), optical disk, magnetic media, magneto-optical media, removable memory cards or disks, various types of Digital Versatile Disk (DVD), a tape, a cassette, or the like.
The computer instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.
Although the embodiments of the present specification are disclosed above, the embodiments of the present specification are not limited thereto. Various changes and modifications may be effected by one skilled in the art without departing from the spirit and scope of the embodiments herein described, and it is intended that the scope of the embodiments herein described be limited only by the scope of the appended claims.

Claims (10)

1. A method of text processing, comprising:
identifying a reference relationship existing in the initial text;
performing reference resolution processing on the initial text based on the identified reference relationship to obtain a processed reference resolution text;
respectively carrying out entity recognition processing on each reference resolution text;
acquiring the reference resolution sentences in the same order in each reference resolution text, and screening the reference resolution sentences in the same order based on the entity recognition results of the reference resolution sentences in the same order;
and merging the filtered reference resolution sentences to obtain a semantic processing text.
2. The text processing method according to claim 1, wherein the performing the reference resolution processing on the initial text based on the identified reference relationship to obtain a processed reference resolution text comprises:
acquiring the components of the reference relationship from the initial text, and selecting the main components of the reference relationship from the components;
and acquiring part or all of the components with the same reference relation with the main components from the initial text, replacing the components with the main components, and obtaining a processed reference resolution text.
3. The text processing method according to claim 1 or 2, wherein the performing entity recognition processing on each reference resolution text respectively comprises:
respectively inputting each reference resolution text into a preset entity recognition model to obtain an entity prediction probability matrix of each reference resolution text;
determining distribution positions which accord with a preset first condition in the entity prediction probability matrix of each reference resolution text, and taking components of the corresponding distribution positions in each reference resolution text as entities to obtain an entity recognition result.
4. The text processing method according to claim 3, further comprising, before the performing entity identification processing on each of the reference resolution texts, respectively:
inputting a preset training corpus and an entity true probability matrix of the training corpus into the entity recognition model for training to obtain an entity prediction probability matrix of the training corpus;
performing error calculation based on the entity prediction probability matrix of the training corpus and the entity real probability matrix of the training corpus to obtain a result error value;
and if the result error value meets a preset training completion condition, finishing the training of the entity recognition model, otherwise, adjusting the parameters of the entity recognition model, and inputting the training corpus and the entity true probability matrix of the training corpus into the entity recognition model after the parameters are adjusted to train until the entity recognition model completes the training.
5. The text processing method according to claim 4, wherein the entity prediction probability matrix of each reference resolution text comprises: a first prediction probability vector used for representing the entity prediction starting positions in each reference resolution text, and a second prediction probability vector used for representing the entity prediction ending positions in each reference resolution text;
the determining the distribution positions in the entity prediction probability matrix of each reference resolution text, which meet a preset first condition, and taking the components of the corresponding distribution positions in each reference resolution text as entities, includes:
comparing the first prediction probability vector with a preset first threshold value, determining distribution positions with probability values larger than the first threshold value in the first prediction probability vector, and obtaining entity prediction initial position distribution information of each reference resolution text;
comparing the second prediction probability vector with a preset second threshold value, determining distribution positions of which the probability values in the second prediction probability vector are larger than the second threshold value, and obtaining entity prediction end position distribution information of each reference resolution text;
and obtaining entity distribution position intervals of each reference resolution text based on the entity prediction starting position distribution information and the entity prediction ending position distribution information of each reference resolution text, and obtaining components in the corresponding distribution position intervals in each reference resolution text as entities.
6. The method of claim 5, wherein the entity prediction probability matrix of the corpus comprises: a third prediction probability vector used for representing the entity prediction starting position in the training corpus and a fourth prediction probability vector used for representing the entity prediction ending position in the training corpus; the entity true probability matrix of the training corpus comprises: a first real probability vector used for representing a real starting position of an entity in the training corpus and a second real probability vector used for representing a real ending position of the entity in the training corpus;
the performing error calculation based on the entity prediction probability matrix of the training corpus and the entity real probability matrix of the training corpus to obtain a result error value comprises:
error calculation is performed by using the following loss function to obtain a result error value:
L = −Σi [ y_si · log(ŷ_si) + y_ei · log(ŷ_ei) ]

wherein y_si is the ith probability value in the first real probability vector; y_ei is the ith probability value in the second real probability vector; ŷ_si is the ith probability value in the third prediction probability vector; ŷ_ei is the ith probability value in the fourth prediction probability vector; and i is a natural number.
7. An information extraction method, comprising:
obtaining a semantically processed text by the text processing method of any one of claims 1-6;
and performing information identification processing on the semantic processing text, and extracting corresponding components in the semantic processing text based on an information identification result.
8. A text processing apparatus includes a first memory and a first processor; wherein the first memory is adapted to store one or more computer instructions, wherein the first processor when executing the computer instructions performs the steps of the method of any one of claims 1 to 6.
9. An information extraction system, comprising the text processing device of claim 8 and an information acquisition device, wherein:
the information acquisition device comprises a second memory and a second processor; wherein the second memory is adapted to store one or more computer instructions that when executed by the second processor perform the steps of:
and performing information identification processing on the semantic processing text obtained by the text processing equipment, and acquiring corresponding components in the semantic processing text based on an information identification result.
10. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions when executed perform the steps of the method of any one of claims 1 to 7.
CN202010537652.4A 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium Pending CN111695054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537652.4A CN111695054A (en) 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium


Publications (1)

Publication Number Publication Date
CN111695054A true CN111695054A (en) 2020-09-22

Family

ID=72480710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537652.4A Pending CN111695054A (en) 2020-06-12 2020-06-12 Text processing method and device, information extraction method and system, and medium

Country Status (1)

Country Link
CN (1) CN111695054A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168947A (en) * 2017-04-19 2017-09-15 成都准星云学科技有限公司 A kind of method and its system of new entity reference resolution
CN109271631A (en) * 2018-09-12 2019-01-25 广州多益网络股份有限公司 Segmenting method, device, equipment and storage medium
CN109325098A (en) * 2018-08-23 2019-02-12 上海互教教育科技有限公司 Reference resolution method for the parsing of mathematical problem semanteme
CN110705206A (en) * 2019-09-23 2020-01-17 腾讯科技(深圳)有限公司 Text information processing method and related device


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463942A (en) * 2020-12-11 2021-03-09 深圳市欢太科技有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN112541346A (en) * 2020-12-24 2021-03-23 北京百度网讯科技有限公司 Abstract generation method and device, electronic equipment and readable storage medium
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium
CN116776886B (en) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination