WO2020119075A1 - General text information extraction method and apparatus, computer device and storage medium - Google Patents

General text information extraction method and apparatus, computer device and storage medium

Info

Publication number
WO2020119075A1
WO2020119075A1 PCT/CN2019/093158 CN2019093158W WO2020119075A1 WO 2020119075 A1 WO2020119075 A1 WO 2020119075A1 CN 2019093158 W CN2019093158 W CN 2019093158W WO 2020119075 A1 WO2020119075 A1 WO 2020119075A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
target
text
model
syntactic
Prior art date
Application number
PCT/CN2019/093158
Other languages
English (en)
Chinese (zh)
Inventor
郑子欧
刘媛源
张翔
于修铭
汪伟
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020119075A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Definitions

  • the present application relates to a general text information extraction method, device, computer equipment and storage medium.
  • a general text information extraction method including:
  • the target extraction information corresponding to the text to be processed is determined according to the annotated text and the syntactic and semantic analysis results.
  • a general text information extraction device including:
  • a rule acquisition module which is used to input the model training samples into a labeling model for labeling to obtain labeling rules corresponding to the model training samples;
  • the text labeling module is used to establish a basic labeling model according to the labeling rules, input the text to be processed into the basic labeling model for labeling, and obtain a labeling sequence;
  • the text determination module is used to obtain a sequence digestion rule corresponding to the annotation sequence, and determine the annotation text corresponding to the annotation sequence according to the sequence digestion rule;
  • a feature acquisition module for acquiring target syntactic features and target semantic features in the annotated text;
  • a syntactic and semantic analysis module used to input the target syntactic features and the target semantic features into a trained syntactic and semantic analysis model for analysis to obtain syntactic and semantic analysis results corresponding to the marked text;
  • the target information extraction module is configured to determine target extraction information corresponding to the text to be processed according to the marked text and the syntactic and semantic analysis results.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • the target extraction information corresponding to the text to be processed is determined according to the annotated text and the syntactic and semantic analysis results.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the target extraction information corresponding to the text to be processed is determined according to the annotated text and the syntactic and semantic analysis results.
  • FIG. 1 is an application environment diagram of a general text information extraction method according to one or more embodiments;
  • FIG. 2 is a flowchart of a general text information extraction method according to one or more embodiments;
  • FIG. 3 is a flowchart of a method for acquiring an annotation sequence in a general text information extraction method according to one or more embodiments;
  • FIG. 4 is a flowchart of a method for acquiring target features in a general text information extraction method according to one or more embodiments;
  • FIG. 5 is a schematic structural diagram of a general text information extraction device according to one or more embodiments;
  • FIG. 6 is a block diagram of a computer device according to one or more embodiments.
  • the general text information extraction method provided in the embodiment of the present invention can be applied to the application environment shown in FIG. 1.
  • the server 120 obtains model training samples and text to be processed.
  • the model training samples and text to be processed can be input to the terminal 110 or
  • the server 120 inputs the model training samples into the labeling model for labeling, and can obtain labeling rules corresponding to the model training samples.
  • the server 120 establishes a basic labeling model according to the labeling rules, and inputs the text to be processed into the basic labeling model for labeling.
  • The server 120 obtains the sequence digestion rule corresponding to the labeling sequence and determines the labeling text corresponding to the labeling sequence according to the sequence digestion rule. The server 120 then obtains the target syntactic features and target semantic features in the labeling text and inputs them into the trained syntactic and semantic analysis model for analysis, obtaining the syntactic and semantic analysis result corresponding to the marked text.
  • the server 120 determines the target extraction information corresponding to the text to be processed according to the marked text and the syntactic and semantic analysis result.
  • The following embodiment uses the general text information extraction method applied to the server 120 in FIG. 1 as an example for description, but it should be noted that, in actual application, the method is not limited to the above server.
  • As shown in FIG. 2, a flowchart of a general text information extraction method in an embodiment is provided. The method specifically includes the following steps:
  • Step 202: Obtain model training samples and text to be processed.
  • Model training samples are used to obtain labeling rules and establish basic labeling models.
  • the number of model training samples is less than a preset threshold. In one of the embodiments, there may be 3 to 5 model training samples.
  • the text to be processed is a sample of the same type as the model training sample, and the target extraction information exists in the text to be processed.
  • the training sample and the text to be processed may be text information in various fields, such as various types of contracts, resumes, and web page source text information.
  • The model training samples and the text to be processed may be samples input by end users, such as text entered through user interaction devices such as keyboards and touch screens, or may be samples obtained online.
  • Step 204: Input the model training samples into the labeling model for labeling to obtain labeling rules corresponding to the model training samples.
  • the labeling rules are used for transfer learning of the text to be processed.
  • For example, when the text to be processed is the text of a major contract, the user provides samples, treats the extraction of information for a required field (such as Party A) as a task, and labels a small number of samples (such as 5) online. After learning and modeling, the information of the same field (such as Party A) can be extracted from other contract samples provided by the user.
  • The labeling method adopted by the trained labeling model is sequence labeling, which converts the problem of text information extraction into a sequence labeling problem: all unrelated text is marked as O, the first character of the target is marked as B-target, and the remaining characters of the target are marked as I-target.
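  • The O / B-target / I-target scheme described above can be sketched as follows. This is a minimal illustration, not the application's labeling model: the function name `bio_tags`, the exact-substring matching, and the sample text are all assumptions made for the example.

```python
from typing import List


def bio_tags(text: str, target: str, label: str = "target") -> List[str]:
    """Tag each character of `text`: O outside the target, B-/I- inside it."""
    tags = ["O"] * len(text)
    if not target:
        return tags
    # Assumption: the target is located by exact substring match.
    start = text.find(target)
    while start != -1:
        tags[start] = f"B-{label}"            # first character of the target
        for i in range(start + 1, start + len(target)):
            tags[i] = f"I-{label}"            # remaining characters of the target
        start = text.find(target, start + len(target))
    return tags


if __name__ == "__main__":
    sample = "Party A: Acme"  # hypothetical contract fragment
    for ch, tag in zip(sample, bio_tags(sample, "Acme")):
        print(repr(ch), tag)
```

  • In a real system the tags would come from a trained sequence labeler; the sketch only shows the shape of the labeling sequence that the later steps consume.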
  • Step 206: Establish a basic labeling model according to the labeling rules, input the text to be processed into the basic labeling model for labeling, and obtain a labeling sequence.
  • The basic labeling model includes the labeling rules used to label the model training samples.
  • The process of inputting the text to be processed into the basic labeling model for labeling is a process of transfer learning. The rules used for labeling the model training samples are learned further; that is, applying the labeling rules to the text to be processed improves the efficiency of the labeling process and makes the obtained labeling sequence more accurate.
  • Step 208: Acquire sequence digestion rules corresponding to the annotation sequence, and determine the annotation text corresponding to the annotation sequence according to the sequence digestion rules.
  • the annotated text is the field information corresponding to the annotated sequence and existing in the text to be processed.
  • The sequence digestion rule is a resolution rule: the annotation sequence is obtained by annotating the text to be processed, and the annotation sequence is then used to locate each annotation text. For example, take the sentence "This year's sea fishing competition will be held in the waters between Xiamen and Kinmen." In the obtained labeling sequence, which is tagged character by character in the original language, the three characters of "Xiamen City" are tagged B-LOC, I-LOC, and E-LOC, the two characters of "Kinmen" are tagged B-LOC and E-LOC, and every other character is tagged O.
  • The labeling text corresponding to the names of people, places, organizations and other information is obtained from the labeling sequence; in the example above, these are the place names "Xiamen City" and "Kinmen". Specifically, determining the annotated text corresponding to the text to be processed means using reference resolution to find the actual object that a pronoun in the contract announcement refers to.
  • Reference resolution is the problem of determining which noun phrase a pronoun refers to within a text.
  • The referring expression can be understood as being present in the annotation sequence, and the actual object it refers to is the label text.
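  • Locating annotation text from a labeling sequence with B-, I-, and E- tags, as in the place-name example above, might look like the sketch below. The function `decode_spans` and the romanized tokens are hypothetical; the sketch covers only tag decoding and does not attempt reference resolution.

```python
from typing import List, Tuple


def decode_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from B-/I-/E- tags; O ends any open span."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B":
            if current:  # a new span starts before the previous one was closed
                spans.append((label, "".join(current)))
            current, label = [token], name
        elif prefix in ("I", "E") and current and name == label:
            current.append(token)
            if prefix == "E":  # E closes the span explicitly
                spans.append((label, "".join(current)))
                current, label = [], ""
        else:  # O, or an inconsistent tag: close whatever is open
            if current:
                spans.append((label, "".join(current)))
            current, label = [], ""
    if current:
        spans.append((label, "".join(current)))
    return spans


if __name__ == "__main__":
    # Romanized stand-ins for the characters of the example sentence.
    tokens = ["in", "Xia", "men", "City", "and", "Kin", "men", "waters"]
    tags = ["O", "B-LOC", "I-LOC", "E-LOC", "O", "B-LOC", "E-LOC", "O"]
    print(decode_spans(tokens, tags))
```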
  • Step 210: Obtain target syntactic features and target semantic features in the marked text.
  • The target syntactic features and target semantic features are feature information existing in the text to be processed, and are input into the trained syntactic and semantic analysis model for syntactic analysis and semantic analysis.
  • Step 212: Input the target syntactic features and target semantic features into the trained syntactic and semantic analysis model for analysis, and obtain the syntactic and semantic analysis results corresponding to the marked text.
  • the trained syntactic and semantic analysis models include syntactic feature analysis and shallow semantic feature analysis.
  • Syntactic feature analysis is the process of analyzing the input text sentence to get the sentence syntactic structure.
  • Syntactic analysis can be divided into the following three types: (1) phrase structure syntactic analysis, which identifies the phrase structures in a sentence and the hierarchical syntactic relationships between phrases; (2) dependency syntactic analysis, which identifies the interdependence between words in a sentence; and (3) deep grammar syntactic analysis, which performs deep syntactic and semantic analysis of sentences.
  • Shallow semantic feature analysis refers to the use of various machine learning methods to learn and understand the semantic content represented by a paragraph of text. A piece of text is usually composed of words, sentences, and paragraphs.
  • semantic analysis can be further decomposed into lexical-level semantic analysis, sentence-level semantic analysis, and chapter-level semantic analysis.
  • lexical-level semantic analysis focuses on how to obtain or distinguish the semantics of words.
  • Sentence-level semantic analysis attempts to analyze the semantics expressed by the entire sentence, while chapter-level semantic analysis studies the internal structure of the text and the semantic relations between text units (which can be sentences, clauses, or paragraphs).
  • By inputting the target syntactic features and target semantic features into the trained syntactic and semantic analysis model for analysis, the syntactic and semantic analysis results corresponding to the marked text can be obtained.
  • Step 214: Determine target extraction information corresponding to the text to be processed according to the marked text and the results of syntactic and semantic analysis.
  • the target extraction information is field information existing in the text to be processed.
  • the target extraction information may be the field of “Party A”.
  • In the above general text information extraction method, the model training samples and the text to be processed are obtained, and the model training samples are input into the labeling model for labeling to obtain labeling rules corresponding to the model training samples. The labeling rules allow the labeling to be migrated to the text to be processed.
  • A basic labeling model is then established according to the labeling rules, and the text to be processed is input into the basic labeling model for labeling to obtain a labeling sequence, which provides the prerequisite for subsequent syntactic and semantic analysis. The marked text is determined from the labeling sequence, and the target syntactic features and target semantic features in the marked text are obtained.
  • The target syntactic features and target semantic features make it possible to extract different types of text with only a small number of labeled samples. They are input into the trained syntactic and semantic analysis model for analysis to obtain the syntactic and semantic analysis results corresponding to the marked text, which yields accurate syntactic analysis and semantic analysis for the marked text.
  • Finally, the target extraction information corresponding to the text to be processed is determined based on the marked text and the syntactic and semantic analysis results, so that various types of text information can be extracted with only a small number of samples.
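  • The data flow through steps 206 to 214 can be summarized as a chain of stages. The skeleton below is only a sketch of that flow: the class `ExtractionPipeline` and its field names are hypothetical, and every stage is injected as a plain callable because the application does not specify concrete model interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Features = Tuple[Any, Any]  # (target syntactic features, target semantic features)


@dataclass
class ExtractionPipeline:
    """Wires the stages of steps 206 to 214 together; every stage is injected."""
    label: Callable[[str], List[str]]         # step 206: text -> labeling sequence
    resolve: Callable[[str, List[str]], str]  # step 208: sequence -> annotated text
    featurize: Callable[[str], Features]      # step 210: annotated text -> features
    analyze: Callable[[Any, Any], Any]        # step 212: features -> analysis result
    decide: Callable[[str, Any], str]         # step 214: final target extraction

    def run(self, text: str) -> str:
        sequence = self.label(text)
        annotated = self.resolve(text, sequence)
        syntactic, semantic = self.featurize(annotated)
        analysis = self.analyze(syntactic, semantic)
        return self.decide(annotated, analysis)


if __name__ == "__main__":
    # Toy stand-ins for the trained models, for illustration only.
    pipeline = ExtractionPipeline(
        label=lambda t: ["O"] * len(t),
        resolve=lambda t, seq: t.strip(),
        featurize=lambda a: (a.split(), len(a)),
        analyze=lambda syn, sem: {"words": syn, "length": sem},
        decide=lambda a, result: result["words"][-1],
    )
    print(pipeline.run(" Party A: Acme "))
```

  • Keeping each stage behind a callable mirrors the module decomposition of the device described later (modules 502 to 514), so a stage can be replaced without touching the others.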
  • the method further includes the following steps:
  • Step 302: Input the text to be processed into the trained word segmentation model for word segmentation to obtain a word segmentation result.
  • Step 304: Obtain word segmentation error resolution rules corresponding to the text to be processed.
  • Word segmentation error resolution rules are used to resolve errors that occur in the process of word segmentation.
  • Word segmentation error resolution rules include word segmentation ambiguity resolution, new word recognition, and standardization of erroneous words (and homophonic characters).
  • Word segmentation ambiguity refers to the fact that a word string can be segmented in different ways in a sentence, and ambiguity resolution chooses among these segmentations.
  • For example, "table tennis racket / sold out" can also be segmented as "ping pong / racket / sold / finished" or as "table tennis / auction / finished". New word recognition refers to recognizing words that have not appeared in the training data, including newly coined words and old words with new uses. Standardization of erroneous words and homophones is needed because the input sentence will inevitably contain some typos or deliberate homophones (such as a homophone standing for "want to cry"; "Blue Slim" -> "Uncomfortable"; "Blue Mushroom" -> "Sad"; and so on).
  • the target word segmentation information is obtained through the word segmentation error elimination rules, and then the target word segmentation information is annotated, which can better label the text information, so as to achieve the purpose of extracting the text information more accurately.
  • Step 306: Filter the word segmentation results according to the word segmentation error elimination rules to obtain target word segmentation information.
  • During word segmentation there can be a variety of candidate word segmentation results, such as "table tennis racket / sold out", "ping pong / racket / sold / finished", and "table tennis / auction / finished".
  • The word segmentation results are filtered by the word segmentation error elimination rules to obtain the target word segmentation information. For the results above, the target word segmentation information is "ping pong / racket / sell / finished".
  • Step 308: Input the target word segmentation information into the basic labeling model for labeling to obtain a labeling sequence.
  • Inputting the target word segmentation information obtained by filtering into the basic labeling model for labeling can obtain a more accurate labeling sequence, which will be more accurate when the target extraction information is subsequently extracted.
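  • Steps 302 to 306 amount to generating candidate segmentations and filtering them with rules. The sketch below uses an English stand-in ("nowhere") rather than the original example, and both the dictionary-based enumeration and the two filter rules are assumptions chosen for illustration; the application does not specify how its word segmentation model or its rules are implemented.

```python
from typing import Callable, List, Set

Segmentation = List[str]
Rule = Callable[[Segmentation], bool]


def segmentations(text: str, vocab: Set[str]) -> List[Segmentation]:
    """Enumerate every way to split `text` into words from `vocab`."""
    if not text:
        return [[]]
    results: List[Segmentation] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


def filter_candidates(candidates: List[Segmentation], rules: List[Rule]) -> List[Segmentation]:
    """Keep only the candidates accepted by every rule."""
    return [c for c in candidates if all(rule(c) for rule in rules)]


if __name__ == "__main__":
    vocab = {"no", "now", "where", "here", "nowhere"}
    candidates = segmentations("nowhere", vocab)
    # Hypothetical rules: reject splits containing "no", and reject the unsplit string.
    rules: List[Rule] = [lambda seg: "no" not in seg, lambda seg: len(seg) > 1]
    print(candidates)
    print(filter_candidates(candidates, rules))
```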
  • In this embodiment, the word segmentation results are obtained by inputting the text to be processed into the trained word segmentation model, and the word segmentation results are then filtered using the word segmentation error resolution rules to obtain the target word segmentation information. Inputting the target word segmentation information into the basic labeling model for labeling yields a more accurate labeling sequence and improves the efficiency and accuracy of information extraction.
  • The method further includes: displaying the target extraction information and obtaining the information update result corresponding to the target extraction information; inputting the information update result into the syntactic and semantic analysis model for analysis to obtain updated syntactic and semantic analysis results; updating the syntactic analysis rules and semantic analysis rules according to the updated syntactic and semantic analysis results; and storing the updated syntactic analysis rules and semantic analysis rules in the syntactic and semantic analysis model.
  • The information update result is the text information obtained after the target extraction information, displayed on the terminal, is modified, added to, or deleted through the terminal. The modified, added, and deleted text information is input into the syntactic and semantic analysis model for analysis to obtain the updated syntactic and semantic analysis results. The syntactic analysis rules and semantic analysis rules are updated using these results, and the updated rules are stored in the syntactic and semantic analysis model. This realizes an online learning process in which the syntactic and semantic analysis model is further updated through active modification at the terminal, improving the accuracy of general text information extraction.
  • the method further includes the following steps:
  • Step 402: Obtain syntactic features and semantic features in the marked text.
  • Syntactic features include phrase structure, such as verb phrases and noun phrases. Syntactic features also include syntactic dependence, that is, sentence components such as subject, predicate, and object. Semantic features include lexical-level semantics, sentence-level semantics, and chapter-level semantics.
  • Step 404: Input the syntactic features and semantic features into the trained feature refinement model for feature refinement, to obtain refined syntactic features and refined semantic features.
  • The trained feature refinement model is used to extract finer-grained subcategories of the syntactic and semantic features.
  • Refined syntactic features are features of a smaller class within the syntactic features, and refined semantic features are features of a smaller class within the semantic features.
  • Step 406: Input the refined text syntactic features and the refined text semantic features into the decision tree model corresponding to the text to be processed and perform importance ranking to obtain the feature ranking result.
  • Decision tree model is used to obtain the importance ranking of features.
  • Decision tree model is a very common classification method.
  • The decision tree model is a kind of supervised learning. In supervised learning, a set of samples is given, each with a set of attributes and a category determined in advance; a classifier is obtained through learning, and this classifier can correctly classify newly appearing objects. Specifically, the importance ranking uses an importance threshold: the refined text syntactic features and refined text semantic features whose importance is greater than the preset importance threshold are selected to obtain the feature ranking results.
  • Step 408: Determine the target syntactic features and target semantic features according to the feature ranking results.
  • the result of feature ranking further determines the target syntactic features and target semantic features. Using target syntactic features and target semantic features to analyze the annotated text can extract text information more accurately.
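  • The importance ranking with a threshold in steps 406 and 408 can be illustrated with information gain, the criterion an ID3-style decision tree uses to choose a split. This is a stand-in rather than the application's decision tree model: the function names, the categorical toy features, and the threshold value are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values: Sequence[str], labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting `labels` on a categorical feature."""
    groups: Dict[str, List[str]] = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(
    features: Dict[str, Sequence[str]], labels: Sequence[str], threshold: float
) -> List[Tuple[str, float]]:
    """Sort features by importance and keep those above `threshold`."""
    scored = [(name, information_gain(col, labels)) for name, col in features.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in scored if score > threshold]


if __name__ == "__main__":
    labels = ["target", "target", "other", "other"]
    features = {
        "phrase_type": ["NP", "NP", "VP", "VP"],         # hypothetical syntactic feature
        "sentence_role": ["subj", "obj", "subj", "obj"],  # hypothetical dependency feature
    }
    print(rank_features(features, labels, threshold=0.5))
```

  • A library decision tree would expose comparable per-feature importances after fitting; the hand-written gain computation is used here only to keep the example self-contained.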
  • In this embodiment, the syntactic features and semantic features in the marked text are obtained and input into the trained feature refinement model for feature refinement to obtain refined syntactic features and refined semantic features. The refined text syntactic features and refined text semantic features are then input into the decision tree model corresponding to the text to be processed to obtain the feature ranking results. Finally, the target syntactic features and target semantic features are determined according to the feature ranking results. Using syntactic and semantic analysis in this way makes it possible to extract different types of text information.
  • The method further includes: dividing the model training samples into training samples, verification samples, and test samples; inputting the training samples into the training set corresponding to the basic labeling model for training to obtain target training samples; inputting the target training samples into the verification set corresponding to the basic annotation model for verification to obtain target verification samples; inputting the target verification samples into the test set corresponding to the basic annotation model for testing to obtain target test samples; and updating the basic annotation model according to the target test samples.
  • The model training samples can be divided into training samples, verification samples and test samples, with a ratio of training set : verification set : test set = 6:2:2 over all samples. Neither the samples nor the sample text types overlap between the training, verification, and test sets. It can be understood that the training set is used to train the basic labeling model; later, together with the verification set, different values of the same parameter are selected. Inputting the training samples into the training set for training yields the target training samples.
  • After multiple models have been trained on the training set, the verification set is used to find the most effective basic labeling model: each basic labeling model is used to predict the verification set data, and the model accuracy is recorded in order to select the parameters corresponding to the best-performing basic labeling model. The verification set is thus used to adjust the model parameters; that is, the target training samples are input into the verification set corresponding to the basic labeling model for verification to obtain the target verification samples.
  • After the optimal basic labeling model has been obtained using the training set and the verification set, the test set is used for model prediction to measure the performance and classification ability of the optimal basic labeling model. That is, the test set can be treated as if it did not exist until the model parameters have been determined. The test set is then used to evaluate the model performance, and the target test samples obtained from the test are used to update the basic annotation model.
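  • A 6:2:2 division into non-overlapping training, verification, and test sets might be done as in the following sketch. The function name `split_622`, the shuffling, and the fixed seed are assumptions; the application only states the ratio and that the sets do not overlap.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_622(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split samples into training, verification, and test sets (6:2:2)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * 6 // 10      # integer arithmetic avoids float rounding
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_622([f"contract_{i}" for i in range(10)])
    print(len(train), len(val), len(test))
```

  • Because the three slices are taken from one shuffled list, no sample can appear in more than one set. With very small sample counts the integer division can leave the verification set empty, which a real implementation would need to handle.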
  • In this embodiment, the model training samples are divided into training samples, verification samples, and test samples. The training samples are input into the training set corresponding to the basic annotation model for training to obtain the target training samples, and the target training samples are then input into the verification set corresponding to the basic annotation model for verification to obtain the target verification samples.
  • The target verification samples are input into the test set corresponding to the basic annotation model for testing to obtain the target test samples, and the basic annotation model is then updated according to the target test samples. Updating the basic annotation model is conducive to extracting different types of text information.
  • Although the steps in the flowcharts of FIGS. 2-4 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless clearly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times; nor are they necessarily executed sequentially, but may be executed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • As shown in FIG. 5, a schematic diagram of a general text information extraction device in an embodiment is provided.
  • the device includes:
  • the information obtaining module 502 is used to obtain model training samples and text to be processed;
  • the rule acquisition module 504 is used to input model training samples into the labeling model for labeling to obtain labeling rules corresponding to the model training samples;
  • the text labeling module 506 is used to establish a basic labeling model according to the labeling rules, input the text to be processed into the basic labeling model for labeling, and obtain a labeling sequence;
  • the text determination module 508 is used to obtain a sequence digestion rule corresponding to the annotation sequence, and determine the annotation text corresponding to the annotation sequence according to the sequence digestion rule;
  • the feature obtaining module 510 is used to obtain target syntactic features and target semantic features in the marked text;
  • the syntactic and semantic analysis module 512 is used to input the target syntactic features and target semantic features into the trained syntactic and semantic analysis model for analysis to obtain the syntactic and semantic analysis results corresponding to the marked text;
  • the target information extraction module 514 is used to determine target extraction information corresponding to the text to be processed according to the marked text and the results of syntactic and semantic analysis.
  • The text labeling module includes: a text word segmentation module for inputting the text to be processed into a trained word segmentation model for word segmentation to obtain a word segmentation result; a text digestion module for acquiring word segmentation error resolution rules corresponding to the text to be processed; a target word segmentation acquisition module for filtering the word segmentation results according to the word segmentation error resolution rules to obtain target word segmentation information; and a sequence acquisition module for inputting the target word segmentation information into the basic labeling model for labeling to obtain a labeling sequence.
  • The target information extraction module includes: an information update module for displaying the target extraction information and obtaining information update results corresponding to the target extraction information; an information analysis module for inputting the information update results into the syntactic and semantic analysis model for analysis to obtain updated syntactic and semantic analysis results; a rule update module for updating the syntactic analysis rules and semantic analysis rules based on the updated syntactic and semantic analysis results; and a rule storage module for storing the updated syntactic analysis rules and semantic analysis rules in the syntactic and semantic analysis model.
  • The target syntactic feature and target semantic feature determination module is used to: obtain the syntactic features and semantic features in the annotated text; input the syntactic features and semantic features into the trained feature refinement model for feature refinement to obtain refined syntactic features and refined semantic features; input the refined text syntactic features and refined text semantic features into the decision tree model corresponding to the text to be processed to obtain feature ranking results; and determine the target syntactic features and target semantic features according to the feature ranking results.
  • The basic labeling model update module is used to: divide the model training samples into training samples, verification samples and test samples; input the training samples into the training set corresponding to the basic labeling model for training to obtain target training samples; input the target training samples into the verification set corresponding to the basic labeling model for verification to obtain target verification samples; input the target verification samples into the test set corresponding to the basic labeling model for testing to obtain target test samples; and update the basic labeling model according to the target test samples.
  • Each module in the above-mentioned general text information extraction device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • The above modules may be embedded in, or independent of, the processor in the computer device in the form of hardware, or may be stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • the processor may be a central processing unit (CPU), a microprocessor, a single-chip microcomputer, or the like.
  • the above general text information extraction device may be implemented in a form of computer readable instructions.
  • a computer device is provided, and the computer device may be a server or a terminal.
  • When the computer device is a terminal, its internal structure diagram may be as shown in FIG. 6.
  • the computer device includes a processor, memory, and network interface connected by a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer-readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer-readable instructions are executed by the processor to implement a general text information extraction method.
  • FIG. 6 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • the specific computer equipment may It includes more or fewer components than shown in the figure, or some components are combined, or have a different component arrangement.
  • A computer device includes a memory and one or more processors.
  • The memory stores computer-readable instructions.
  • When executing the computer-readable instructions, the one or more processors perform the following steps:
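  • Obtain model training samples and the text to be processed; input the model training samples into the labeling model for labeling to obtain the labeling rules corresponding to the model training samples; establish a basic labeling model according to the labeling rules, and input the text to be processed into the basic labeling model for labeling to obtain a labeling sequence; obtain the sequence digestion rules corresponding to the labeling sequence, and determine the labeled text corresponding to the labeling sequence according to the sequence digestion rules; obtain target syntactic features and target semantic features in the labeled text; input the target syntactic features and target semantic features into the trained syntactic and semantic analysis model for analysis to obtain the syntactic and semantic analysis results corresponding to the labeled text; and determine the target extraction information corresponding to the text to be processed according to the labeled text and the syntactic and semantic analysis results.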
  • The processor may also implement the following steps when executing the computer-readable instructions: input the text to be processed into the trained word segmentation model for word segmentation to obtain a word segmentation result; obtain a word segmentation error resolution rule corresponding to the text to be processed; filter the word segmentation result according to the word segmentation error resolution rule to obtain target word segmentation information; and input the target word segmentation information into the basic labeling model for labeling to obtain a labeling sequence.
  • The processor may also implement the following steps when executing the computer-readable instructions: display the target extraction information, and obtain the information update result corresponding to the target extraction information; input the information update result into the syntactic analysis model for analysis to obtain updated syntactic and semantic analysis results; update the syntactic analysis rules and semantic analysis rules according to the updated syntactic and semantic analysis results; and store the updated syntactic analysis rules and semantic analysis rules in the syntactic and semantic analysis model.
  • The processor may also implement the following steps when executing the computer-readable instructions: obtain syntactic features and semantic features in the labeled text; input the syntactic features and semantic features into the trained feature refinement model for feature refinement to obtain refined syntactic features and refined semantic features; input the refined syntactic features and refined semantic features into the decision tree model corresponding to the text to be processed to obtain importance ranking results; and determine the target syntactic features and target semantic features according to the importance ranking results.
  • The processor may also implement the following steps when executing the computer-readable instructions: divide the model training samples into training samples, verification samples, and test samples; input the training samples into the training set corresponding to the basic labeling model for training to obtain target training samples; input the target training samples into the verification set corresponding to the basic labeling model for verification to obtain target verification samples; input the target verification samples into the test set corresponding to the basic labeling model for testing to obtain target test samples; and update the basic labeling model according to the target test samples.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: obtain model training samples and the text to be processed; input the model training samples into the labeling model for labeling to obtain the labeling rules corresponding to the model training samples; establish a basic labeling model according to the labeling rules, and input the text to be processed into the basic labeling model for labeling to obtain a labeling sequence; obtain the sequence digestion rules corresponding to the labeling sequence, and determine the labeled text corresponding to the labeling sequence according to the sequence digestion rules; obtain target syntactic features and target semantic features in the labeled text; input the target syntactic features and target semantic features into the trained syntactic and semantic analysis model for analysis to obtain the syntactic and semantic analysis results corresponding to the labeled text; and determine the target extraction information corresponding to the text to be processed according to the labeled text and the syntactic and semantic analysis results.
  • When the computer-readable instructions are executed by the processor, the following steps may also be implemented: input the text to be processed into the trained word segmentation model for word segmentation to obtain a word segmentation result; obtain a word segmentation error resolution rule corresponding to the text to be processed; filter the word segmentation result according to the word segmentation error resolution rule to obtain target word segmentation information; and input the target word segmentation information into the basic labeling model for labeling to obtain a labeling sequence.
  • When the computer-readable instructions are executed by the processor, the following steps may also be implemented: display the target extraction information, and obtain the information update result corresponding to the target extraction information; input the information update result into the syntactic analysis model for analysis to obtain updated syntactic and semantic analysis results; update the syntactic analysis rules and semantic analysis rules according to the updated syntactic and semantic analysis results; and store the updated syntactic analysis rules and semantic analysis rules in the syntactic and semantic analysis model.
  • When the computer-readable instructions are executed by the processor, the following steps may also be implemented: obtain syntactic features and semantic features in the labeled text; input the syntactic features and semantic features into the trained feature refinement model for feature refinement to obtain refined syntactic features and refined semantic features; input the refined syntactic features and refined semantic features into the decision tree model corresponding to the text to be processed to obtain importance ranking results; and determine the target syntactic features and target semantic features according to the importance ranking results.
  • When the computer-readable instructions are executed by the processor, the following steps may also be implemented: divide the model training samples into training samples, verification samples, and test samples; input the training samples into the training set corresponding to the basic labeling model for training to obtain target training samples; input the target training samples into the verification set corresponding to the basic labeling model for verification to obtain target verification samples; input the target verification samples into the test set corresponding to the basic labeling model for testing to obtain target test samples; and update the basic labeling model according to the target test samples.
  • The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
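
The division of model training samples into training, verification and test samples described above can be illustrated with a minimal Python sketch. The description does not state split ratios, shuffling, or a random seed, so the function name `split_samples`, the 80/10/10 defaults and the seeded shuffle are all assumptions made for illustration only; this is not the applicant's implementation.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T],
    train_ratio: float = 0.8,   # assumed ratio, not given in the description
    verify_ratio: float = 0.1,  # assumed ratio, not given in the description
    seed: int = 0,              # assumed: fixed seed for a reproducible shuffle
) -> Tuple[List[T], List[T], List[T]]:
    """Split samples into (training, verification, test) subsets."""
    if train_ratio <= 0 or verify_ratio < 0 or train_ratio + verify_ratio >= 1:
        raise ValueError("ratios must leave a non-empty share for the test set")

    # Shuffle a copy so the caller's sequence is left untouched.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_ratio)
    n_verify = int(len(shuffled) * verify_ratio)

    training = shuffled[:n_train]
    verification = shuffled[n_train:n_train + n_verify]
    test = shuffled[n_train + n_verify:]  # whatever remains goes to the test set
    return training, verification, test


if __name__ == "__main__":
    training, verification, test = split_samples(list(range(10)))
    print(len(training), len(verification), len(test))
```

The three subsets are disjoint and together cover every input sample, so each sample is used in exactly one of the training, verification and testing stages.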

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A general text information extraction method, comprising: inputting a model training sample into a labeling model for labeling to obtain a labeling rule corresponding to the model training sample; establishing a basic labeling model according to the labeling rule, and inputting a text to be processed into the basic labeling model for labeling to obtain a labeling sequence; obtaining a sequence digestion rule corresponding to the labeling sequence, and determining a labeled text corresponding to the labeling sequence according to the sequence digestion rule; obtaining a target syntactic feature and a target semantic feature in the labeled text; inputting the target syntactic feature and the target semantic feature into a trained syntactic and semantic analysis model for analysis to obtain syntactic and semantic analysis results corresponding to the labeled text; and determining target extraction information corresponding to the text to be processed according to the labeled text and the syntactic and semantic analysis results.
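
The order of the steps in the abstract can be followed in a toy Python sketch. Everything in it is a hypothetical stand-in chosen for illustration: the "labeling rule" is a token-to-tag lookup learned from tagged samples, the "sequence digestion rule" merges adjacent tokens that share a tag, and the "analysis" simply keeps tagged spans. The function names, the tag set (`ORG`, `LOC`, `O`) and the sample data are assumptions and do not come from the application, which does not specify the models at this level of detail.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (token or span text, tag)


def learn_labeling_rule(training_samples: List[Tagged]) -> Dict[str, str]:
    """Toy labeling rule: map each token to its most frequent tag."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for token, tag in training_samples:
        counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


def label_text(tokens: List[str], rule: Dict[str, str]) -> List[Tagged]:
    """Toy basic labeling model: tag known tokens, mark the rest as 'O'."""
    return [(token, rule.get(token, "O")) for token in tokens]


def resolve_sequence(sequence: List[Tagged]) -> List[Tagged]:
    """Toy sequence digestion rule: merge adjacent tokens sharing a non-'O' tag."""
    spans: List[Tagged] = []
    for token, tag in sequence:
        if tag != "O" and spans and spans[-1][1] == tag:
            spans[-1] = (spans[-1][0] + " " + token, tag)
        else:
            spans.append((token, tag))
    return spans


def extract_features(labeled_text: List[Tagged]) -> List[dict]:
    """Toy features: span position (syntactic) and tag (semantic)."""
    return [
        {"text": text, "position": index, "tag": tag}
        for index, (text, tag) in enumerate(labeled_text)
    ]


def analyze(features: List[dict]) -> List[dict]:
    """Toy analysis step: keep only spans that carry a non-'O' tag."""
    return [feature for feature in features if feature["tag"] != "O"]


def extract(tokens: List[str], training_samples: List[Tagged]) -> Dict[str, List[str]]:
    """Run the toy pipeline end to end and group extracted spans by tag."""
    rule = learn_labeling_rule(training_samples)
    labeled_text = resolve_sequence(label_text(tokens, rule))
    results = analyze(extract_features(labeled_text))
    info: Dict[str, List[str]] = defaultdict(list)
    for result in results:
        info[result["tag"]].append(result["text"])
    return dict(info)


if __name__ == "__main__":
    samples = [("Ping", "ORG"), ("An", "ORG"), ("Shenzhen", "LOC")]
    print(extract(["Ping", "An", "is", "based", "in", "Shenzhen"], samples))
```

Each function corresponds to one clause of the abstract, which makes the data flow visible: training samples yield a rule, the rule yields a labeling sequence, the sequence is resolved into labeled text, and features of that text drive the final extraction.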
PCT/CN2019/093158 2018-12-10 2019-06-27 General text information extraction method and apparatus, computer device, and storage medium WO2020119075A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811504386.4 2018-12-10
CN201811504386.4A CN109766540B (zh) 2018-12-10 2018-12-10 通用文本信息提取方法、装置、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2020119075A1 (fr) 2020-06-18

Family

ID=66451407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093158 WO2020119075A1 (fr) 2018-12-10 2019-06-27 General text information extraction method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN109766540B (fr)
WO (1) WO2020119075A1 (fr)

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754352A (zh) * 2020-06-22 2020-10-09 平安资产管理有限责任公司 一种观点语句正确性的判断方法、装置、设备和存储介质
CN111797629A (zh) * 2020-06-23 2020-10-20 平安医疗健康管理股份有限公司 医疗文本数据的处理方法、装置、计算机设备和存储介质
CN111814487A (zh) * 2020-07-17 2020-10-23 科大讯飞股份有限公司 一种语义理解方法、装置、设备及存储介质
CN111931515A (zh) * 2020-08-10 2020-11-13 鼎富智能科技有限公司 基于合同纠纷判决书的合同条款效力分析方法及装置
CN111966807A (zh) * 2020-08-18 2020-11-20 中国银行股份有限公司 问答系统的文本处理方法及装置
CN112016451A (zh) * 2020-08-27 2020-12-01 贵州师范大学 一种用于迁移学习的训练样本标注成本削减方法
CN112036179A (zh) * 2020-08-28 2020-12-04 南京航空航天大学 基于文本分类与语义框架的电力预案信息抽取方法
CN112069319A (zh) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 文本抽取方法、装置、计算机设备和可读存储介质
CN112269884A (zh) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 信息抽取方法、装置、设备及存储介质
CN112307908A (zh) * 2020-10-15 2021-02-02 武汉科技大学城市学院 一种视频语义提取方法及装置
CN112329427A (zh) * 2020-11-26 2021-02-05 北京百度网讯科技有限公司 短信样本的获取方法和装置
CN112507702A (zh) * 2020-12-03 2021-03-16 北京百度网讯科技有限公司 文本信息的抽取方法、装置、电子设备及存储介质
CN112560497A (zh) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 语义理解方法、装置、电子设备和存储介质
CN112613501A (zh) * 2020-12-21 2021-04-06 深圳壹账通智能科技有限公司 信息审核分类模型的构建方法和信息审核方法
CN112699688A (zh) * 2021-01-08 2021-04-23 北京理工大学 一种篇章关系可控的文本生成方法和系统
CN113111650A (zh) * 2021-04-16 2021-07-13 中国工商银行股份有限公司 文本处理方法、装置、系统及存储介质
CN113222149A (zh) * 2021-05-31 2021-08-06 联仁健康医疗大数据科技股份有限公司 模型训练方法、装置、设备和存储介质
CN113268601A (zh) * 2021-03-02 2021-08-17 安徽淘云科技股份有限公司 信息提取方法、阅读理解模型训练方法及相关装置
CN113408296A (zh) * 2021-06-24 2021-09-17 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN113487617A (zh) * 2021-07-26 2021-10-08 推想医疗科技股份有限公司 数据处理方法、装置、电子设备以及存储介质
CN113808758A (zh) * 2021-08-31 2021-12-17 联仁健康医疗大数据科技股份有限公司 一种检验数据标准化的方法、装置、电子设备和存储介质
CN113806492A (zh) * 2021-09-30 2021-12-17 中国平安人寿保险股份有限公司 基于语义识别的记录生成方法、装置、设备及存储介质
CN113823271A (zh) * 2020-12-18 2021-12-21 京东科技控股股份有限公司 语音分类模型的训练方法、装置、计算机设备及存储介质
CN114020877A (zh) * 2021-11-18 2022-02-08 中科雨辰科技有限公司 一种用于标注文本的数据处理系统
CN114119976A (zh) * 2021-11-30 2022-03-01 广州文远知行科技有限公司 语义分割模型训练、语义分割的方法、装置及相关设备
CN115495541A (zh) * 2022-11-18 2022-12-20 深译信息科技(珠海)有限公司 语料数据库、语料数据库的维护方法、装置、设备和介质
CN115879421A (zh) * 2023-02-16 2023-03-31 之江实验室 一种增强bart预训练任务的句子排序方法及装置

Families Citing this family (22)

Publication number Priority date Publication date Assignee Title
CN109766540B (zh) * 2018-12-10 2022-05-03 平安科技(深圳)有限公司 通用文本信息提取方法、装置、计算机设备和存储介质
CN111859977B (zh) * 2019-06-06 2024-06-07 北京嘀嘀无限科技发展有限公司 一种语义分析方法、装置、电子设备及存储介质
CN110413749B (zh) * 2019-07-03 2023-06-20 创新先进技术有限公司 确定标准问题的方法及装置
CN110502745B (zh) * 2019-07-18 2023-04-07 平安科技(深圳)有限公司 文本信息评价方法、装置、计算机设备和存储介质
CN110674633A (zh) * 2019-09-18 2020-01-10 平安科技(深圳)有限公司 文书评审的校对方法及装置、存储介质、电子设备
CN110737646A (zh) * 2019-10-21 2020-01-31 北京明略软件系统有限公司 数据标注方法、装置、设备及可读存储介质
CN110765778B (zh) * 2019-10-23 2023-08-29 北京锐安科技有限公司 一种标签实体处理方法、装置、计算机设备和存储介质
CN110826313A (zh) * 2019-10-31 2020-02-21 北京声智科技有限公司 一种信息提取方法、电子设备及计算机可读存储介质
CN111144127B (zh) * 2019-12-25 2023-07-25 科大讯飞股份有限公司 文本语义识别方法及其模型的获取方法及相关装置
CN111159377B (zh) * 2019-12-30 2023-06-30 深圳追一科技有限公司 属性召回模型训练方法、装置、电子设备以及存储介质
CN111368024A (zh) * 2020-02-14 2020-07-03 深圳壹账通智能科技有限公司 文本语义相似度的分析方法、装置及计算机设备
CN111582497A (zh) * 2020-04-27 2020-08-25 平安医疗健康管理股份有限公司 训练文件生成及评价方法、装置、计算机系统及存储介质
CN111783424B (zh) * 2020-06-17 2024-02-13 泰康保险集团股份有限公司 一种文本分句方法和装置
CN114065751A (zh) * 2020-08-07 2022-02-18 阿里巴巴集团控股有限公司 申报要素抽取方法及装置和抽取模型生成方法及装置
CN112528671A (zh) * 2020-12-02 2021-03-19 北京小米松果电子有限公司 语义分析方法、装置以及存储介质
CN112579444B (zh) * 2020-12-10 2024-05-07 华南理工大学 基于文本认知的自动分析建模方法、系统、装置及介质
CN112733551A (zh) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 文本分析方法、装置、电子设备及可读存储介质
CN113051910B (zh) * 2021-03-19 2023-05-26 上海森宇文化传媒股份有限公司 一种用于预测人物角色情绪的方法和装置
CN113157949A (zh) * 2021-04-27 2021-07-23 中国平安人寿保险股份有限公司 事件信息的抽取方法、装置、计算机设备及存储介质
CN113361644B (zh) * 2021-07-03 2024-05-14 上海理想信息产业(集团)有限公司 模型训练方法、电信业务特征信息提取方法、装置及设备
CN113609847B (zh) * 2021-08-10 2023-10-27 北京百度网讯科技有限公司 信息抽取方法、装置、电子设备及存储介质
CN115563951B (zh) * 2022-10-14 2024-07-05 美的集团(上海)有限公司 文本序列的标注方法、装置、存储介质和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9697192B1 (en) * 2013-06-28 2017-07-04 Digital Reasoning Systems, Inc. Systems and methods for construction, maintenance, and improvement of knowledge representations
CN107766320A (zh) * 2016-08-23 2018-03-06 中兴通讯股份有限公司 一种中文代词消解模型建立方法及装置
CN107894981A (zh) * 2017-12-13 2018-04-10 武汉烽火普天信息技术有限公司 一种案件语义要素的自动抽取方法
CN109766540A (zh) * 2018-12-10 2019-05-17 平安科技(深圳)有限公司 通用文本信息提取方法、装置、计算机设备和存储介质

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN104794169B (zh) * 2015-03-30 2018-11-20 明博教育科技有限公司 一种基于序列标注模型的学科术语抽取方法及系统
CN105930411A (zh) * 2016-04-18 2016-09-07 苏州大学 一种分类器训练方法、分类器和情感分类系统
CN108268875B (zh) * 2016-12-30 2020-12-08 广东精点数据科技股份有限公司 一种基于数据平滑的图像语义自动标注方法及装置
CN107423286A (zh) * 2017-07-05 2017-12-01 华中师范大学 初等数学代数型题自动解答的方法与系统
CN107451295B (zh) * 2017-08-17 2020-06-30 四川长虹电器股份有限公司 一种基于文法网络获取深度学习训练数据的方法
CN108255602B (zh) * 2017-11-01 2020-11-27 平安普惠企业管理有限公司 任务组合方法及终端设备
CN108492118B (zh) * 2018-04-03 2020-09-29 电子科技大学 汽车售后服务质量评价回访文本数据的两阶段抽取方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, WEI ET AL.: "Design and Implementation of Geographical Event Information Extraction based on Gate Framework", MODERN SURVEYING AND MAPPING, vol. 38, no. 4, 31 July 2015 (2015-07-31), ISSN: 1672-4097 *

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111754352A (zh) * 2020-06-22 2020-10-09 平安资产管理有限责任公司 一种观点语句正确性的判断方法、装置、设备和存储介质
CN111797629A (zh) * 2020-06-23 2020-10-20 平安医疗健康管理股份有限公司 医疗文本数据的处理方法、装置、计算机设备和存储介质
CN111814487B (zh) * 2020-07-17 2024-05-31 科大讯飞股份有限公司 一种语义理解方法、装置、设备及存储介质
CN111814487A (zh) * 2020-07-17 2020-10-23 科大讯飞股份有限公司 一种语义理解方法、装置、设备及存储介质
CN111931515A (zh) * 2020-08-10 2020-11-13 鼎富智能科技有限公司 基于合同纠纷判决书的合同条款效力分析方法及装置
CN111966807A (zh) * 2020-08-18 2020-11-20 中国银行股份有限公司 问答系统的文本处理方法及装置
CN112016451A (zh) * 2020-08-27 2020-12-01 贵州师范大学 一种用于迁移学习的训练样本标注成本削减方法
CN112036179A (zh) * 2020-08-28 2020-12-04 南京航空航天大学 基于文本分类与语义框架的电力预案信息抽取方法
CN112036179B (zh) * 2020-08-28 2024-03-26 南京航空航天大学 基于文本分类与语义框架的电力预案信息抽取方法
CN112069319A (zh) * 2020-09-10 2020-12-11 杭州中奥科技有限公司 文本抽取方法、装置、计算机设备和可读存储介质
CN112069319B (zh) * 2020-09-10 2024-03-22 杭州中奥科技有限公司 文本抽取方法、装置、计算机设备和可读存储介质
CN112307908A (zh) * 2020-10-15 2021-02-02 武汉科技大学城市学院 一种视频语义提取方法及装置
CN112307908B (zh) * 2020-10-15 2022-07-26 武汉科技大学城市学院 一种视频语义提取方法及装置
CN112269884B (zh) * 2020-11-13 2024-03-05 北京百度网讯科技有限公司 信息抽取方法、装置、设备及存储介质
CN112269884A (zh) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 信息抽取方法、装置、设备及存储介质
CN112329427B (zh) * 2020-11-26 2023-08-08 北京百度网讯科技有限公司 短信样本的获取方法和装置
CN112329427A (zh) * 2020-11-26 2021-02-05 北京百度网讯科技有限公司 短信样本的获取方法和装置
CN112507702B (zh) * 2020-12-03 2023-08-22 北京百度网讯科技有限公司 文本信息的抽取方法、装置、电子设备及存储介质
CN112507702A (zh) * 2020-12-03 2021-03-16 北京百度网讯科技有限公司 文本信息的抽取方法、装置、电子设备及存储介质
CN112560497B (zh) * 2020-12-10 2024-02-13 中国科学技术大学 语义理解方法、装置、电子设备和存储介质
CN112560497A (zh) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 语义理解方法、装置、电子设备和存储介质
CN113823271A (zh) * 2020-12-18 2021-12-21 京东科技控股股份有限公司 语音分类模型的训练方法、装置、计算机设备及存储介质
CN112613501A (zh) * 2020-12-21 2021-04-06 深圳壹账通智能科技有限公司 信息审核分类模型的构建方法和信息审核方法
CN112699688A (zh) * 2021-01-08 2021-04-23 北京理工大学 一种篇章关系可控的文本生成方法和系统
CN113268601A (zh) * 2021-03-02 2021-08-17 安徽淘云科技股份有限公司 信息提取方法、阅读理解模型训练方法及相关装置
CN113268601B (zh) * 2021-03-02 2024-05-14 安徽淘云科技股份有限公司 信息提取方法、阅读理解模型训练方法及相关装置
CN113111650A (zh) * 2021-04-16 2021-07-13 中国工商银行股份有限公司 文本处理方法、装置、系统及存储介质
CN113222149A (zh) * 2021-05-31 2021-08-06 联仁健康医疗大数据科技股份有限公司 模型训练方法、装置、设备和存储介质
CN113222149B (zh) * 2021-05-31 2024-04-26 联仁健康医疗大数据科技股份有限公司 模型训练方法、装置、设备和存储介质
CN113408296A (zh) * 2021-06-24 2021-09-17 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN113408296B (zh) * 2021-06-24 2024-02-13 东软集团股份有限公司 一种文本信息提取方法、装置及设备
CN113487617A (zh) * 2021-07-26 2021-10-08 推想医疗科技股份有限公司 数据处理方法、装置、电子设备以及存储介质
CN113808758A (zh) * 2021-08-31 2021-12-17 联仁健康医疗大数据科技股份有限公司 一种检验数据标准化的方法、装置、电子设备和存储介质
CN113808758B (zh) * 2021-08-31 2024-06-07 联仁健康医疗大数据科技股份有限公司 一种检验数据标准化的方法、装置、电子设备和存储介质
CN113806492B (zh) * 2021-09-30 2024-02-06 中国平安人寿保险股份有限公司 基于语义识别的记录生成方法、装置、设备及存储介质
CN113806492A (zh) * 2021-09-30 2021-12-17 中国平安人寿保险股份有限公司 基于语义识别的记录生成方法、装置、设备及存储介质
CN114020877B (zh) * 2021-11-18 2024-05-10 中科雨辰科技有限公司 一种用于标注文本的数据处理系统
CN114020877A (zh) * 2021-11-18 2022-02-08 中科雨辰科技有限公司 一种用于标注文本的数据处理系统
CN114119976A (zh) * 2021-11-30 2022-03-01 广州文远知行科技有限公司 语义分割模型训练、语义分割的方法、装置及相关设备
CN114119976B (zh) * 2021-11-30 2024-05-14 广州文远知行科技有限公司 语义分割模型训练、语义分割的方法、装置及相关设备
CN115495541A (zh) * 2022-11-18 2022-12-20 深译信息科技(珠海)有限公司 语料数据库、语料数据库的维护方法、装置、设备和介质
CN115879421B (zh) * 2023-02-16 2024-01-09 之江实验室 一种增强bart预训练任务的句子排序方法及装置
CN115879421A (zh) * 2023-02-16 2023-03-31 之江实验室 一种增强bart预训练任务的句子排序方法及装置

Also Published As

Publication number Publication date
CN109766540A (zh) 2019-05-17
CN109766540B (zh) 2022-05-03

Similar Documents

Publication Publication Date Title
WO2020119075A1 (fr) General text information extraction method and apparatus, computer device, and storage medium
US20230196127A1 (en) Method and device for constructing legal knowledge graph based on joint entity and relation extraction
JP7228662B2 (ja) イベント抽出方法、装置、電子機器及び記憶媒体
TWI636452B (zh) 語音識別方法及系統
US20230142217A1 (en) Model Training Method, Electronic Device, And Storage Medium
JP6909832B2 (ja) オーディオにおける重要語句を認識するための方法、装置、機器及び媒体
US8903707B2 (en) Predicting pronouns of dropped pronoun style languages for natural language translation
Orosz et al. PurePos 2.0: a hybrid tool for morphological disambiguation
CN111931517B (zh) 文本翻译方法、装置、电子设备以及存储介质
US10140272B2 (en) Dynamic context aware abbreviation detection and annotation
US20180011830A1 (en) Annotation Assisting Apparatus and Computer Program Therefor
US20180189284A1 (en) System and method for dynamically creating a domain ontology
CN109460552B (zh) 基于规则和语料库的汉语语病自动检测方法及设备
TW202020691A (zh) 特徵詞的確定方法、裝置和伺服器
WO2024207587A1 (fr) Question-answering scoring method, question-answering scoring apparatus, electronic device and storage medium
CN109472022B (zh) 基于机器学习的新词识别方法及终端设备
WO2021129123A1 (fr) Corpus data processing method and apparatus, server and storage medium
WO2020199600A1 (fr) Sentiment polarity analysis method and related device
WO2021068684A1 (fr) Method and apparatus for automatically generating a document directory, computer device and storage medium
TW201403354A (zh) 以資料降維法及非線性算則建構中文文本可讀性數學模型之系統及其方法
Rodrigues et al. Advanced applications of natural language processing for performing information extraction
WO2023184633A1 (fr) Chinese spelling error correction method and system, storage medium and terminal
CN111160041A (zh) 语义理解方法、装置、电子设备和存储介质
CN113282762A (zh) 知识图谱构建方法、装置、电子设备和存储介质
CN108268443B (zh) 确定话题点转移以及获取回复文本的方法、装置

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.10.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19896096

Country of ref document: EP

Kind code of ref document: A1