WO2021139239A1 - Organization entity extraction method, system and device based on multiple training targets - Google Patents

Organization entity extraction method, system and device based on multiple training targets

Info

Publication number
WO2021139239A1
WO2021139239A1 PCT/CN2020/118331 CN2020118331W WO2021139239A1 WO 2021139239 A1 WO2021139239 A1 WO 2021139239A1 CN 2020118331 W CN2020118331 W CN 2020118331W WO 2021139239 A1 WO2021139239 A1 WO 2021139239A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector feature
feature set
training sample
text information
entity
Prior art date
Application number
PCT/CN2020/118331
Other languages
French (fr)
Chinese (zh)
Inventor
柴玲
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021139239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data; database structures therefor; file system structures therefor
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the technical field of information extraction, and in particular to a method, system, device and storage medium for extracting institutional entities based on multiple training targets.
  • This application provides a method, system, electronic device, and computer storage medium for extracting institutional entities based on multiple training targets. Its main purpose is to solve the low efficiency and poor quality of existing institutional entity extraction methods.
  • To achieve the above objective, this application provides a method for extracting institutional entities based on multiple training targets, the method including the following steps:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch and a second main branch, where the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • this application also provides an institutional entity extraction system based on multiple training targets, the system including:
  • a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;
  • a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts a first vector feature set of the input text information, and a second main branch, which extracts a second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • a model application unit, which performs sequence labeling on the acquired text information to be detected through the named entity model;
  • an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • In addition, the present application also provides an electronic device comprising a memory, a processor, and a multi-training-target-based institutional entity extraction program that is stored in the memory and can run on the processor; when the program is executed by the processor, the following steps are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch and a second main branch, where the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • The present application also provides a computer-readable storage medium storing a multi-training-target-based institutional entity extraction program; when the program is executed by a processor, the same steps are implemented: acquiring and labeling the training sample set, training the named entity model with the first and second main branches described above, performing sequence labeling on the text information to be detected, and extracting the relevant institutional entities according to the sequence labeling.
  • This application can effectively avoid error propagation. In addition, the multi-training-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for extracting institutional entities based on multiple training targets according to an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal logic of an organization entity extraction program based on multiple training targets according to an embodiment of the present application.
  • The technical solution of this application can be applied to the fields of artificial intelligence, blockchain, and/or big data technology; the data involved, such as the training sample set, can be stored in a database or in a blockchain, for example via distributed blockchain storage, which this application does not limit.
  • Traditional entity extraction mainly uses one of two approaches. The first is a staged training model: a named entity extraction model is first trained to identify all institutional entities, e.g. "Southern Medical University" is tagged [B-ORG, I-ORG, I-ORG, I-ORG, E-ORG]; a text classification model then decides whether the entity belongs to the experience type "work experience (JOB)", "education experience (EDU)", or "short-term study experience (STU)". The obvious drawback of this solution is that the errors of the first model are passed to the second model, which amplifies them.
  • Another common solution is to train an end-to-end named entity extraction model, such as LSTM+CRF, with a unified tag for each entity: "Southern Medical University" is directly tagged [B-EDU, I-EDU, I-EDU, I-EDU, I-EDU, E-EDU] and "Sun Yat-sen University" is directly tagged [B-STU, I-STU, I-STU, E-STU]. This avoids the error propagation of the traditional separately trained scheme (see the sketch below).
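  • As a minimal illustration of the two labeling schemes above (plain Python lists; the variable names are ours, not the patent's):

```python
# Staged scheme: the NER model first emits generic ORG tags for
# "Southern Medical University"; a second-stage text classifier then
# predicts the experience type, so its input already carries any NER errors.
staged_ner_tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG"]
staged_entity_type = "EDU"  # output of the separate text classification model

# Unified (end-to-end) scheme: one model emits type-specific tags directly,
# so there is no second stage whose errors could be amplified.
unified_tags = ["B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "E-EDU"]
```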
  • FIG. 1 shows the flow of the method for extracting institutional entities based on multiple training objectives provided in this application.
  • the method for extracting institutional entities based on multiple training targets includes:
  • S110: Obtain a training sample set, and label each training sample in the training sample set with a named entity.
  • It should be noted that a sample here is a piece of text containing an institutional entity; for example, a paragraph from a job resume, or a piece of text from a scholar's homepage on the Internet.
  • In labeling the training samples with named entities, this application uses the BIO labeling scheme, where B marks the beginning of an institutional entity, I marks the inside of the institutional entity itself, and O marks information in the sample that is unrelated to any institutional entity.
  • In addition, to enable the later multi-target training of the model, this application applies multiple types of annotation to each sample in the training sample set, at least the following four: Boundary-tag, End-tag, Type-tag, and unified-tag. Different tag types are labeled in different ways and serve different functions: the Boundary-tag type marks the boundaries of the institutional entities in the sample; the End-tag type marks the end position of each institutional entity; the Type-tag type marks the entity type of the institutional entity, such as graduating institution, workplace, or internship place; and the unified-tag type is the final target label (see the annotated example below). After all four types of annotation are completed, the sample is saved to the training sample set.
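  • The four annotation layers can be pictured side by side. The sketch below reproduces, as Python lists, the four taggings of the sample "毕业于上海交通大学医学院" ("Graduated from Shanghai Jiaotong University School of Medicine") given later in this description:

```python
chars = list("毕业于上海交通大学医学院")  # 12 characters

# Boundary-tag: entity boundary kept at university granularity, 医学院 ignored
boundary_tag = ["O", "O", "O", "B", "I", "I", "I", "I", "I", "O", "O", "O"]
# End-tag: 1 only at the entity's end position (the 学 of 大学)
end_tag = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
# Type-tag: experience type of the entity span
type_tag = ["O", "O", "O", "EDU", "EDU", "EDU", "EDU", "EDU", "EDU", "O", "O", "O"]
# unified-tag: the final target label combining boundary and type
unified_tag = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU",
               "I-EDU", "O", "O", "O"]

assert len(chars) == len(boundary_tag) == len(end_tag) \
       == len(type_tag) == len(unified_tag)
```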
  • the training sample set can be stored in a node of the blockchain.
  • S120: Use the labeled training sample set to train a preset named entity model until the named entity model reaches a preset accuracy. The named entity model includes a first main branch and a second main branch: the first main branch extracts the first vector feature set of the input text information, the second main branch extracts the second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set.
  • It should be noted that the named entity model is a newly designed sequence labeling model that combines the training of multiple targets. Specifically, it contains two main branches: the first main branch contains a first neural network LSTM1, through which it extracts the first vector feature set of the input text information (a training sample, or later the text information to be detected); the second main branch contains a second neural network LSTM2, through which it extracts the second vector feature set of the input text information.
  • The first main branch splits into a first branch and a second branch. The first branch contains a first predictive classifier (predictor), which marks the entity boundaries of the first vector feature set according to the Boundary-tag annotation type; the second branch contains a second predictive classifier, which marks the entity end positions of the first vector feature set according to the End-tag annotation type.
  • Specifically, after feature extraction by LSTM1, the input text information yields a corresponding first vector feature set, denoted h1, which is passed simultaneously to the first branch and the second branch: the first branch, with the first predictive classifier, outputs the entity boundary labeling y_boundary_tag, corresponding to the Boundary-tag annotation; the second branch, with the second predictive classifier, outputs the end position labeling y_end_tag, corresponding to the End-tag annotation.
  • For the second main branch, after feature extraction by LSTM2 the text information yields a corresponding second vector feature set, denoted h2; the second main branch then splits into a third branch and a final output branch. The third branch contains a third predictive classifier, which marks the entity types according to the Type-tag annotation type; the final output branch contains a total prediction classifier (SC-BG), which produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag annotation type.
  • Specifically, h2 is passed simultaneously to the third branch and the final output branch: the third branch, with the third predictive classifier, outputs the entity type labeling y_type_tag, corresponding to the Type-tag annotation; the final output branch, with the total prediction classifier (SC-BG), produces the final labeling y_unified_tag of the input text information from the first vector feature set h1 and the second vector feature set h2, corresponding to the unified-tag annotation (a code sketch follows below).
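  • To make the two-branch, four-head structure concrete, the following is a minimal PyTorch-style sketch. It is an illustrative reconstruction, not the patent's reference implementation: all module names and dimensions are assumptions, and the SC gate is one plausible reading of "relating the current feature to the previous moment's feature". The heads return raw scores; Softmax is applied by the loss or at prediction time.

```python
import torch
import torch.nn as nn

class MultiTargetNER(nn.Module):
    """Sketch of the two-branch named entity model: LSTM1 feeds the
    boundary-tag and end-tag heads; LSTM2 feeds the type-tag head and,
    via an SC-style gate, the final unified-tag head (here simplified
    to a linear layer over the concatenation [h1; h3])."""

    def __init__(self, vocab, emb, hidden, n_boundary, n_type, n_unified):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm1 = nn.LSTM(emb, hidden, batch_first=True)  # first main branch
        self.lstm2 = nn.LSTM(emb, hidden, batch_first=True)  # second main branch
        self.boundary_head = nn.Linear(hidden, n_boundary)   # W_b: Boundary-tag
        self.end_head = nn.Linear(hidden, 2)                 # W_e: End-tag (0/1)
        self.type_head = nn.Linear(hidden, n_type)           # W_t: Type-tag
        self.sc_gate = nn.Linear(2 * hidden, hidden)         # SC component (assumed form)
        self.unified_head = nn.Linear(2 * hidden, n_unified) # BG, simplified

    def forward(self, tokens):
        x = self.embed(tokens)
        h1, _ = self.lstm1(x)  # first vector feature set
        h2, _ = self.lstm2(x)  # second vector feature set
        # SC: sigmoid-gated mix of the current h2 step with the previous output
        outs, prev = [], h2.new_zeros(h2.size(0), h2.size(2))
        for t in range(h2.size(1)):
            g = torch.sigmoid(self.sc_gate(torch.cat([h2[:, t], prev], dim=-1)))
            prev = g * h2[:, t] + (1 - g) * prev
            outs.append(prev)
        h3 = torch.stack(outs, dim=1)  # optimized second vector feature set
        return {
            "boundary": self.boundary_head(h1),
            "end": self.end_head(h1),
            "type": self.type_head(h2),
            "unified": self.unified_head(torch.cat([h1, h3], dim=-1)),
        }
```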
  • LSTM (covering LSTM1 and LSTM2) is an existing, commonly used neural network model; its internal structure is prior art and is not repeated here. Such a network yields a set of vector features (h1 or h2) for the input text information, but an LSTM must be used together with a predictive classifier. Once the LSTMs and the predictive classifiers connected to them (the first, second, and third predictive classifiers and the total prediction classifier) have been trained on the training sample set, the feature vectors output by the LSTMs acquire the required association with each predictive classifier. This association is expressed by the trainable model parameters W_1; when W_1 reaches the preset accuracy, the feature vectors output by the LSTMs are the required feature vectors, and the named entity model attains the required preset accuracy. In this way, the vector features h1 extracted by the named entity model acquire the required connection with Boundary-tag and End-tag, and the vector features h2 acquire the required connection with Type-tag.
  • An activation function is set in the first predictive classifier, the second predictive classifier, and the third predictive classifier; after the first vector feature set or the second vector feature set passes through the activation function, the labeling of that feature set is realized. The calculation performed by the activation function is

$$\hat{y} = \mathrm{Softmax}(W_1 h)$$

where W_1 is a parameter of the named entity model that needs to be trained and is associated with the tag type of the prediction classifier, h is the first or second vector feature set, and ŷ is the output labeling result. The Softmax function is a normalization function that maps its input values into the interval (0, 1):

$$f(i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

where [x_1, x_2, ..., x_i, ..., x_n] is the input array and f(i) is the softmax value of its i-th element.
  • Specifically, the LSTM layer in the first main branch at the front end of the named entity model is denoted LSTM1; it outputs the first vector feature set, denoted h1, whose output vector at time t is denoted h1_t. After the Softmax activation function, h1_t is used to predict the boundary-tag label. For example, the corresponding output for "毕业于上海交通大学医学院" ("Graduated from Shanghai Jiaotong University School of Medicine") should be 毕(O) 业(O) 于(O) 上(B) 海(I) 交(I) 通(I) 大(I) 学(I) 医(O) 学(O) 院(O). The output is denoted y_boundary_tag = Softmax(W_b · h1_t), where W_b is a parameter of the first predictive classifier that needs to be trained, and Softmax is the normalization function defined above.
  • The second predictive classifier is used to predict the end-tag label, i.e., whether each position in the text is 0 (not an entity end position) or 1 (an entity end position). For the same example, the corresponding output is 毕(0) 业(0) 于(0) 上(0) 海(0) 交(0) 通(0) 大(0) 学(1) 医(0) 学(0) 院(0). The output is denoted y_end_tag = Softmax(W_e · h1_t), where W_e is a parameter of the second predictive classifier that needs to be trained. In this way, the output h1 of LSTM1 in the first main branch learns both the boundary-tag and end-tag classification features.
  • Similarly, the LSTM layer in the second main branch at the front end of the named entity model is denoted LSTM2; it outputs the second vector feature set, denoted h2, whose output vector at time t is denoted h2_t. The softmax function is used to predict the type_tag label, i.e., the corresponding classification type such as JOB (work unit) or EDU (education experience unit). For the same example, the corresponding output should be 毕(O) 业(O) 于(O) 上(EDU) 海(EDU) 交(EDU) 通(EDU) 大(EDU) 学(EDU) 医(O) 学(O) 院(O), denoted y_type_tag.
  • The final output branch corresponds to the total prediction classifier SC-BG, built from the prediction components SC (sentiment consistency) and BG (boundary guide). It combines the input data and the internal hidden features to obtain the final prediction result, corresponding to the unified-tag classification label. For the example "毕业于上海交通大学医学院", the corresponding output is 毕(O) 业(O) 于(O) 上(B-EDU) 海(I-EDU) 交(I-EDU) 通(I-EDU) 大(I-EDU) 学(I-EDU) 医(O) 学(O) 院(O), which is the final target label. The final output is denoted y_unified_tag.
  • The total prediction classifier is provided with a first prediction component SC and a second prediction component BG. The first prediction component SC optimizes the second vector feature set, strengthening the relationship between the vector feature at the current moment and the feature at the previous moment; the second prediction component BG marks the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag annotation type.
  • For the SC component, the input is h2 and the output is a set of vector features denoted h3, whose output vector at time t is denoted h3_t; the computation uses the sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

which maps its argument into the interval (0, 1). The BG component contains a transition matrix W_tr from boundary-tag labels to unified-tag labels, where B_i ranges over the unified-tag tag set {B-EDU, I-EDU, B-STU, I-STU, O, ...} and z_b is the intermediate quantity in the first predictive classifier (see the description of the first predictive classifier above), from which the predicted transition weights are calculated.
  • As training proceeds, the trainable parameters W_1 (including W_b, W_e, W_t, and W_tr) change accordingly and move closer and closer to their optimal values; when training converges, W_1 generally lies near the optimum (see the training sketch below).
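  • The joint optimization over the four targets can be summarized in a short sketch: one loss per output head, summed into a single objective. This is a minimal illustration under assumptions the patent does not state (plain cross-entropy losses with equal weights), reusing the MultiTargetNER sketch above:

```python
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """One joint update over the four training targets; the per-head
    scores are raw logits, so cross_entropy applies Softmax internally."""
    out = model(batch["tokens"])
    loss = (
        F.cross_entropy(out["boundary"].transpose(1, 2), batch["boundary_tag"])
        + F.cross_entropy(out["end"].transpose(1, 2), batch["end_tag"])
        + F.cross_entropy(out["type"].transpose(1, 2), batch["type_tag"])
        + F.cross_entropy(out["unified"].transpose(1, 2), batch["unified_tag"])
    )
    optimizer.zero_grad()
    loss.backward()   # W_b, W_e, W_t, W_tr and both LSTMs are updated jointly
    optimizer.step()
    return loss.item()
```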
  • After training is completed, the named entity model can be used to extract institutional entity information from the text information to be processed.
  • S130: Obtain the text information to be detected, and perform sequence labeling on it through the named entity model.
  • Specifically, the text information to be detected that relates to the person of interest, such as personal resume information or personal homepage information, can be obtained from the Internet or from a database.
  • After the text information to be detected passes through the trained named entity model, corresponding annotation sequences are output at the four output terminals: y_boundary_tag, y_end_tag, y_type_tag, and y_unified_tag. Because y_unified_tag already contains the sequence feature information of y_boundary_tag, y_end_tag, and y_type_tag, in practical applications it is only necessary to obtain the y_unified_tag sequence annotation of the text information to be detected.
  • S140: Extract the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • Specifically, the relevant institutional entities of the person of interest are extracted from the text information to be detected according to the y_unified_tag sequence annotation. Because the y_unified_tag labels incorporate the features of the y_end_tag labels, the end position of each required institutional entity can be determined accurately, avoiding imprecise entity localization; and because they incorporate the y_type_tag features, the category of each institutional entity, whether "work experience", "education experience", or "short-term study experience", can also be determined accurately from the y_unified_tag annotation alone. A decoding sketch follows below.
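  • As an illustration of this extraction step, a minimal decoder (our own helper, not from the patent) that turns a y_unified_tag sequence into typed entity spans might look like this; it accepts both the I-only and the E-terminated tag variants that appear in the examples above:

```python
def decode_unified_tags(chars, tags):
    """Collect (entity_text, entity_type) pairs from a unified-tag sequence
    using the B-XXX / I-XXX / E-XXX / O convention."""
    entities, start, etype = [], None, None

    def close(end):
        nonlocal start, etype
        if start is not None:
            entities.append(("".join(chars[start:end]), etype))
        start, etype = None, None

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            close(i)                    # close any span still open
            start, etype = i, tag[2:]
        elif tag.startswith(("I-", "E-")) and start is not None and tag[2:] == etype:
            if tag.startswith("E-"):
                close(i + 1)            # explicit end marker
        else:
            close(i)                    # O tag or type mismatch ends the span
    close(len(tags))
    return entities

chars = list("毕业于上海交通大学医学院")
tags = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU",
        "I-EDU", "O", "O", "O"]
print(decode_unified_tags(chars, tags))  # [('上海交通大学', 'EDU')]
```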
  • In summary, compared with the traditional separately trained named entity extraction model and text classification model, the method for extracting institutional entities based on multiple training targets proposed in this application, by designing a named entity model trained on multiple targets, effectively avoids error propagation. Moreover, the multi-training-target named entity model strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
  • In addition, this application also provides an institutional entity extraction system based on multiple training targets, the system including:
  • a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;
  • a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • a model application unit, which obtains the text information to be detected and performs sequence labeling on it through the named entity model;
  • an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • In addition, this application also provides an electronic device 70.
  • Refer to FIG. 2, which is a schematic structural diagram of a preferred embodiment of the electronic device 70 provided by this application.
  • In this embodiment, the electronic device 70 may be a terminal device with a computing function, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
  • the electronic device 70 includes a processor 71 and a memory 72.
  • the memory 72 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like.
  • the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70.
  • In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, such as a plug-in hard disk equipped on the electronic device 70, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), and the like.
  • the readable storage medium of the memory 72 is generally used to store the multi-training target-based institutional entity extraction program 73 installed in the electronic device 70.
  • the memory 72 can also be used to temporarily store data that has been output or will be output.
  • The processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 72 or to process data, for example to run the multi-training-target-based institutional entity extraction program 73.
  • In some embodiments, the electronic device 70 is a terminal device such as a smartphone, a tablet computer, or a portable computer; in other embodiments, the electronic device 70 may be a server.
  • FIG. 2 only shows the electronic device 70 with the components 71-73, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the electronic device 70 may also include a user interface.
  • The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or earphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 70 may also include a display, and the display may also be referred to as a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display is used for displaying information processed in the electronic device 70 and for displaying a visualized user interface.
  • the electronic device 70 may also include a touch sensor.
  • the area provided by the touch sensor for the user to perform touch operations is called the touch area.
  • the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like.
  • the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.
  • the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor.
  • the display and the touch sensor are stacked to form a touch display screen. The device detects the touch operation triggered by the user based on the touch screen.
  • the electronic device 70 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.
  • The memory 72, as a computer storage medium, may include an operating system and the multi-training-target-based entity extraction program 73; when the processor 71 executes the program 73 stored in the memory 72, the following steps are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • FIG. 3 is an internal logic diagram of an organization entity extraction program based on multiple training targets according to an embodiment of the present application.
  • The multi-training-target-based organization entity extraction program 73 may also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function.
  • Referring to FIG. 3, a program module diagram of a preferred embodiment of the multi-training-target-based organization entity extraction program 73 of FIG. 2, the program 73 may be divided into: a sample labeling module 74, a model training module 75, a model application module 76, and an institution entity extraction module 77. The functions or operation steps implemented by the modules 74 to 77 are similar to those described above and are not detailed here. Illustratively:
  • the sample labeling module 74 is used to obtain a training sample set and label each training sample in the training sample set with a named entity;
  • The model training module 75 is configured to train a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • the model application module 76 is used to obtain the text information to be detected, and perform sequence labeling on the text information to be detected through the named entity model;
  • the institution entity extraction module 77 is configured to extract relevant institution entities in the text information to be detected according to the sequence label.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores an organization entity extraction program 73 based on multiple training targets.
  • When the multi-training-target-based organization entity extraction program 73 is executed by a processor, the following operations are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for extracting institutional entities based on multiple training targets. The method comprises: acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set; using the labeled training sample set to train a preset named entity model, such that the named entity model reaches a preset precision; performing, by means of the named entity model, sequence labeling on acquired text information to be detected; and extracting, according to the sequence labeling, the related institutional entities from the text information to be detected. The present invention further relates to blockchain technology: the training sample set can be stored in a blockchain. The method can effectively solve the problems of low efficiency and poor quality of existing institutional entity extraction methods.

Description

Organization entity extraction method, system and device based on multiple training targets

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 28, 2020, with application number 202010738252.X and the invention title "Method, System and Device for Extracting Institutional Entities Based on Multiple Training Targets", the entire content of which is incorporated into this application by reference.
Technical Field

This application relates to the technical field of information extraction, and in particular to a method, system, device, and storage medium for extracting institutional entities based on multiple training targets.

Background

At present, many academic scholar libraries, such as AMINER and ORCID, provide scholar information so that users can track the research direction and progress of a scholar or the scholar's team. For example, some expert team projects are deeply engaged in medical research and are committed to building an expert database for the medical field and constructing a complete expert knowledge graph.

However, in the construction of an expert knowledge graph, establishing the network of relationships between experts and institutions is a valuable but difficult task, because an expert is a self-selecting actor who moves between institutions; for example, expert A may have studied for a doctorate at institution A, worked at institution B, and pursued further study at institution C in between. Yet the common scholar libraries today (such as AMINER and ORCID) generally provide only a scholar's current institution. In fact, the inventor realized that a scholar's complete scientific research portrait is closely related to the institutions the scholar has been affiliated with.

Obviously, it is unrealistic to collect by hand the institutions involved in the education, work, and further-study experiences of hundreds of thousands of experts (taking Chinese medical scholars as an example). The inventor therefore realized that the large blocks of text on a scholar's homepage can be obtained from the Internet, and the problem becomes how to use artificial intelligence algorithms to extract structured knowledge about the scholar from this mixed text information.

For example, if the profile field of a scholar on the Internet reads "After graduating in June 1990, he went to the First Affiliated Hospital of Guangzhou Medical College to work in the Department of Oncology and Hematology, and obtained a doctorate in clinical medicine from Southern Medical University in June 2008. From December 2008 to May 2009, he studied at the Sun Yat-sen University Cancer Center.", then three institutional entities need to be extracted, namely "First Affiliated Hospital of Guangzhou Medical College", "Southern Medical University", and "Sun Yat-sen University", and they must be identified as belonging to "work experience", "education experience", and "short-term study experience", respectively.

Based on the above problems, an efficient and high-quality method for extracting institutional entities is urgently needed.
Summary of the Invention

This application provides a method, system, electronic device, and computer storage medium for extracting institutional entities based on multiple training targets, whose main purpose is to solve the low efficiency and poor quality of existing institutional entity extraction methods.

To achieve the above objective, this application provides a method for extracting institutional entities based on multiple training targets, the method including the following steps:

acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy, where the named entity model includes a first main branch and a second main branch: the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the acquired text information to be detected through the named entity model;

extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.

In addition, this application also provides an institutional entity extraction system based on multiple training targets, the system including:

a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;

a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts a first vector feature set of the input text information, and a second main branch, which extracts a second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, which performs sequence labeling on the acquired text information to be detected through the named entity model;

an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.

In addition, to achieve the above objective, this application also provides an electronic device, which includes a memory, a processor, and a multi-training-target-based institutional entity extraction program that is stored in the memory and can run on the processor; when the program is executed by the processor, the steps of the method described above are implemented: acquiring the training sample set and labeling it with named entities, training the named entity model with the first and second main branches described above, performing sequence labeling on the acquired text information to be detected, and extracting the relevant institutional entities according to the sequence labeling.

In addition, to achieve the above objective, this application also provides a computer-readable storage medium storing a multi-training-target-based institutional entity extraction program; when the program is executed by a processor, the same steps of the method described above are implemented.

This application can effectively avoid error propagation. In addition, the multi-training-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
Description of the Drawings

FIG. 1 is a flowchart of a preferred embodiment of the method for extracting institutional entities based on multiple training targets according to an embodiment of this application;

FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of this application;

FIG. 3 is a schematic diagram of the internal logic of the multi-training-target-based organization entity extraction program according to an embodiment of this application.

The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

In the following description, numerous specific details are set forth for purposes of explanation, in order to provide a comprehensive understanding of one or more embodiments. It is evident, however, that the embodiments can also be implemented without these specific details.
The technical solution of this application can be applied to the fields of artificial intelligence, blockchain, and/or big data technology; the data involved, such as the training sample set, can be stored in a database or in a blockchain, for example via distributed blockchain storage, which this application does not limit.

Before describing the embodiments of the method provided by this application, it should be noted that traditional entity extraction mainly uses one of two approaches. The first is a staged training model: a named entity extraction model is first trained to identify all institutional entities, e.g. "Southern Medical University" is tagged [B-ORG, I-ORG, I-ORG, I-ORG, E-ORG]; a text classification model then determines whether the entity belongs to the experience type "work experience (JOB)", "education experience (EDU)", or "short-term study experience (STU)". The obvious drawback of this solution is that the errors of the first model are passed to the second model, which amplifies them.

Another common solution is to train an end-to-end named entity extraction model, such as LSTM+CRF, with a unified tag for each entity: "Southern Medical University" is directly tagged [B-EDU, I-EDU, I-EDU, I-EDU, I-EDU, E-EDU] and "Sun Yat-sen University" is directly tagged [B-STU, I-STU, I-STU, E-STU]. This avoids the error propagation of the traditional separately trained scheme.

However, a simple LSTM+CRF named entity extraction model still cannot solve two problems peculiar to the fine-grained extraction of institutions from profiles. First, the same entity carries different labels in different contexts: "Shanghai Sixth People's Hospital" is the "education experience" institution of doctor A, the "work experience" institution of doctor B, and both the "education experience" and the "work experience" institution of doctor C, so capturing contextual information here is harder than in a general named entity extraction problem. Second, the boundary problem: to keep the input structured knowledge uniform, institutions are extracted at the granularity of independent units (universities, hospitals, etc.). For "Sun Yat-sen University Cancer Center", for example, we want the final result to recognize the level of "Sun Yat-sen University" while ignoring "Cancer Center", whereas "Beijing Cancer Center" is itself an independent entity. Obviously, the traditional end-to-end named entity extraction model cannot do this; hence a more efficient and higher-quality institutional entity extraction method is urgently needed.

Specific embodiments of this application are described in detail below with reference to the accompanying drawings.

Embodiment 1
为了说明本申请提供的基于多训练目标的机构实体抽取方法,图1示出了根据本申请提供的基于多训练目标的机构实体抽取方法的流程。In order to illustrate the method for extracting institutional entities based on multiple training targets provided in this application, FIG. 1 shows the flow of the method for extracting institutional entities based on multiple training objectives provided in this application.
如图1所示,本申请提供的基于多训练目标的机构实体抽取方法,包括:As shown in Figure 1, the method for extracting institutional entities based on multiple training targets provided by this application includes:
S110:获取训练样本集,并对该训练样本集内的各训练样本进行命名实体标注。S110: Obtain a training sample set, and label each training sample in the training sample set with a named entity.
需要说明的是,此处的样本即为一段包含机构实体的一段文字信息,例如,可以是入职简历中的一段话,也可以是网络中的学者主页上的一段文字信息。It should be noted that the sample here is a piece of text information that contains the entity of the institution. For example, it can be a paragraph from a job resume or a piece of text information on the homepage of a scholar on the Internet.
具体地,在对该训练样本集内的各训练样本进行命名实体标注的过程中,本申请使用的命名实体标注方法为BIO标注方式,其中,B用于标注机构实体的开头,I用于标注机构实体本身,O用于标注样本中与机构实体不相关的信息。Specifically, in the process of labeling each training sample in the training sample set, the named entity labeling method used in this application is the BIO labeling method, where B is used to label the beginning of the institutional entity, and I is used to label The institutional entity itself, O is used to mark information in the sample that is not related to the institutional entity.
此外,为实现后期模型的多目标训练,本申请需要对训练样本集中的每一个样本进行多种类型的标注,至少包括四种类型,例如:Boundary-tag,End-tag,Type-tage,以及unified-tag四类标签,不同类型的标签的标注方式不同,当然对应的标注功能也不同,Boundary-tag类型主要用于标注样本中的机构实体边界;End-tag类型主要用于标注样本中的机构实体的结束位置;Type-tage类型主要用于标注机构实体的实体类型,比如,毕业院校、工作场所,实习场所等等。unified-tag类型为最终的目标标签,待四种类型的标注都完成后,将该样本保存至训练样本集。In addition, in order to achieve the multi-target training of the later model, this application needs to perform multiple types of annotations on each sample in the training sample set, including at least four types, such as: Boundary-tag, End-tag, Type-tage, and There are four types of unified-tag tags. Different types of tags have different labeling methods. Of course, the corresponding labeling functions are also different. The Boundary-tag type is mainly used to label the boundaries of the institutional entities in the sample; the End-tag type is mainly used to label the samples in the sample. The end position of the institutional entity; Type-tage is mainly used to mark the entity type of the institutional entity, such as graduate colleges, workplaces, internship places, and so on. The unified-tag type is the final target tag. After the four types of labeling are completed, the sample is saved to the training sample set.
另外,需要强调的是,为进一步保证上述该训练样本集内数据的私密和安全性,该训练样本集可以存储于区块链的节点中。In addition, it should be emphasized that, in order to further ensure the privacy and security of the data in the training sample set, the training sample set can be stored in a node of the blockchain.
S120:使用标注完成的训练样本集对预设的命名实体模型进行训练,以使该命名实体模型达到预设精度;其中,该命名实体模型包括第一主干路和第二主干路,该第一主干路用于提取输入文本信息的第一向量特征集,该第二主干路用于提取该输入文本信息的第二向量特征集;并且,该第二主干路还用于根据该第一向量特征集和该第二向量特征集对该输入文本信息进行序列标注。S120: Use the marked training sample set to train a preset named entity model so that the named entity model achieves a preset accuracy; wherein, the named entity model includes a first trunk road and a second trunk road. The main road is used to extract the first vector feature set of the input text information, the second main road is used to extract the second vector feature set of the input text information; and the second main road is also used to extract the first vector feature set The set and the second vector feature set perform sequence labeling on the input text information.
需要说明的是,命名实体模型为自行设计的一个新型的序列标注模型,该模型结合了 多个目标的训练环节;具体地,该命名实体模型包括两条主干路,该第一主干路内设置有第一神经网络模型LSTM1,该第一主干路通过该第一神经网络模型LSTM1提取该输入文本信息(对应训练样本或后期的待检测文本信息)的第一向量特征集;该第二主干路内设置有第二神经网络模型LSTM2,该第二主干路通过该第二神经网络模型LSTM1提取该输入文本信息的第二向量特征集。It should be noted that the named entity model is a new type of sequence labeling model designed by ourselves, which combines the training links of multiple targets; specifically, the named entity model includes two main roads, and the first main road is set There is a first neural network model LSTM1, the first main road extracts the first vector feature set of the input text information (corresponding to the training sample or the later text information to be detected) through the first neural network model LSTM1; the second main road A second neural network model LSTM2 is set inside, and the second main road extracts the second vector feature set of the input text information through the second neural network model LSTM1.
The first backbone branches into a first branch and a second branch. A first predictive classifier (predictor) is arranged in the first branch and labels the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and labels the end positions of the first vector feature set according to the End-tag tag type.
Specifically, after feature extraction by LSTM1, the input text information yields a corresponding first vector feature set, denoted h1, which is then passed simultaneously to the first branch and the second branch. The first branch corresponds to the first predictive classifier, which labels the entity boundaries y_boundary_tag from the first vector feature set h1, corresponding to the Boundary-tag labels; the second branch corresponds to the second predictive classifier, which labels the end positions y_end_tag from the first vector feature set h1, corresponding to the End-tag labels.
Here, for the second backbone, after feature extraction by LSTM2 the text information yields a corresponding second vector feature set, denoted h2, and once the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch. A third predictive classifier is arranged in the third branch and labels the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
Specifically, the second vector feature set h2 is passed simultaneously to the third branch and the final output branch. The third branch corresponds to the third predictive classifier (predictor), which labels the entity types y_type_tag from the second vector feature set h2 of the input text information, corresponding to the Type-tag labels; the final output branch corresponds to the total predictive classifier (SC-BG), which produces the final labeling y_unified_tag of the input text information from the first vector feature set h1 and the second vector feature set h2, corresponding to the unified-tag labels.
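To make the data flow concrete, a minimal PyTorch-style sketch of the two-backbone architecture described above is given below. The class name, embedding layer, hidden sizes, and linear predictors are assumptions made for illustration, since the application does not specify them, and the SC-BG total predictor is reduced here to a simple placeholder that consumes both h1 and h2.

```python
import torch
import torch.nn as nn

class MultiTargetNER(nn.Module):
    """Sketch of the two-backbone named entity model: LSTM1 feeds the boundary
    and end predictors, LSTM2 feeds the type predictor, and the total predictor
    (SC-BG, simplified here to a linear layer) consumes both h1 and h2."""

    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 n_boundary, n_end, n_type, n_unified):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # first backbone
        self.lstm2 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # second backbone
        self.boundary_clf = nn.Linear(hidden_dim, n_boundary)  # first predictive classifier
        self.end_clf = nn.Linear(hidden_dim, n_end)            # second predictive classifier
        self.type_clf = nn.Linear(hidden_dim, n_type)          # third predictive classifier
        # Simplified stand-in for the SC-BG total predictive classifier.
        self.unified_clf = nn.Linear(2 * hidden_dim, n_unified)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        h1, _ = self.lstm1(x)  # first vector feature set
        h2, _ = self.lstm2(x)  # second vector feature set
        return {
            "boundary": self.boundary_clf(h1).softmax(-1),  # y_boundary_tag
            "end":      self.end_clf(h1).softmax(-1),       # y_end_tag
            "type":     self.type_clf(h2).softmax(-1),      # y_type_tag
            "unified":  self.unified_clf(torch.cat([h1, h2], -1)).softmax(-1),
        }
```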
It should be noted that the LSTM (including LSTM1 and LSTM2) is an existing, commonly used neural network model whose internal structure belongs to the prior art and is not repeated here. Such a neural network model yields a set of vector features (h1 or h2) from the input text information; of course, each LSTM must be used together with its predictive classifiers. Once the LSTMs and the predictive classifiers connected to them (the first predictive classifier, the second predictive classifier, the third predictive classifier, and the total predictive classifier) have been trained on the training samples in the training sample set, the feature vectors output by each LSTM acquire the required association with the respective predictive classifiers. This association can be expressed by the model's trainable parameters W_1; when W_1 reaches the preset accuracy, the feature vectors output by the LSTMs are the required feature vectors.
It should be further noted that once the above model has been trained with all the training samples in the training sample set, the accuracy of the named entity model can reach the required preset accuracy. At that point, the vector features h1 extracted by the named entity model have acquired the required association with Boundary-tag and End-tag, and the vector features h2 extracted by the named entity model have acquired the required association with Type-tag. When the vector features h1 and h2 are used to recognize the text information to be detected, the labeling characteristics of Boundary-tag, End-tag, and Type-tag can be applied directly, so that y_boundary_tag, y_end_tag, and y_type_tag assist in improving the accuracy of y_unified_tag.
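A plausible multi-target training step for this model is sketched below. The application does not state how the four objectives are combined, so summing four equally weighted losses is an assumption made here for illustration; the sketch reuses the hypothetical MultiTargetNER class from the previous example.

```python
import torch.nn.functional as F

def training_step(model, optimizer, token_ids, targets):
    """One hypothetical multi-target update: the four objectives are trained
    jointly. `targets` maps each head name to a (batch, seq_len) index tensor."""
    out = model(token_ids)  # each head yields (batch, seq_len, n_labels) probabilities
    loss = sum(
        # nll_loss expects log-probabilities with shape (batch, n_labels, seq_len)
        F.nll_loss(out[k].clamp_min(1e-9).log().transpose(1, 2), targets[k])
        for k in ("boundary", "end", "type", "unified")
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```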
More specifically, an activation function is arranged in each of the first predictive classifier, the second predictive classifier, and the third predictive classifier, and the first vector feature set or the second vector feature set is labeled after passing through this activation function. The computation of the activation function is as follows:

$z_t = W_1 \cdot h_t$

$\hat{y} = \mathrm{Softmax}(z_t)$

where $W_1$ is a parameter of the named entity model that needs to be trained and is associated with the tag type of the predictive classifier, $h_t$ denotes a feature vector of the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result. The Softmax function is a normalization function that maps the values of $z_t$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_t$, and $f(i)$ is the softmax value of the i-th element.
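The following short NumPy sketch illustrates this activation computation on made-up numbers: a hypothetical weight matrix maps a feature vector h_t to scores z_t, and Softmax turns the scores into values in (0, 1) that sum to 1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # hypothetical weights: 3 labels (B, I, O), 4 features
h_t = rng.normal(size=4)       # hypothetical feature vector at time t
z_t = W1 @ h_t
y_hat = softmax(z_t)
print(y_hat, y_hat.sum())      # three values in (0, 1) summing to 1.0
```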
To ease understanding of the data processing flow of the above named entity model, the flow and computation of data in the model are described in detail below, taking "毕业于上海交通大学医学院" ("graduated from Shanghai Jiao Tong University School of Medicine") as a concrete example of the input text information.
Specifically, the LSTM layer in the first backbone at the front end of the named entity model is denoted LSTM1; it outputs the first vector feature set, denoted h1, whose output vector at time t is denoted $h^1_t$. After the Softmax activation function, this is used to predict the Boundary-tag labels. For example, the output corresponding to "毕业于上海交通大学医学院" should be "毕(O)业(O)于(O)上(B)海(I)交(I)通(I)大(I)学(I)医(O)学(O)院(O)". The output is denoted y_boundary_tag.
The computation is as follows:

$z_b = W_b \cdot h^1_t$

$y_{boundary\_tag} = \mathrm{Softmax}(z_b)$

where $W_b$ is the parameter of the first predictive classifier in the model that needs to be trained, and the Softmax function is the normalization function that maps the values of $z_b$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_b$, and $f(i)$ is the softmax value of the i-th element.
At the same time, the second predictive classifier predicts the End-tag labels, that is, it predicts for each position whether the corresponding text is 0 (not the end position of an entity) or 1 (the end position of an entity). For example, the output corresponding to "毕业于上海交通大学医学院" is "毕(0)业(0)于(0)上(0)海(0)交(0)通(0)大(0)学(1)医(0)学(0)院(0)". The output is denoted y_end_tag.
The computation is as follows:

$z_e = W_e \cdot h^1_t$

$y_{end\_tag} = \mathrm{Softmax}(z_e)$

where $W_e$ is the parameter of the second predictive classifier in the model that needs to be trained.
By continuously training and optimizing the first backbone of the model (the first predictive classifier and the second predictive classifier) with the training sample set, the output h1 of LSTM1 in the first backbone learns both the Boundary-tag and the End-tag classification characteristics.
In a traditional CRF, however, every token of the text information is treated alike. Since many institutions end with "学院" ("college/school"), a CRF prediction often marks "上海交通大学医学院" as one whole entity; yet the desired final result should recognize the granularity of "上海交通大学" (Shanghai Jiao Tong University) while ignoring the lower-level entity "医学院" (School of Medicine). Recognition of entity boundaries therefore needs to be strengthened, and the first backbone of the named entity model provided in this application effectively adds a boundary constraint on the entities and realizes the corresponding prediction function.
In addition, the LSTM layer in the second backbone at the front end of the named entity model is denoted LSTM2. After the input text information enters the model, LSTM2 outputs the second vector feature set, denoted h2, whose output vector at time t is denoted $h^2_t$. The softmax function is then used to predict the Type-tag labels, that is, to predict the corresponding classification type, such as JOB (work unit) or EDU (education unit). For example, the output corresponding to "毕业于上海交通大学医学院" should be "毕(O)业(O)于(O)上(EDU)海(EDU)交(EDU)通(EDU)大(EDU)学(EDU)医(O)学(O)院(O)". The output is denoted y_type_tag.
The computation is as follows:

$z_{type} = W_t \cdot h^2_t$

$y_{type\_tag} = \mathrm{Softmax}(z_{type})$

where $W_t$ is the parameter of the third predictive classifier in the model that needs to be trained.
In addition, for the main prediction part (corresponding to the total predictive classifier, SC-BG), the prediction components BG (boundary guide) and SC (sentiment consistency) are introduced to further integrate the data of the first vector feature set h1 and the second vector feature set h2 together with their internal hidden characteristics, yielding the final prediction result corresponding to the unified-tag labels. For example, the output corresponding to "毕业于上海交通大学医学院" is "毕(O)业(O)于(O)上(B-EDU)海(I-EDU)交(I-EDU)通(I-EDU)大(I-EDU)学(I-EDU)医(O)学(O)院(O)", which is the final target label. The final output is denoted y_unified_tag.
Specifically, a first prediction component SC and a second prediction component BG are arranged in the total predictive classifier. The first prediction component SC is used to optimize the second vector feature set so as to strengthen the connection between the current vector feature and the feature of the previous moment within the second vector feature set; the second prediction component BG is used to produce the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag tag type.
For the SC component, the input is h2 and the output is a set of vector features, denoted h3, whose output vector at time t is denoted $h^3_t$. The computation gates the current input $h^2_t$ against the previous output through the sigmoid function and the ⊙ operator (the two gating formulas appear only as images PCTCN2020118331-appb-000020 and PCTCN2020118331-appb-000021 in the original), where the sigmoid function is:

$\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$
It should be noted that the ⊙ operator is a preset linear operator, for example A ⊙ B = 3A + 2B; any operator satisfying such a linear relation suffices here.
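Since the exact SC gating formulas survive only as images in the original, the sketch below shows one plausible reading under explicit assumptions: a sigmoid gate computed from the current input and the previous output, combined through the ⊙ operator defined above. The gate parameterization (Wg, Ug) and the combination rule are assumptions, not the application's stated equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_op(a, b):
    """The preset linear operator from the description: A ⊙ B = 3A + 2B."""
    return 3 * a + 2 * b

def sc_component(h2, Wg, Ug):
    """Hypothetical SC pass: h3_t strengthens the tie between the current
    feature h2_t and the previous output h3_{t-1} via a sigmoid gate.
    h2: (seq_len, d) array; Wg, Ug: (d, d) assumed gate matrices."""
    h3 = np.zeros_like(h2)
    prev = np.zeros(h2.shape[1])
    for t in range(h2.shape[0]):
        g = sigmoid(Wg @ h2[t] + Ug @ prev)           # assumed gate form
        h3[t] = linear_op(g * h2[t], (1 - g) * prev)  # assumed combination via ⊙
        prev = h3[t]
    return h3
```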
For the BG component, the inputs are h1 and h3, and the output is the final label (unified-tag), denoted y_unified_tag. The prediction proceeds as follows. BG contains a transition matrix $W_{tr}$ from the boundary-tag space to the unified-tag space (its dimensions appear as image PCTCN2020118331-appb-000023 in the original), where $B_i$ is the unified-tag label set {B-EDU, I-EDU, B-STU, I-STU, O, …} and $|B_i|$ is the size of that set.

Through the transition matrix, the original $z_b$ is transformed into:

$z'_u = W_{tr}^{\top} \cdot z_b$

where $z'_u$ can be regarded as the final label predicted from the boundary information, and $z_b$ is the intermediate quantity of the first predictive classifier (see the specific embodiment of the first predictive classifier). A weight $a_t$ for the label $z'_u$ is computed from $z_b$ itself (the computation of $c_t$ appears as image PCTCN2020118331-appb-000025 in the original):

$a_t = \epsilon \, c_t$

where $\epsilon$ is a hyperparameter of the prediction. The final label is then computed by combining $z'_u$, weighted by $a_t$, with the prediction obtained from h3 (the two combination formulas appear as images PCTCN2020118331-appb-000026 and PCTCN2020118331-appb-000027 in the original), and the final output is denoted y_unified_tag.
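The BG combination is likewise only partially recoverable from the original, so the sketch below is one plausible interpretation under stated assumptions: the boundary scores z_b are projected into the unified-tag space through the transition matrix, c_t is taken here as the confidence of the boundary prediction, and the final distribution mixes the boundary-guided scores with a unified-tag score computed from h3 by a hypothetical weight matrix Wu.

```python
import numpy as np

def bg_component(z_b, h3_t, W_tr, Wu, eps=0.5):
    """Hypothetical BG pass for one time step.
    z_b:  boundary scores (|Bb|,);  W_tr: (|Bb|, |Bi|) transition matrix;
    h3_t: SC output (d,);           Wu:   (|Bi|, d) assumed unified score matrix."""
    z_u_prime = W_tr.T @ z_b                    # final label predicted from boundary info
    p_b = np.exp(z_b) / np.exp(z_b).sum()
    c_t = p_b.max()                             # assumed: confidence of the boundary prediction
    a_t = eps * c_t                             # weight of the boundary-guided label
    z_u = Wu @ h3_t                             # unified score from the SC output
    scores = a_t * z_u_prime + (1 - a_t) * z_u  # assumed mixing rule
    return np.exp(scores) / np.exp(scores).sum()  # y_unified_tag distribution
```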
It should be noted that, as the training sample set trains the named entity model, W_1 (including W_b, W_e, W_t, and W_tr) changes accordingly and approaches the optimal values. Once the named entity model has been trained, W_1 generally lies near the optimum, and the named entity model can then be used to extract institutional entity information from the text information to be extracted.
S130: Obtain the text information to be detected, and perform sequence labeling on the text information to be detected through the named entity model.
Specifically, the text information to be detected that is related to an actor, such as personal resume information or personal homepage information, can be obtained from the Internet or from a database.
It should be noted that, after the text information to be detected is labeled by the named entity model, the corresponding label sequences are output at the four outputs, namely y_boundary_tag, y_end_tag, y_type_tag, and y_unified_tag. Since y_unified_tag already contains the sequence characteristics of y_boundary_tag, y_end_tag, and y_type_tag, in practical applications it suffices to obtain only the y_unified_tag sequence labeling of the text information to be detected.
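As an illustration of this last step, the helper below decodes a y_unified_tag sequence into typed entity spans, using the example labeling given earlier; the function itself is an illustrative assumption rather than part of the application.

```python
def decode_unified(chars, tags):
    """Hypothetical decoder: collect (entity_text, type) spans from a
    unified-tag sequence such as B-EDU / I-EDU / O."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf:
            buf.append(ch)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

tags = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "O", "O", "O"]
print(decode_unified(list("毕业于上海交通大学医学院"), tags))
# [('上海交通大学', 'EDU')]
```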
S140: Extract the relevant institutional entities in the text information to be detected according to the sequence labeling.

Specifically, the relevant institutional entities of the actor in the text information to be detected are extracted according to the y_unified_tag sequence labeling.
It should be noted that, since the y_unified_tag sequence labeling contains the characteristics of the y_end_tag sequence labeling, the end position of the required institutional entity can be determined precisely from the y_unified_tag sequence labeling, avoiding inaccurate positioning of institutional entities. Furthermore, since the y_unified_tag sequence labeling also contains the characteristics of the y_type_tag sequence labeling, the category of an institutional entity, such as "work experience", "education experience", or "short-term training experience", can be determined precisely from the y_unified_tag sequence labeling.
Of course, by modifying the training targets, the method can be further extended to finer-grained institution extraction, for example extracting second-level institutions ("医学院" (School of Medicine) in "上海交通大学医学院"). In that case the unified-tag labeling is "上(B-EDU)海(I-EDU)交(I-EDU)通(I-EDU)大(I-EDU)学(I-EDU)医(I-EDU)学(I-EDU)院(I-EDU)", the Boundary-tag labeling is "上(B)海(I)交(I)通(I)大(I)学(I)医(I)学(I)院(I)", and, by the same reasoning as above, the end label 1 should be placed at the position of "院". The framework of the model needs no change, and extraction of second-level institutions is thereby achieved.
As can be seen from the above technical solution, the institutional entity extraction method based on multiple training targets proposed in this application designs a single named entity model trained on multiple targets; compared with separately training a traditional named entity extraction model and a text classification model, this effectively avoids error propagation. Moreover, addressing the problems that conventional named entity extraction models such as LSTM+CRF cannot properly distinguish different types of the same entity and cannot recognize boundaries accurately, the multi-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of boundaries is far more stable than that of a traditional NER model.
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
Embodiment 2
Corresponding to the above method, this application further provides an institutional entity extraction system based on multiple training targets, the system including:

a sample labeling unit, configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

a model training unit, configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, configured to obtain the text information to be detected and perform sequence labeling on the text information to be detected through the named entity model; and

an institutional entity extraction unit, configured to extract the relevant institutional entities in the text information to be detected according to the sequence labeling.
Embodiment 3
This application further provides an electronic device 70. FIG. 2 is a schematic structural diagram of a preferred embodiment of the electronic device 70 provided by this application.

In this embodiment, the electronic device 70 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.

The electronic device 70 includes a processor 71 and a memory 72.

The memory 72 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, for example a hard disk of the electronic device 70. In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 70.

In this embodiment, the readable storage medium of the memory 72 is generally used to store the institutional entity extraction program 73 based on multiple training targets installed in the electronic device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.

In some embodiments, the processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 72, for example the institutional entity extraction program 73 based on multiple training targets.

In some embodiments, the electronic device 70 is a terminal device such as a smartphone, a tablet computer, or a portable computer. In other embodiments, the electronic device 70 may be a server.

FIG. 2 shows only the electronic device 70 with the components 71 to 73, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.

Optionally, the electronic device 70 may further include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a loudspeaker or earphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 70 may further include a display, which may also be called a display screen or a display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display is used to show the information processed in the electronic device 70 and to present a visual user interface.

Optionally, the electronic device 70 may further include a touch sensor. The area provided by the touch sensor for the user's touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor, or the like, and includes not only contact touch sensors but also proximity touch sensors. Moreover, the touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array.

In addition, the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects touch operations triggered by the user.

Optionally, the electronic device 70 may further include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not detailed here.
In the device embodiment shown in FIG. 2, the memory 72, as a computer storage medium, may include an operating system and the institutional entity extraction program 73 based on multiple training targets; when executing the institutional entity extraction program 73 based on multiple training targets stored in the memory 72, the processor 71 implements the following steps:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

obtaining the text information to be detected, and performing sequence labeling on the text information to be detected through the named entity model; and

extracting the relevant institutional entities in the text information to be detected according to the sequence labeling.
In this embodiment, FIG. 3 is a schematic diagram of the internal logic of the institutional entity extraction program based on multiple training targets according to an embodiment of this application. As shown in FIG. 3, the institutional entity extraction program 73 based on multiple training targets may also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function. FIG. 3 is a program module diagram of a preferred embodiment of the institutional entity extraction program 73 based on multiple training targets in FIG. 2; the program may be divided into a sample labeling module 74, a model training module 75, a model application module 76, and an institutional entity extraction module 77. The functions or operation steps implemented by the modules 74 to 77 are similar to those described above and are not detailed here; exemplarily:

the sample labeling module 74 is configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

the model training module 75 is configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

the model application module 76 is configured to obtain the text information to be detected and perform sequence labeling on the text information to be detected through the named entity model; and

the institutional entity extraction module 77 is configured to extract the relevant institutional entities in the text information to be detected according to the sequence labeling.
Embodiment 4
This application further provides a computer-readable storage medium. The computer-readable storage medium stores an institutional entity extraction program 73 based on multiple training targets; when the institutional entity extraction program 73 based on multiple training targets is executed by a processor, the following operations are implemented:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

obtaining the text information to be detected, and performing sequence labeling on the text information to be detected through the named entity model; and

extracting the relevant institutional entities in the text information to be detected according to the sequence labeling.
The specific implementation of the computer-readable storage medium provided by this application is substantially the same as that of the above institutional entity extraction method based on multiple training targets and of the electronic device, and is not repeated here.
Optionally, the computer-readable storage medium may be non-volatile or volatile.
It should be noted that the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another by cryptographic methods; each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
It should be further noted that, herein, the terms "include", "comprise", or any of their variants are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structural or equivalent process transformation made using the contents of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1. An institutional entity extraction method based on multiple training targets, applied to an electronic device, wherein the method comprises:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the obtained text information to be detected through the named entity model; and

extracting relevant institutional entities in the text information to be detected according to the sequence labeling.
2. The institutional entity extraction method based on multiple training targets according to claim 1, wherein the training sample set is stored in a blockchain; and, in the process of performing named entity labeling on each training sample in the training sample set, the BIO labeling scheme is used, wherein B is used to label the beginning of an institutional entity, I is used to label the institutional entity itself, and O is used to label information in the training sample that is unrelated to any institutional entity.
3. The institutional entity extraction method based on multiple training targets according to claim 2, wherein, in the process of performing named entity labeling on each training sample in the training sample set, the tag types used comprise Boundary-tag, End-tag, Type-tag, and unified-tag, wherein the Boundary-tag type is used to label the boundaries of the institutional entities in the training sample, the End-tag type is used to label the end positions of the institutional entities in the training sample, the Type-tag type is used to label the entity types of the institutional entities in the training sample, and the unified-tag type serves as the final target label.
4. The institutional entity extraction method based on multiple training targets according to claim 3, wherein, after the first backbone has extracted the first vector feature set, the first backbone branches into a first branch and a second branch, and after the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch, wherein:

a first predictive classifier is arranged in the first branch and is used to label the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and is used to label the end positions of the first vector feature set according to the End-tag tag type; and

a third predictive classifier is arranged in the third branch and is used to label the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and is used to produce the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
5. The institutional entity extraction method based on multiple training targets according to claim 4, wherein a first neural network model LSTM1 is arranged in the first backbone, and the first backbone extracts the first vector feature set of the input text information through the first neural network model LSTM1; and a second neural network model LSTM2 is arranged in the second backbone, and the second backbone extracts the second vector feature set of the input text information through the second neural network model LSTM2.
6. The institutional entity extraction method based on multiple training targets according to claim 5, wherein an activation function is arranged in each of the first predictive classifier, the second predictive classifier, and the third predictive classifier, and the first vector feature set or the second vector feature set is labeled after passing through the activation function, the computation of the activation function being as follows:

$z_t = W_1 \cdot h_t$

$\hat{y} = \mathrm{Softmax}(z_t)$

wherein $W_1$ is a parameter of the named entity model that needs to be trained and is associated with the tag type of the predictive classifier, $h_t$ denotes a feature vector of the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $z_t$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

wherein $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_t$, and $f(i)$ is the softmax value of the i-th element.
7. The institutional entity extraction method based on multiple training targets according to claim 6, wherein a first prediction component SC and a second prediction component BG are arranged in the total predictive classifier, wherein the first prediction component SC is used to optimize the second vector feature set so as to strengthen the connection between the current vector feature and the vector feature of the previous moment within the second vector feature set; and the second prediction component BG is used to produce the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag tag type.
8. An institutional entity extraction system based on multiple training targets, wherein the system comprises:

a sample labeling unit, configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

a model training unit, configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, configured to perform sequence labeling on the obtained text information to be detected through the named entity model; and

an institutional entity extraction unit, configured to extract relevant institutional entities in the text information to be detected according to the sequence labeling.
9. An electronic device, wherein the electronic device comprises a memory, a processor, and an institutional entity extraction program based on multiple training targets that is stored in the memory and executable on the processor, and the institutional entity extraction program based on multiple training targets, when executed by the processor, implements the following steps:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the obtained text information to be detected through the named entity model; and

extracting relevant institutional entities in the text information to be detected according to the sequence labeling.
10. The electronic device according to claim 9, wherein the training sample set is stored in a blockchain; in the process of performing named entity labeling on each training sample in the training sample set, the BIO labeling scheme is used, wherein B is used to label the beginning of an institutional entity, I is used to label the institutional entity itself, and O is used to label information in the training sample that is unrelated to any institutional entity; and wherein, in the process of performing named entity labeling on each training sample in the training sample set, the tag types used comprise Boundary-tag, End-tag, Type-tag, and unified-tag, wherein the Boundary-tag type is used to label the boundaries of the institutional entities in the training sample, the End-tag type is used to label the end positions of the institutional entities in the training sample, the Type-tag type is used to label the entity types of the institutional entities in the training sample, and the unified-tag type serves as the final target label.
11. The electronic device according to claim 10, wherein, after the first backbone has extracted the first vector feature set, the first backbone branches into a first branch and a second branch, and after the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch, wherein:

a first predictive classifier is arranged in the first branch and is used to label the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and is used to label the end positions of the first vector feature set according to the End-tag tag type; and

a third predictive classifier is arranged in the third branch and is used to label the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and is used to produce the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
12. The electronic device according to claim 11, wherein a first neural network model LSTM1 is arranged in the first backbone, and the first backbone extracts the first vector feature set of the input text information through the first neural network model LSTM1; and a second neural network model LSTM2 is arranged in the second backbone, and the second backbone extracts the second vector feature set of the input text information through the second neural network model LSTM2.
13. The electronic device according to claim 12, wherein:
    An activation function is provided in each of the first prediction classifier, the second prediction classifier, and the third prediction classifier, and passing the first vector feature set or the second vector feature set through the activation function produces the labeling of that feature set; the activation function is computed as follows:

    $z = W_1 h$

    $\hat{y} = \mathrm{Softmax}(z)$

    where $W_1$ is a parameter of the named entity model to be trained, associated with the labeling type of the prediction classifier, $h$ denotes the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $W_1 h$ into the interval (0, 1):

    $f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

    where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $W_1 h$, and $f(i)$ is the softmax value of the i-th element.
14. The electronic device according to claim 13, wherein:
    A first prediction component SC and a second prediction component BG are provided in the overall prediction classifier; the first prediction component SC optimizes the second vector feature set so as to strengthen the link between the vector feature at the current moment and the vector feature at the previous moment within the second vector feature set;
    The second prediction component BG produces the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag labeling type.
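The claims name SC and BG but do not disclose their internals. One plausible reading of SC, offered strictly as an assumption, is a learned gate that mixes each timestep of the second feature set with its predecessor; BG would then concatenate the gated output with the first feature set and classify under the unified-tag scheme, as in the `unified_clf` head sketched earlier.

```python
import torch
import torch.nn as nn

class SC(nn.Module):
    """Hypothetical SC: gate each vector feature with the one from the previous moment."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h2):                             # h2: (batch, seq, dim)
        prev = torch.roll(h2, shifts=1, dims=1).clone()
        prev[:, 0] = 0.0                               # the first timestep has no predecessor
        g = torch.sigmoid(self.gate(torch.cat([h2, prev], dim=-1)))
        return g * h2 + (1 - g) * prev                 # the "optimized" second feature set
```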
15. A computer-readable storage medium, wherein the computer-readable storage medium stores an institutional entity extraction program based on multiple training targets, and when the program is executed by a processor, the following steps are implemented:
    Acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
    Training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy, wherein the named entity model includes a first main path and a second main path, the first main path extracts a first vector feature set of the input text information, the second main path extracts a second vector feature set of the input text information, and the second main path further performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
    Performing sequence labeling on the acquired text information to be detected through the named entity model;
    Extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
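The second step trains one model against several labeling targets at once. A common way to combine such targets, sketched here under the assumption of the `MultiTargetNER` model above and equal weighting across the four schemes (neither of which the claims specify), is to sum one cross-entropy term per tag scheme:

```python
import torch
import torch.nn.functional as F

def multi_target_loss(out, gold):
    """out: dict of per-token probability tensors, each (batch, seq, n_tags);
    gold: dict of integer (long) label tensors, each (batch, seq)."""
    total = 0.0
    for scheme in ("boundary", "end", "type", "unified"):
        log_probs = torch.log(out[scheme].clamp_min(1e-9))     # avoid log(0)
        total = total + F.nll_loss(log_probs.transpose(1, 2),  # (batch, n_tags, seq)
                                   gold[scheme])
    return total
```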
16. The computer-readable storage medium according to claim 15, wherein:
    The training sample set is stored in a blockchain, and the named entity labeling of each training sample in the training sample set uses the BIO scheme, in which:
    B marks the beginning of an institutional entity, I marks the body of an institutional entity, and O marks information in the training sample that is unrelated to any institutional entity;
    The labeling types used in labeling the training samples include Boundary-tag, End-tag, Type-tag, and unified-tag, wherein:
    The Boundary-tag type marks the boundaries of the institutional entities in the training sample, the End-tag type marks their end positions, the Type-tag type marks their entity types, and the unified-tag type serves as the final target label.
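To make the four schemes concrete, here is a single hypothetical training sample labeled under all of them; the sentence and the label strings are invented for illustration, since the claims name the schemes but not their exact label sets:

```python
# "平安科技在深圳成立" -- a made-up sentence containing one institutional entity.
tokens   = list("平安科技在深圳成立")
boundary = ["B", "I", "I", "I", "O", "O", "O", "O", "O"]          # Boundary-tag: entity span
end      = ["O", "O", "O", "E", "O", "O", "O", "O", "O"]          # End-tag: last token of the entity
etype    = ["ORG", "ORG", "ORG", "ORG", "O", "O", "O", "O", "O"]  # Type-tag: entity class
unified  = ["B-ORG", "I-ORG", "I-ORG", "I-ORG",
            "O", "O", "O", "O", "O"]                              # unified-tag: final target label
```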
17. The computer-readable storage medium according to claim 16, wherein, after the first main path has extracted the first vector feature set, the first main path branches into a first branch and a second branch, and after the second main path has extracted the second vector feature set, the second main path branches into a third branch and a final output branch; wherein,
    A first prediction classifier is provided in the first branch and marks the entity boundaries of the first vector feature set according to the Boundary-tag labeling type; a second prediction classifier is provided in the second branch and marks the end positions of the first vector feature set according to the End-tag labeling type;
    A third prediction classifier is provided in the third branch and marks the entity types of the first vector feature set according to the Type-tag labeling type; an overall prediction classifier is provided in the final output branch and produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag labeling type.
18. The computer-readable storage medium according to claim 17, wherein:
    A first neural network model, LSTM1, is provided in the first main path, and the first main path extracts the first vector feature set of the input text information through LSTM1;
    A second neural network model, LSTM2, is provided in the second main path, and the second main path extracts the second vector feature set of the input text information through LSTM2.
19. The computer-readable storage medium according to claim 18, wherein:
    An activation function is provided in each of the first prediction classifier, the second prediction classifier, and the third prediction classifier, and passing the first vector feature set or the second vector feature set through the activation function produces the labeling of that feature set; the activation function is computed as follows:

    $z = W_1 h$

    $\hat{y} = \mathrm{Softmax}(z)$

    where $W_1$ is a parameter of the named entity model to be trained, associated with the labeling type of the prediction classifier, $h$ denotes the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $W_1 h$ into the interval (0, 1):

    $f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

    where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $W_1 h$, and $f(i)$ is the softmax value of the i-th element.
20. The computer-readable storage medium according to claim 19, wherein:
    A first prediction component SC and a second prediction component BG are provided in the overall prediction classifier; the first prediction component SC optimizes the second vector feature set so as to strengthen the link between the vector feature at the current moment and the vector feature at the previous moment within the second vector feature set;
    The second prediction component BG produces the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag labeling type.
PCT/CN2020/118331 2020-07-28 2020-09-28 Mechanism entity extraction method, system and device based on multiple training targets WO2021139239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010738252.X 2020-07-28
CN202010738252.XA CN111881692B (en) 2020-07-28 2020-07-28 Mechanism entity extraction method, system and device based on multiple training targets

Publications (1)

Publication Number Publication Date
WO2021139239A1

Family

ID=73201874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118331 WO2021139239A1 (en) 2020-07-28 2020-09-28 Mechanism entity extraction method, system and device based on multiple training targets

Country Status (2)

Country Link
CN (1) CN111881692B (en)
WO (1) WO2021139239A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635280A (en) * 2018-11-22 2019-04-16 园宝科技(武汉)有限公司 A kind of event extraction method based on mark
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method
CN110162772A (en) * 2018-12-13 2019-08-23 北京三快在线科技有限公司 Name entity recognition method and device
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
CN110705294A (en) * 2019-09-11 2020-01-17 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and device
CN111428501A (en) * 2019-01-09 2020-07-17 北大方正集团有限公司 Named entity recognition method, recognition system and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075228B (en) * 2006-05-15 2012-05-23 松下电器产业株式会社 Method and apparatus for named entity recognition in natural language
US11593558B2 (en) * 2017-08-31 2023-02-28 Ebay Inc. Deep hybrid neural network for named entity recognition
KR102043353B1 (en) * 2017-12-04 2019-11-12 주식회사 솔루게이트 Apparatus and method for recognizing Korean named entity using deep-learning
CN110287479B (en) * 2019-05-20 2022-07-22 平安科技(深圳)有限公司 Named entity recognition method, electronic device and storage medium
CN110399616A (en) * 2019-07-31 2019-11-01 国信优易数据有限公司 Name entity detection method, device, electronic equipment and readable storage medium storing program for executing
CN110866115B (en) * 2019-10-16 2023-08-08 平安科技(深圳)有限公司 Sequence labeling method, system, computer equipment and computer readable storage medium


Also Published As

Publication number Publication date
CN111881692A (en) 2020-11-03
CN111881692B (en) 2023-01-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911398

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911398

Country of ref document: EP

Kind code of ref document: A1