WO2021139239A1 - Organization entity extraction method, system and device based on multiple training targets - Google Patents

Organization entity extraction method, system and device based on multiple training targets

Info

Publication number
WO2021139239A1
WO2021139239A1 PCT/CN2020/118331 CN2020118331W WO2021139239A1 WO 2021139239 A1 WO2021139239 A1 WO 2021139239A1 CN 2020118331 W CN2020118331 W CN 2020118331W WO 2021139239 A1 WO2021139239 A1 WO 2021139239A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector feature
feature set
training sample
text information
entity
Prior art date
Application number
PCT/CN2020/118331
Other languages
French (fr)
Chinese (zh)
Inventor
柴玲
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021139239A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data; database structures therefor; file system structures therefor
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the technical field of information extraction, and in particular to a method, system, device and storage medium for extracting institutional entities based on multiple training targets.
  • This application provides a method, system, electronic device, and computer storage medium for extracting institutional entities based on multiple training targets. Its main purpose is to solve the low efficiency and poor quality of existing institutional entity extraction methods.
  • To achieve the above objective, this application provides a method for extracting institutional entities based on multiple training targets, the method including the following steps:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch and a second main branch, where the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • this application also provides an institutional entity extraction system based on multiple training targets, the system including:
  • a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;
  • a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts a first vector feature set of the input text information, and a second main branch, which extracts a second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • a model application unit, which performs sequence labeling on the acquired text information to be detected through the named entity model;
  • an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • In addition, the present application also provides an electronic device comprising a memory, a processor, and a multi-training-target-based institutional entity extraction program that is stored in the memory and can run on the processor; when the program is executed by the processor, the following steps are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch and a second main branch, where the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • The present application also provides a computer-readable storage medium storing a multi-training-target-based institutional entity extraction program; when the program is executed by a processor, the same steps are implemented: acquiring and labeling the training sample set, training the named entity model with the first and second main branches described above, performing sequence labeling on the text information to be detected, and extracting the relevant institutional entities according to the sequence labeling.
  • This application can effectively avoid error propagation. In addition, the multi-training-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
  • FIG. 1 is a flowchart of a preferred embodiment of a method for extracting institutional entities based on multiple training targets according to an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal logic of an organization entity extraction program based on multiple training targets according to an embodiment of the present application.
  • The technical solution of this application can be applied to the fields of artificial intelligence, blockchain, and/or big data technology; the data involved, such as the training sample set, can be stored in a database or in a blockchain, for example via distributed blockchain storage, which this application does not limit.
  • Traditional entity extraction mainly uses one of two approaches. The first is a staged training model: a named entity extraction model is first trained to identify all institutional entities, e.g. "Southern Medical University" is tagged [B-ORG, I-ORG, I-ORG, I-ORG, E-ORG]; a text classification model then decides whether the entity belongs to the experience type "work experience (JOB)", "education experience (EDU)", or "short-term study experience (STU)". The obvious drawback of this solution is that the errors of the first model are passed to the second model, which amplifies them.
  • Another common solution is to train an end-to-end named entity extraction model, such as LSTM+CRF, with a unified tag for each entity: "Southern Medical University" is directly tagged [B-EDU, I-EDU, I-EDU, I-EDU, I-EDU, E-EDU] and "Sun Yat-sen University" is directly tagged [B-STU, I-STU, I-STU, E-STU]. This avoids the error propagation of the traditional separately trained scheme (see the sketch below).
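  • As a minimal illustration of the two labeling schemes above (plain Python lists; the variable names are ours, not the patent's):

```python
# Staged scheme: the NER model first emits generic ORG tags for
# "Southern Medical University"; a second-stage text classifier then
# predicts the experience type, so its input already carries any NER errors.
staged_ner_tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG"]
staged_entity_type = "EDU"  # output of the separate text classification model

# Unified (end-to-end) scheme: one model emits type-specific tags directly,
# so there is no second stage whose errors could be amplified.
unified_tags = ["B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "E-EDU"]
```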
  • FIG. 1 shows the flow of the method for extracting institutional entities based on multiple training objectives provided in this application.
  • the method for extracting institutional entities based on multiple training targets includes:
  • S110: Obtain a training sample set, and label each training sample in the training sample set with a named entity.
  • It should be noted that a sample here is a piece of text containing an institutional entity; for example, a paragraph from a job resume, or a piece of text from a scholar's homepage on the Internet.
  • In labeling the training samples with named entities, this application uses the BIO labeling scheme, where B marks the beginning of an institutional entity, I marks the inside of the institutional entity itself, and O marks information in the sample that is unrelated to any institutional entity.
  • In addition, to enable the later multi-target training of the model, this application applies multiple types of annotation to each sample in the training sample set, at least the following four: Boundary-tag, End-tag, Type-tag, and unified-tag. Different tag types are labeled in different ways and serve different functions: the Boundary-tag type marks the boundaries of the institutional entities in the sample; the End-tag type marks the end position of each institutional entity; the Type-tag type marks the entity type of the institutional entity, such as graduating institution, workplace, or internship place; and the unified-tag type is the final target label (see the annotated example below). After all four types of annotation are completed, the sample is saved to the training sample set.
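  • The four annotation layers can be pictured side by side. The sketch below reproduces, as Python lists, the four taggings of the sample "毕业于上海交通大学医学院" ("Graduated from Shanghai Jiaotong University School of Medicine") given later in this description:

```python
chars = list("毕业于上海交通大学医学院")  # 12 characters

# Boundary-tag: entity boundary kept at university granularity, 医学院 ignored
boundary_tag = ["O", "O", "O", "B", "I", "I", "I", "I", "I", "O", "O", "O"]
# End-tag: 1 only at the entity's end position (the 学 of 大学)
end_tag = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
# Type-tag: experience type of the entity span
type_tag = ["O", "O", "O", "EDU", "EDU", "EDU", "EDU", "EDU", "EDU", "O", "O", "O"]
# unified-tag: the final target label combining boundary and type
unified_tag = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU",
               "I-EDU", "O", "O", "O"]

assert len(chars) == len(boundary_tag) == len(end_tag) \
       == len(type_tag) == len(unified_tag)
```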
  • the training sample set can be stored in a node of the blockchain.
  • S120: Use the labeled training sample set to train a preset named entity model until the named entity model reaches a preset accuracy. The named entity model includes a first main branch and a second main branch: the first main branch extracts the first vector feature set of the input text information, the second main branch extracts the second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set.
  • It should be noted that the named entity model is a newly designed sequence labeling model that combines the training of multiple targets. Specifically, it contains two main branches: the first main branch contains a first neural network LSTM1, through which it extracts the first vector feature set of the input text information (a training sample, or later the text information to be detected); the second main branch contains a second neural network LSTM2, through which it extracts the second vector feature set of the input text information.
  • The first main branch splits into a first branch and a second branch. The first branch contains a first predictive classifier (predictor), which marks the entity boundaries of the first vector feature set according to the Boundary-tag annotation type; the second branch contains a second predictive classifier, which marks the entity end positions of the first vector feature set according to the End-tag annotation type.
  • Specifically, after feature extraction by LSTM1, the input text information yields a corresponding first vector feature set, denoted h1, which is passed simultaneously to the first branch and the second branch: the first branch, with the first predictive classifier, outputs the entity boundary labeling y_boundary_tag, corresponding to the Boundary-tag annotation; the second branch, with the second predictive classifier, outputs the end position labeling y_end_tag, corresponding to the End-tag annotation.
  • For the second main branch, after feature extraction by LSTM2 the text information yields a corresponding second vector feature set, denoted h2; the second main branch then splits into a third branch and a final output branch. The third branch contains a third predictive classifier, which marks the entity types according to the Type-tag annotation type; the final output branch contains a total prediction classifier (SC-BG), which produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag annotation type.
  • Specifically, h2 is passed simultaneously to the third branch and the final output branch: the third branch, with the third predictive classifier, outputs the entity type labeling y_type_tag, corresponding to the Type-tag annotation; the final output branch, with the total prediction classifier (SC-BG), produces the final labeling y_unified_tag of the input text information from the first vector feature set h1 and the second vector feature set h2, corresponding to the unified-tag annotation (a code sketch follows below).
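  • To make the two-branch, four-head structure concrete, the following is a minimal PyTorch-style sketch. It is an illustrative reconstruction, not the patent's reference implementation: all module names and dimensions are assumptions, and the SC gate is one plausible reading of "relating the current feature to the previous moment's feature". The heads return raw scores; Softmax is applied by the loss or at prediction time.

```python
import torch
import torch.nn as nn

class MultiTargetNER(nn.Module):
    """Sketch of the two-branch named entity model: LSTM1 feeds the
    boundary-tag and end-tag heads; LSTM2 feeds the type-tag head and,
    via an SC-style gate, the final unified-tag head (here simplified
    to a linear layer over the concatenation [h1; h3])."""

    def __init__(self, vocab, emb, hidden, n_boundary, n_type, n_unified):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm1 = nn.LSTM(emb, hidden, batch_first=True)  # first main branch
        self.lstm2 = nn.LSTM(emb, hidden, batch_first=True)  # second main branch
        self.boundary_head = nn.Linear(hidden, n_boundary)   # W_b: Boundary-tag
        self.end_head = nn.Linear(hidden, 2)                 # W_e: End-tag (0/1)
        self.type_head = nn.Linear(hidden, n_type)           # W_t: Type-tag
        self.sc_gate = nn.Linear(2 * hidden, hidden)         # SC component (assumed form)
        self.unified_head = nn.Linear(2 * hidden, n_unified) # BG, simplified

    def forward(self, tokens):
        x = self.embed(tokens)
        h1, _ = self.lstm1(x)  # first vector feature set
        h2, _ = self.lstm2(x)  # second vector feature set
        # SC: sigmoid-gated mix of the current h2 step with the previous output
        outs, prev = [], h2.new_zeros(h2.size(0), h2.size(2))
        for t in range(h2.size(1)):
            g = torch.sigmoid(self.sc_gate(torch.cat([h2[:, t], prev], dim=-1)))
            prev = g * h2[:, t] + (1 - g) * prev
            outs.append(prev)
        h3 = torch.stack(outs, dim=1)  # optimized second vector feature set
        return {
            "boundary": self.boundary_head(h1),
            "end": self.end_head(h1),
            "type": self.type_head(h2),
            "unified": self.unified_head(torch.cat([h1, h3], dim=-1)),
        }
```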
  • LSTM (covering LSTM1 and LSTM2) is an existing, commonly used neural network model; its internal structure is prior art and is not repeated here. Such a network yields a set of vector features (h1 or h2) for the input text information, but an LSTM must be used together with a predictive classifier. Once the LSTMs and the predictive classifiers connected to them (the first, second, and third predictive classifiers and the total prediction classifier) have been trained on the training sample set, the feature vectors output by the LSTMs acquire the required association with each predictive classifier. This association is expressed by the trainable model parameters W_1; when W_1 reaches the preset accuracy, the feature vectors output by the LSTMs are the required feature vectors, and the named entity model attains the required preset accuracy. In this way, the vector features h1 extracted by the named entity model acquire the required connection with Boundary-tag and End-tag, and the vector features h2 acquire the required connection with Type-tag.
  • An activation function is set in the first predictive classifier, the second predictive classifier, and the third predictive classifier; after the first vector feature set or the second vector feature set passes through the activation function, the labeling of that feature set is realized. The calculation performed by the activation function is

$$\hat{y} = \mathrm{Softmax}(W_1 h)$$

where W_1 is a parameter of the named entity model that needs to be trained and is associated with the tag type of the prediction classifier, h is the first or second vector feature set, and ŷ is the output labeling result. The Softmax function is a normalization function that maps its input values into the interval (0, 1):

$$f(i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

where [x_1, x_2, ..., x_i, ..., x_n] is the input array and f(i) is the softmax value of its i-th element.
  • Specifically, the LSTM layer in the first main branch at the front end of the named entity model is denoted LSTM1; it outputs the first vector feature set, denoted h1, whose output vector at time t is denoted h1_t. After the Softmax activation function, h1_t is used to predict the boundary-tag label. For example, the corresponding output for "毕业于上海交通大学医学院" ("Graduated from Shanghai Jiaotong University School of Medicine") should be 毕(O) 业(O) 于(O) 上(B) 海(I) 交(I) 通(I) 大(I) 学(I) 医(O) 学(O) 院(O). The output is denoted y_boundary_tag = Softmax(W_b · h1_t), where W_b is a parameter of the first predictive classifier that needs to be trained, and Softmax is the normalization function defined above.
  • The second predictive classifier is used to predict the end-tag label, i.e., whether each position in the text is 0 (not an entity end position) or 1 (an entity end position). For the same example, the corresponding output is 毕(0) 业(0) 于(0) 上(0) 海(0) 交(0) 通(0) 大(0) 学(1) 医(0) 学(0) 院(0). The output is denoted y_end_tag = Softmax(W_e · h1_t), where W_e is a parameter of the second predictive classifier that needs to be trained. In this way, the output h1 of LSTM1 in the first main branch learns both the boundary-tag and end-tag classification features.
  • Similarly, the LSTM layer in the second main branch at the front end of the named entity model is denoted LSTM2; it outputs the second vector feature set, denoted h2, whose output vector at time t is denoted h2_t. The softmax function is used to predict the type_tag label, i.e., the corresponding classification type such as JOB (work unit) or EDU (education experience unit). For the same example, the corresponding output should be 毕(O) 业(O) 于(O) 上(EDU) 海(EDU) 交(EDU) 通(EDU) 大(EDU) 学(EDU) 医(O) 学(O) 院(O), denoted y_type_tag.
  • The final output branch corresponds to the total prediction classifier SC-BG, built from the prediction components SC (sentiment consistency) and BG (boundary guide). It combines the input data and the internal hidden features to obtain the final prediction result, corresponding to the unified-tag classification label. For the example "毕业于上海交通大学医学院", the corresponding output is 毕(O) 业(O) 于(O) 上(B-EDU) 海(I-EDU) 交(I-EDU) 通(I-EDU) 大(I-EDU) 学(I-EDU) 医(O) 学(O) 院(O), which is the final target label. The final output is denoted y_unified_tag.
  • The total prediction classifier is provided with a first prediction component SC and a second prediction component BG. The first prediction component SC optimizes the second vector feature set, strengthening the relationship between the vector feature at the current moment and the feature at the previous moment; the second prediction component BG marks the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag annotation type.
  • For the SC component, the input is h2 and the output is a set of vector features denoted h3, whose output vector at time t is denoted h3_t; the computation uses the sigmoid function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

which maps its argument into the interval (0, 1). The BG component contains a transition matrix W_tr from boundary-tag labels to unified-tag labels, where B_i ranges over the unified-tag tag set {B-EDU, I-EDU, B-STU, I-STU, O, ...} and z_b is the intermediate quantity in the first predictive classifier (see the description of the first predictive classifier above), from which the predicted transition weights are calculated.
  • As training proceeds, the trainable parameters W_1 (including W_b, W_e, W_t, and W_tr) change accordingly and move closer and closer to their optimal values; when training converges, W_1 generally lies near the optimum (see the training sketch below).
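  • The joint optimization over the four targets can be summarized in a short sketch: one loss per output head, summed into a single objective. This is a minimal illustration under assumptions the patent does not state (plain cross-entropy losses with equal weights), reusing the MultiTargetNER sketch above:

```python
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """One joint update over the four training targets; the per-head
    scores are raw logits, so cross_entropy applies Softmax internally."""
    out = model(batch["tokens"])
    loss = (
        F.cross_entropy(out["boundary"].transpose(1, 2), batch["boundary_tag"])
        + F.cross_entropy(out["end"].transpose(1, 2), batch["end_tag"])
        + F.cross_entropy(out["type"].transpose(1, 2), batch["type_tag"])
        + F.cross_entropy(out["unified"].transpose(1, 2), batch["unified_tag"])
    )
    optimizer.zero_grad()
    loss.backward()   # W_b, W_e, W_t, W_tr and both LSTMs are updated jointly
    optimizer.step()
    return loss.item()
```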
  • After training is completed, the named entity model can be used to extract institutional entity information from the text information to be processed.
  • S130: Obtain the text information to be detected, and perform sequence labeling on it through the named entity model.
  • Specifically, the text information to be detected that relates to the person of interest, such as personal resume information or personal homepage information, can be obtained from the Internet or from a database.
  • After the text information to be detected passes through the trained named entity model, corresponding annotation sequences are output at the four output terminals: y_boundary_tag, y_end_tag, y_type_tag, and y_unified_tag. Because y_unified_tag already contains the sequence feature information of y_boundary_tag, y_end_tag, and y_type_tag, in practical applications it is only necessary to obtain the y_unified_tag sequence annotation of the text information to be detected.
  • S140: Extract the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • Specifically, the relevant institutional entities of the person of interest are extracted from the text information to be detected according to the y_unified_tag sequence annotation. Because the y_unified_tag labels incorporate the features of the y_end_tag labels, the end position of each required institutional entity can be determined accurately, avoiding imprecise entity localization; and because they incorporate the y_type_tag features, the category of each institutional entity, whether "work experience", "education experience", or "short-term study experience", can also be determined accurately from the y_unified_tag annotation alone. A decoding sketch follows below.
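  • As an illustration of this extraction step, a minimal decoder (our own helper, not from the patent) that turns a y_unified_tag sequence into typed entity spans might look like this; it accepts both the I-only and the E-terminated tag variants that appear in the examples above:

```python
def decode_unified_tags(chars, tags):
    """Collect (entity_text, entity_type) pairs from a unified-tag sequence
    using the B-XXX / I-XXX / E-XXX / O convention."""
    entities, start, etype = [], None, None

    def close(end):
        nonlocal start, etype
        if start is not None:
            entities.append(("".join(chars[start:end]), etype))
        start, etype = None, None

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            close(i)                    # close any span still open
            start, etype = i, tag[2:]
        elif tag.startswith(("I-", "E-")) and start is not None and tag[2:] == etype:
            if tag.startswith("E-"):
                close(i + 1)            # explicit end marker
        else:
            close(i)                    # O tag or type mismatch ends the span
    close(len(tags))
    return entities

chars = list("毕业于上海交通大学医学院")
tags = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU",
        "I-EDU", "O", "O", "O"]
print(decode_unified_tags(chars, tags))  # [('上海交通大学', 'EDU')]
```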
  • In summary, compared with the traditional separately trained named entity extraction model and text classification model, the method for extracting institutional entities based on multiple training targets proposed in this application, by designing a named entity model trained on multiple targets, effectively avoids error propagation. Moreover, the multi-training-target named entity model strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
  • In addition, this application also provides an institutional entity extraction system based on multiple training targets, the system including:
  • a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;
  • a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • a model application unit, which obtains the text information to be detected and performs sequence labeling on it through the named entity model;
  • an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • In addition, this application also provides an electronic device 70.
  • Refer to FIG. 2, which is a schematic structural diagram of a preferred embodiment of the electronic device 70 provided by this application.
  • In this embodiment, the electronic device 70 may be a terminal device with a computing function, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.
  • the electronic device 70 includes a processor 71 and a memory 72.
  • the memory 72 includes at least one type of readable storage medium.
  • the at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, hard disk, multimedia card, card-type memory, and the like.
  • the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70.
  • In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, such as a plug-in hard disk equipped on the electronic device 70, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), and the like.
  • the readable storage medium of the memory 72 is generally used to store the multi-training target-based institutional entity extraction program 73 installed in the electronic device 70.
  • the memory 72 can also be used to temporarily store data that has been output or will be output.
  • The processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, used to run the program code stored in the memory 72 or to process data, for example to run the multi-training-target-based institutional entity extraction program 73.
  • In some embodiments, the electronic device 70 is a terminal device such as a smartphone, a tablet computer, or a portable computer; in other embodiments, the electronic device 70 may be a server.
  • FIG. 2 only shows the electronic device 70 with the components 71-73, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the electronic device 70 may also include a user interface.
  • The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or earphones.
  • the user interface may also include a standard wired interface and a wireless interface.
  • the electronic device 70 may also include a display, and the display may also be referred to as a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, and the like.
  • the display is used for displaying information processed in the electronic device 70 and for displaying a visualized user interface.
  • the electronic device 70 may also include a touch sensor.
  • the area provided by the touch sensor for the user to perform touch operations is called the touch area.
  • the touch sensor here may be a resistive touch sensor, a capacitive touch sensor, or the like.
  • the touch sensor includes not only a contact type touch sensor, but also a proximity type touch sensor and the like.
  • the touch sensor may be a single sensor, or may be, for example, a plurality of sensors arranged in an array.
  • the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor.
  • the display and the touch sensor are stacked to form a touch display screen. The device detects the touch operation triggered by the user based on the touch screen.
  • the electronic device 70 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.
  • The memory 72, as a computer storage medium, may include an operating system and the multi-training-target-based entity extraction program 73; when the processor 71 executes the program 73 stored in the memory 72, the following steps are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • FIG. 3 is an internal logic diagram of an organization entity extraction program based on multiple training targets according to an embodiment of the present application.
  • The multi-training-target-based organization entity extraction program 73 may also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function.
  • Referring to FIG. 3, a program module diagram of a preferred embodiment of the multi-training-target-based organization entity extraction program 73 of FIG. 2, the program 73 may be divided into: a sample labeling module 74, a model training module 75, a model application module 76, and an institution entity extraction module 77. The functions or operation steps implemented by the modules 74 to 77 are similar to those described above and are not detailed here. Illustratively:
  • the sample labeling module 74 is used to obtain a training sample set and label each training sample in the training sample set with a named entity;
  • The model training module 75 is configured to train a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • the model application module 76 is used to obtain the text information to be detected, and perform sequence labeling on the text information to be detected through the named entity model;
  • the institution entity extraction module 77 is configured to extract relevant institution entities in the text information to be detected according to the sequence label.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores an organization entity extraction program 73 based on multiple training targets.
  • When the multi-training-target-based organization entity extraction program 73 is executed by a processor, the following operations are implemented:
  • acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
  • training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts the first vector feature set of the input text information, and a second main branch, which extracts the second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first and second vector feature sets;
  • performing sequence labeling on the acquired text information to be detected through the named entity model;
  • extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
  • the computer-readable storage medium may be non-volatile or volatile.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for extracting institutional entities based on multiple training targets. The method comprises: acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set; using the labeled training sample set to train a preset named entity model, such that the named entity model reaches a preset precision; performing, by means of the named entity model, sequence labeling on acquired text information to be detected; and extracting, according to the sequence labeling, the related institutional entities from the text information to be detected. The present invention further relates to blockchain technology: the training sample set can be stored in a blockchain. The method can effectively solve the problems of low efficiency and poor quality of existing institutional entity extraction methods.

Description

Organization entity extraction method, system and device based on multiple training targets

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 28, 2020, with application number 202010738252.X and the invention title "Method, System and Device for Extracting Institutional Entities Based on Multiple Training Targets", the entire content of which is incorporated into this application by reference.
Technical Field

This application relates to the technical field of information extraction, and in particular to a method, system, device, and storage medium for extracting institutional entities based on multiple training targets.

Background

At present, many academic scholar libraries, such as AMINER and ORCID, provide scholar information so that users can track the research direction and progress of a scholar or the scholar's team. For example, some expert team projects are deeply engaged in medical research and are committed to building an expert database for the medical field and constructing a complete expert knowledge graph.

However, in the construction of an expert knowledge graph, establishing the network of relationships between experts and institutions is a valuable but difficult task, because an expert is a self-selecting actor who moves between institutions; for example, expert A may have studied for a doctorate at institution A, worked at institution B, and pursued further study at institution C in between. Yet the common scholar libraries today (such as AMINER and ORCID) generally provide only a scholar's current institution. In fact, the inventor realized that a scholar's complete scientific research portrait is closely related to the institutions the scholar has been affiliated with.

Obviously, it is unrealistic to collect by hand the institutions involved in the education, work, and further-study experiences of hundreds of thousands of experts (taking Chinese medical scholars as an example). The inventor therefore realized that the large blocks of text on a scholar's homepage can be obtained from the Internet, and the problem becomes how to use artificial intelligence algorithms to extract structured knowledge about the scholar from this mixed text information.

For example, if the profile field of a scholar on the Internet reads "After graduating in June 1990, he went to the First Affiliated Hospital of Guangzhou Medical College to work in the Department of Oncology and Hematology, and obtained a doctorate in clinical medicine from Southern Medical University in June 2008. From December 2008 to May 2009, he studied at the Sun Yat-sen University Cancer Center.", then three institutional entities need to be extracted, namely "First Affiliated Hospital of Guangzhou Medical College", "Southern Medical University", and "Sun Yat-sen University", and they must be identified as belonging to "work experience", "education experience", and "short-term study experience", respectively.

Based on the above problems, an efficient and high-quality method for extracting institutional entities is urgently needed.
Summary of the Invention

This application provides a method, system, electronic device, and computer storage medium for extracting institutional entities based on multiple training targets, whose main purpose is to solve the low efficiency and poor quality of existing institutional entity extraction methods.

To achieve the above objective, this application provides a method for extracting institutional entities based on multiple training targets, the method including the following steps:

acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy, where the named entity model includes a first main branch and a second main branch: the first main branch extracts a first vector feature set of the input text information, the second main branch extracts a second vector feature set of the input text information, and the second main branch also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the acquired text information to be detected through the named entity model;

extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.

In addition, this application also provides an institutional entity extraction system based on multiple training targets, the system including:

a sample labeling unit, which obtains a training sample set and performs named entity labeling on each training sample in the training sample set;

a model training unit, which trains a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy; the named entity model includes a first main branch, which extracts a first vector feature set of the input text information, and a second main branch, which extracts a second vector feature set of the input text information and also performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, which performs sequence labeling on the acquired text information to be detected through the named entity model;

an institution entity extraction unit, which extracts the relevant institutional entities from the text information to be detected according to the sequence labeling.

In addition, to achieve the above objective, this application also provides an electronic device, which includes a memory, a processor, and a multi-training-target-based institutional entity extraction program that is stored in the memory and can run on the processor; when the program is executed by the processor, the steps of the method described above are implemented: acquiring the training sample set and labeling it with named entities, training the named entity model with the first and second main branches described above, performing sequence labeling on the acquired text information to be detected, and extracting the relevant institutional entities according to the sequence labeling.

In addition, to achieve the above objective, this application also provides a computer-readable storage medium storing a multi-training-target-based institutional entity extraction program; when the program is executed by a processor, the same steps of the method described above are implemented.

This application can effectively avoid error propagation. In addition, the multi-training-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of entity boundaries is much more stable than that of a traditional NER model.
Description of the Drawings

FIG. 1 is a flowchart of a preferred embodiment of the method for extracting institutional entities based on multiple training targets according to an embodiment of this application;

FIG. 2 is a schematic structural diagram of a preferred embodiment of an electronic device according to an embodiment of this application;

FIG. 3 is a schematic diagram of the internal logic of the multi-training-target-based organization entity extraction program according to an embodiment of this application.

The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

In the following description, numerous specific details are set forth for purposes of explanation, in order to provide a comprehensive understanding of one or more embodiments. It is evident, however, that the embodiments can also be implemented without these specific details.
The technical solution of this application can be applied to the fields of artificial intelligence, blockchain, and/or big data technology; the data involved, such as the training sample set, can be stored in a database or in a blockchain, for example via distributed blockchain storage, which this application does not limit.

Before describing the embodiments of the method provided by this application, it should be noted that traditional entity extraction mainly uses one of two approaches. The first is a staged training model: a named entity extraction model is first trained to identify all institutional entities, e.g. "Southern Medical University" is tagged [B-ORG, I-ORG, I-ORG, I-ORG, E-ORG]; a text classification model then determines whether the entity belongs to the experience type "work experience (JOB)", "education experience (EDU)", or "short-term study experience (STU)". The obvious drawback of this solution is that the errors of the first model are passed to the second model, which amplifies them.

Another common solution is to train an end-to-end named entity extraction model, such as LSTM+CRF, with a unified tag for each entity: "Southern Medical University" is directly tagged [B-EDU, I-EDU, I-EDU, I-EDU, I-EDU, E-EDU] and "Sun Yat-sen University" is directly tagged [B-STU, I-STU, I-STU, E-STU]. This avoids the error propagation of the traditional separately trained scheme.

However, a simple LSTM+CRF named entity extraction model still cannot solve two problems peculiar to the fine-grained extraction of institutions from profiles. First, the same entity carries different labels in different contexts: "Shanghai Sixth People's Hospital" is the "education experience" institution of doctor A, the "work experience" institution of doctor B, and both the "education experience" and the "work experience" institution of doctor C, so capturing contextual information here is harder than in a general named entity extraction problem. Second, the boundary problem: to keep the input structured knowledge uniform, institutions are extracted at the granularity of independent units (universities, hospitals, etc.). For "Sun Yat-sen University Cancer Center", for example, we want the final result to recognize the level of "Sun Yat-sen University" while ignoring "Cancer Center", whereas "Beijing Cancer Center" is itself an independent entity. Obviously, the traditional end-to-end named entity extraction model cannot do this; hence a more efficient and higher-quality institutional entity extraction method is urgently needed.

Specific embodiments of this application are described in detail below with reference to the accompanying drawings.

Embodiment 1
为了说明本申请提供的基于多训练目标的机构实体抽取方法,图1示出了根据本申请提供的基于多训练目标的机构实体抽取方法的流程。In order to illustrate the method for extracting institutional entities based on multiple training targets provided in this application, FIG. 1 shows the flow of the method for extracting institutional entities based on multiple training objectives provided in this application.
如图1所示,本申请提供的基于多训练目标的机构实体抽取方法,包括:As shown in Figure 1, the method for extracting institutional entities based on multiple training targets provided by this application includes:
S110:获取训练样本集,并对该训练样本集内的各训练样本进行命名实体标注。S110: Obtain a training sample set, and label each training sample in the training sample set with a named entity.
需要说明的是,此处的样本即为一段包含机构实体的一段文字信息,例如,可以是入职简历中的一段话,也可以是网络中的学者主页上的一段文字信息。It should be noted that the sample here is a piece of text information that contains the entity of the institution. For example, it can be a paragraph from a job resume or a piece of text information on the homepage of a scholar on the Internet.
具体地,在对该训练样本集内的各训练样本进行命名实体标注的过程中,本申请使用的命名实体标注方法为BIO标注方式,其中,B用于标注机构实体的开头,I用于标注机构实体本身,O用于标注样本中与机构实体不相关的信息。Specifically, in the process of labeling each training sample in the training sample set, the named entity labeling method used in this application is the BIO labeling method, where B is used to label the beginning of the institutional entity, and I is used to label The institutional entity itself, O is used to mark information in the sample that is not related to the institutional entity.
此外,为实现后期模型的多目标训练,本申请需要对训练样本集中的每一个样本进行多种类型的标注,至少包括四种类型,例如:Boundary-tag,End-tag,Type-tage,以及unified-tag四类标签,不同类型的标签的标注方式不同,当然对应的标注功能也不同,Boundary-tag类型主要用于标注样本中的机构实体边界;End-tag类型主要用于标注样本中的机构实体的结束位置;Type-tage类型主要用于标注机构实体的实体类型,比如,毕业院校、工作场所,实习场所等等。unified-tag类型为最终的目标标签,待四种类型的标注都完成后,将该样本保存至训练样本集。In addition, in order to achieve the multi-target training of the later model, this application needs to perform multiple types of annotations on each sample in the training sample set, including at least four types, such as: Boundary-tag, End-tag, Type-tage, and There are four types of unified-tag tags. Different types of tags have different labeling methods. Of course, the corresponding labeling functions are also different. The Boundary-tag type is mainly used to label the boundaries of the institutional entities in the sample; the End-tag type is mainly used to label the samples in the sample. The end position of the institutional entity; Type-tage is mainly used to mark the entity type of the institutional entity, such as graduate colleges, workplaces, internship places, and so on. The unified-tag type is the final target tag. After the four types of labeling are completed, the sample is saved to the training sample set.
另外,需要强调的是,为进一步保证上述该训练样本集内数据的私密和安全性,该训练样本集可以存储于区块链的节点中。In addition, it should be emphasized that, in order to further ensure the privacy and security of the data in the training sample set, the training sample set can be stored in a node of the blockchain.
S120:使用标注完成的训练样本集对预设的命名实体模型进行训练,以使该命名实体模型达到预设精度;其中,该命名实体模型包括第一主干路和第二主干路,该第一主干路用于提取输入文本信息的第一向量特征集,该第二主干路用于提取该输入文本信息的第二向量特征集;并且,该第二主干路还用于根据该第一向量特征集和该第二向量特征集对该输入文本信息进行序列标注。S120: Use the marked training sample set to train a preset named entity model so that the named entity model achieves a preset accuracy; wherein, the named entity model includes a first trunk road and a second trunk road. The main road is used to extract the first vector feature set of the input text information, the second main road is used to extract the second vector feature set of the input text information; and the second main road is also used to extract the first vector feature set The set and the second vector feature set perform sequence labeling on the input text information.
需要说明的是,命名实体模型为自行设计的一个新型的序列标注模型,该模型结合了 多个目标的训练环节;具体地,该命名实体模型包括两条主干路,该第一主干路内设置有第一神经网络模型LSTM1,该第一主干路通过该第一神经网络模型LSTM1提取该输入文本信息(对应训练样本或后期的待检测文本信息)的第一向量特征集;该第二主干路内设置有第二神经网络模型LSTM2,该第二主干路通过该第二神经网络模型LSTM1提取该输入文本信息的第二向量特征集。It should be noted that the named entity model is a new type of sequence labeling model designed by ourselves, which combines the training links of multiple targets; specifically, the named entity model includes two main roads, and the first main road is set There is a first neural network model LSTM1, the first main road extracts the first vector feature set of the input text information (corresponding to the training sample or the later text information to be detected) through the first neural network model LSTM1; the second main road A second neural network model LSTM2 is set inside, and the second main road extracts the second vector feature set of the input text information through the second neural network model LSTM1.
The first backbone branches into a first branch and a second branch. A first predictive classifier (predictor) is arranged in the first branch and labels the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and labels the end positions of the first vector feature set according to the End-tag tag type.
Specifically, after feature extraction by LSTM1, the input text information yields a corresponding first vector feature set, denoted h1, which is then passed simultaneously to the first branch and the second branch. The first branch corresponds to the first predictive classifier, which labels the entity boundaries y_boundary_tag from the first vector feature set h1, corresponding to the Boundary-tag labels; the second branch corresponds to the second predictive classifier, which labels the end positions y_end_tag from the first vector feature set h1, corresponding to the End-tag labels.
Here, for the second backbone, after feature extraction by LSTM2 the text information yields a corresponding second vector feature set, denoted h2, and once the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch. A third predictive classifier is arranged in the third branch and labels the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
Specifically, the second vector feature set h2 is passed simultaneously to the third branch and the final output branch. The third branch corresponds to the third predictive classifier (predictor), which labels the entity types y_type_tag from the second vector feature set h2 of the input text information, corresponding to the Type-tag labels; the final output branch corresponds to the total predictive classifier (SC-BG), which produces the final labeling y_unified_tag of the input text information from the first vector feature set h1 and the second vector feature set h2, corresponding to the unified-tag labels.
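To make the data flow concrete, a minimal PyTorch-style sketch of the two-backbone architecture described above is given below. The class name, embedding layer, hidden sizes, and linear predictors are assumptions made for illustration, since the application does not specify them, and the SC-BG total predictor is reduced here to a simple placeholder that consumes both h1 and h2.

```python
import torch
import torch.nn as nn

class MultiTargetNER(nn.Module):
    """Sketch of the two-backbone named entity model: LSTM1 feeds the boundary
    and end predictors, LSTM2 feeds the type predictor, and the total predictor
    (SC-BG, simplified here to a linear layer) consumes both h1 and h2."""

    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 n_boundary, n_end, n_type, n_unified):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # first backbone
        self.lstm2 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # second backbone
        self.boundary_clf = nn.Linear(hidden_dim, n_boundary)  # first predictive classifier
        self.end_clf = nn.Linear(hidden_dim, n_end)            # second predictive classifier
        self.type_clf = nn.Linear(hidden_dim, n_type)          # third predictive classifier
        # Simplified stand-in for the SC-BG total predictive classifier.
        self.unified_clf = nn.Linear(2 * hidden_dim, n_unified)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        h1, _ = self.lstm1(x)  # first vector feature set
        h2, _ = self.lstm2(x)  # second vector feature set
        return {
            "boundary": self.boundary_clf(h1).softmax(-1),  # y_boundary_tag
            "end":      self.end_clf(h1).softmax(-1),       # y_end_tag
            "type":     self.type_clf(h2).softmax(-1),      # y_type_tag
            "unified":  self.unified_clf(torch.cat([h1, h2], -1)).softmax(-1),
        }
```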
It should be noted that the LSTM (including LSTM1 and LSTM2) is an existing, commonly used neural network model whose internal structure belongs to the prior art and is not repeated here. Such a neural network model yields a set of vector features (h1 or h2) from the input text information; of course, each LSTM must be used together with its predictive classifiers. Once the LSTMs and the predictive classifiers connected to them (the first predictive classifier, the second predictive classifier, the third predictive classifier, and the total predictive classifier) have been trained on the training samples in the training sample set, the feature vectors output by each LSTM acquire the required association with the respective predictive classifiers. This association can be expressed by the model's trainable parameters W_1; when W_1 reaches the preset accuracy, the feature vectors output by the LSTMs are the required feature vectors.
It should be further noted that once the above model has been trained with all the training samples in the training sample set, the accuracy of the named entity model can reach the required preset accuracy. At that point, the vector features h1 extracted by the named entity model have acquired the required association with Boundary-tag and End-tag, and the vector features h2 extracted by the named entity model have acquired the required association with Type-tag. When the vector features h1 and h2 are used to recognize the text information to be detected, the labeling characteristics of Boundary-tag, End-tag, and Type-tag can be applied directly, so that y_boundary_tag, y_end_tag, and y_type_tag assist in improving the accuracy of y_unified_tag.
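A plausible multi-target training step for this model is sketched below. The application does not state how the four objectives are combined, so summing four equally weighted losses is an assumption made here for illustration; the sketch reuses the hypothetical MultiTargetNER class from the previous example.

```python
import torch.nn.functional as F

def training_step(model, optimizer, token_ids, targets):
    """One hypothetical multi-target update: the four objectives are trained
    jointly. `targets` maps each head name to a (batch, seq_len) index tensor."""
    out = model(token_ids)  # each head yields (batch, seq_len, n_labels) probabilities
    loss = sum(
        # nll_loss expects log-probabilities with shape (batch, n_labels, seq_len)
        F.nll_loss(out[k].clamp_min(1e-9).log().transpose(1, 2), targets[k])
        for k in ("boundary", "end", "type", "unified")
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```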
More specifically, an activation function is arranged in each of the first predictive classifier, the second predictive classifier, and the third predictive classifier, and the first vector feature set or the second vector feature set is labeled after passing through this activation function. The computation of the activation function is as follows:

$z_t = W_1 \cdot h_t$

$\hat{y} = \mathrm{Softmax}(z_t)$

where $W_1$ is a parameter of the named entity model that needs to be trained and is associated with the tag type of the predictive classifier, $h_t$ denotes a feature vector of the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result. The Softmax function is a normalization function that maps the values of $z_t$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_t$, and $f(i)$ is the softmax value of the i-th element.
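The following short NumPy sketch illustrates this activation computation on made-up numbers: a hypothetical weight matrix maps a feature vector h_t to scores z_t, and Softmax turns the scores into values in (0, 1) that sum to 1.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # hypothetical weights: 3 labels (B, I, O), 4 features
h_t = rng.normal(size=4)       # hypothetical feature vector at time t
z_t = W1 @ h_t
y_hat = softmax(z_t)
print(y_hat, y_hat.sum())      # three values in (0, 1) summing to 1.0
```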
To ease understanding of the data processing flow of the above named entity model, the flow and computation of data in the model are described in detail below, taking "毕业于上海交通大学医学院" ("graduated from Shanghai Jiao Tong University School of Medicine") as a concrete example of the input text information.
Specifically, the LSTM layer in the first backbone at the front end of the named entity model is denoted LSTM1; it outputs the first vector feature set, denoted h1, whose output vector at time t is denoted $h^1_t$. After the Softmax activation function, this is used to predict the Boundary-tag labels. For example, the output corresponding to "毕业于上海交通大学医学院" should be "毕(O)业(O)于(O)上(B)海(I)交(I)通(I)大(I)学(I)医(O)学(O)院(O)". The output is denoted y_boundary_tag.
The computation is as follows:

$z_b = W_b \cdot h^1_t$

$y_{boundary\_tag} = \mathrm{Softmax}(z_b)$

where $W_b$ is the parameter of the first predictive classifier in the model that needs to be trained, and the Softmax function is the normalization function that maps the values of $z_b$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_b$, and $f(i)$ is the softmax value of the i-th element.
At the same time, the second predictive classifier predicts the End-tag labels, that is, it predicts for each position whether the corresponding text is 0 (not the end position of an entity) or 1 (the end position of an entity). For example, the output corresponding to "毕业于上海交通大学医学院" is "毕(0)业(0)于(0)上(0)海(0)交(0)通(0)大(0)学(1)医(0)学(0)院(0)". The output is denoted y_end_tag.
The computation is as follows:

$z_e = W_e \cdot h^1_t$

$y_{end\_tag} = \mathrm{Softmax}(z_e)$

where $W_e$ is the parameter of the second predictive classifier in the model that needs to be trained.
By continuously training and optimizing the first backbone of the model (the first predictive classifier and the second predictive classifier) with the training sample set, the output h1 of LSTM1 in the first backbone learns both the Boundary-tag and the End-tag classification characteristics.
In a traditional CRF, however, every token of the text information is treated alike. Since many institutions end with "学院" ("college/school"), a CRF prediction often marks "上海交通大学医学院" as one whole entity; yet the desired final result should recognize the granularity of "上海交通大学" (Shanghai Jiao Tong University) while ignoring the lower-level entity "医学院" (School of Medicine). Recognition of entity boundaries therefore needs to be strengthened, and the first backbone of the named entity model provided in this application effectively adds a boundary constraint on the entities and realizes the corresponding prediction function.
In addition, the LSTM layer in the second backbone at the front end of the named entity model is denoted LSTM2. After the input text information enters the model, LSTM2 outputs the second vector feature set, denoted h2, whose output vector at time t is denoted $h^2_t$. The softmax function is then used to predict the Type-tag labels, that is, to predict the corresponding classification type, such as JOB (work unit) or EDU (education unit). For example, the output corresponding to "毕业于上海交通大学医学院" should be "毕(O)业(O)于(O)上(EDU)海(EDU)交(EDU)通(EDU)大(EDU)学(EDU)医(O)学(O)院(O)". The output is denoted y_type_tag.
The computation is as follows:

$z_{type} = W_t \cdot h^2_t$

$y_{type\_tag} = \mathrm{Softmax}(z_{type})$

where $W_t$ is the parameter of the third predictive classifier in the model that needs to be trained.
In addition, for the main prediction part (corresponding to the total predictive classifier, SC-BG), the prediction components BG (boundary guide) and SC (sentiment consistency) are introduced to further integrate the data of the first vector feature set h1 and the second vector feature set h2 together with their internal hidden characteristics, yielding the final prediction result corresponding to the unified-tag labels. For example, the output corresponding to "毕业于上海交通大学医学院" is "毕(O)业(O)于(O)上(B-EDU)海(I-EDU)交(I-EDU)通(I-EDU)大(I-EDU)学(I-EDU)医(O)学(O)院(O)", which is the final target label. The final output is denoted y_unified_tag.
Specifically, a first prediction component SC and a second prediction component BG are arranged in the total predictive classifier. The first prediction component SC is used to optimize the second vector feature set so as to strengthen the connection between the current vector feature and the feature of the previous moment within the second vector feature set; the second prediction component BG is used to produce the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag tag type.
For the SC component, the input is h2 and the output is a set of vector features, denoted h3, whose output vector at time t is denoted $h^3_t$. The computation gates the current input $h^2_t$ against the previous output through the sigmoid function and the ⊙ operator (the two gating formulas appear only as images PCTCN2020118331-appb-000020 and PCTCN2020118331-appb-000021 in the original), where the sigmoid function is:

$\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$
It should be noted that the ⊙ operator is a preset linear operator, for example A ⊙ B = 3A + 2B; any operator satisfying such a linear relation suffices here.
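Since the exact SC gating formulas survive only as images in the original, the sketch below shows one plausible reading under explicit assumptions: a sigmoid gate computed from the current input and the previous output, combined through the ⊙ operator defined above. The gate parameterization (Wg, Ug) and the combination rule are assumptions, not the application's stated equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_op(a, b):
    """The preset linear operator from the description: A ⊙ B = 3A + 2B."""
    return 3 * a + 2 * b

def sc_component(h2, Wg, Ug):
    """Hypothetical SC pass: h3_t strengthens the tie between the current
    feature h2_t and the previous output h3_{t-1} via a sigmoid gate.
    h2: (seq_len, d) array; Wg, Ug: (d, d) assumed gate matrices."""
    h3 = np.zeros_like(h2)
    prev = np.zeros(h2.shape[1])
    for t in range(h2.shape[0]):
        g = sigmoid(Wg @ h2[t] + Ug @ prev)           # assumed gate form
        h3[t] = linear_op(g * h2[t], (1 - g) * prev)  # assumed combination via ⊙
        prev = h3[t]
    return h3
```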
For the BG component, the inputs are h1 and h3, and the output is the final label (unified-tag), denoted y_unified_tag. The prediction proceeds as follows. BG contains a transition matrix $W_{tr}$ from the boundary-tag space to the unified-tag space (its dimensions appear as image PCTCN2020118331-appb-000023 in the original), where $B_i$ is the unified-tag label set {B-EDU, I-EDU, B-STU, I-STU, O, …} and $|B_i|$ is the size of that set.

Through the transition matrix, the original $z_b$ is transformed into:

$z'_u = W_{tr}^{\top} \cdot z_b$

where $z'_u$ can be regarded as the final label predicted from the boundary information, and $z_b$ is the intermediate quantity of the first predictive classifier (see the specific embodiment of the first predictive classifier). A weight $a_t$ for the label $z'_u$ is computed from $z_b$ itself (the computation of $c_t$ appears as image PCTCN2020118331-appb-000025 in the original):

$a_t = \epsilon \, c_t$

where $\epsilon$ is a hyperparameter of the prediction. The final label is then computed by combining $z'_u$, weighted by $a_t$, with the prediction obtained from h3 (the two combination formulas appear as images PCTCN2020118331-appb-000026 and PCTCN2020118331-appb-000027 in the original), and the final output is denoted y_unified_tag.
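The BG combination is likewise only partially recoverable from the original, so the sketch below is one plausible interpretation under stated assumptions: the boundary scores z_b are projected into the unified-tag space through the transition matrix, c_t is taken here as the confidence of the boundary prediction, and the final distribution mixes the boundary-guided scores with a unified-tag score computed from h3 by a hypothetical weight matrix Wu.

```python
import numpy as np

def bg_component(z_b, h3_t, W_tr, Wu, eps=0.5):
    """Hypothetical BG pass for one time step.
    z_b:  boundary scores (|Bb|,);  W_tr: (|Bb|, |Bi|) transition matrix;
    h3_t: SC output (d,);           Wu:   (|Bi|, d) assumed unified score matrix."""
    z_u_prime = W_tr.T @ z_b                    # final label predicted from boundary info
    p_b = np.exp(z_b) / np.exp(z_b).sum()
    c_t = p_b.max()                             # assumed: confidence of the boundary prediction
    a_t = eps * c_t                             # weight of the boundary-guided label
    z_u = Wu @ h3_t                             # unified score from the SC output
    scores = a_t * z_u_prime + (1 - a_t) * z_u  # assumed mixing rule
    return np.exp(scores) / np.exp(scores).sum()  # y_unified_tag distribution
```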
It should be noted that, as the training sample set trains the named entity model, W_1 (including W_b, W_e, W_t, and W_tr) changes accordingly and approaches the optimal values. Once the named entity model has been trained, W_1 generally lies near the optimum, and the named entity model can then be used to extract institutional entity information from the text information to be extracted.
S130: Obtain the text information to be detected, and perform sequence labeling on the text information to be detected through the named entity model.
Specifically, the text information to be detected that is related to an actor, such as personal resume information or personal homepage information, can be obtained from the Internet or from a database.
It should be noted that, after the text information to be detected is labeled by the named entity model, the corresponding label sequences are output at the four outputs, namely y_boundary_tag, y_end_tag, y_type_tag, and y_unified_tag. Since y_unified_tag already contains the sequence characteristics of y_boundary_tag, y_end_tag, and y_type_tag, in practical applications it suffices to obtain only the y_unified_tag sequence labeling of the text information to be detected.
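As an illustration of this last step, the helper below decodes a y_unified_tag sequence into typed entity spans, using the example labeling given earlier; the function itself is an illustrative assumption rather than part of the application.

```python
def decode_unified(chars, tags):
    """Hypothetical decoder: collect (entity_text, type) spans from a
    unified-tag sequence such as B-EDU / I-EDU / O."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf:
            buf.append(ch)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities

tags = ["O", "O", "O", "B-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "I-EDU", "O", "O", "O"]
print(decode_unified(list("毕业于上海交通大学医学院"), tags))
# [('上海交通大学', 'EDU')]
```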
S140: Extract the relevant institutional entities in the text information to be detected according to the sequence labeling.

Specifically, the relevant institutional entities of the actor in the text information to be detected are extracted according to the y_unified_tag sequence labeling.
It should be noted that, since the y_unified_tag sequence labeling contains the characteristics of the y_end_tag sequence labeling, the end position of the required institutional entity can be determined precisely from the y_unified_tag sequence labeling, avoiding inaccurate positioning of institutional entities. Furthermore, since the y_unified_tag sequence labeling also contains the characteristics of the y_type_tag sequence labeling, the category of an institutional entity, such as "work experience", "education experience", or "short-term training experience", can be determined precisely from the y_unified_tag sequence labeling.
Of course, by modifying the training targets, the method can be further extended to finer-grained institution extraction, for example extracting second-level institutions ("医学院" (School of Medicine) in "上海交通大学医学院"). In that case the unified-tag labeling is "上(B-EDU)海(I-EDU)交(I-EDU)通(I-EDU)大(I-EDU)学(I-EDU)医(I-EDU)学(I-EDU)院(I-EDU)", the Boundary-tag labeling is "上(B)海(I)交(I)通(I)大(I)学(I)医(I)学(I)院(I)", and, by the same reasoning as above, the end label 1 should be placed at the position of "院". The framework of the model needs no change, and extraction of second-level institutions is thereby achieved.
As can be seen from the above technical solution, the institutional entity extraction method based on multiple training targets proposed in this application designs a single named entity model trained on multiple targets; compared with separately training a traditional named entity extraction model and a text classification model, this effectively avoids error propagation. Moreover, addressing the problems that conventional named entity extraction models such as LSTM+CRF cannot properly distinguish different types of the same entity and cannot recognize boundaries accurately, the multi-target named entity model designed in this application strengthens the extraction of boundary features and semantic features and can significantly improve the final prediction accuracy; in particular, its capture of boundaries is far more stable than that of a traditional NER model.
It should be understood that the sequence numbers of the steps in the foregoing embodiment do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
Embodiment 2
Corresponding to the above method, this application further provides an institutional entity extraction system based on multiple training targets, the system including:

a sample labeling unit, configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

a model training unit, configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, configured to obtain the text information to be detected and perform sequence labeling on the text information to be detected through the named entity model; and

an institutional entity extraction unit, configured to extract the relevant institutional entities in the text information to be detected according to the sequence labeling.
Embodiment 3
This application further provides an electronic device 70. FIG. 2 is a schematic structural diagram of a preferred embodiment of the electronic device 70 provided by this application.

In this embodiment, the electronic device 70 may be a terminal device with computing capability, such as a server, a smartphone, a tablet computer, a portable computer, or a desktop computer.

The electronic device 70 includes a processor 71 and a memory 72.

The memory 72 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, for example a hard disk of the electronic device 70. In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the electronic device 70.

In this embodiment, the readable storage medium of the memory 72 is generally used to store the institutional entity extraction program 73 based on multiple training targets installed in the electronic device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.

In some embodiments, the processor 71 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 72, for example the institutional entity extraction program 73 based on multiple training targets.

In some embodiments, the electronic device 70 is a terminal device such as a smartphone, a tablet computer, or a portable computer. In other embodiments, the electronic device 70 may be a server.

FIG. 2 shows only the electronic device 70 with the components 71 to 73, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead.

Optionally, the electronic device 70 may further include a user interface, which may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a loudspeaker or earphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 70 may further include a display, which may also be called a display screen or a display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch liquid crystal display, an organic light-emitting diode (OLED) touch display, or the like. The display is used to show the information processed in the electronic device 70 and to present a visual user interface.

Optionally, the electronic device 70 may further include a touch sensor. The area provided by the touch sensor for the user's touch operations is called the touch area. The touch sensor may be a resistive touch sensor, a capacitive touch sensor, or the like, and includes not only contact touch sensors but also proximity touch sensors. Moreover, the touch sensor may be a single sensor or a plurality of sensors arranged, for example, in an array.

In addition, the area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, the display and the touch sensor are stacked to form a touch display screen, on the basis of which the device detects touch operations triggered by the user.

Optionally, the electronic device 70 may further include a radio frequency (RF) circuit, sensors, an audio circuit, and the like, which are not detailed here.
In the device embodiment shown in FIG. 2, the memory 72, as a computer storage medium, may include an operating system and the institutional entity extraction program 73 based on multiple training targets; when executing the institutional entity extraction program 73 based on multiple training targets stored in the memory 72, the processor 71 implements the following steps:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

obtaining the text information to be detected, and performing sequence labeling on the text information to be detected through the named entity model; and

extracting the relevant institutional entities in the text information to be detected according to the sequence labeling.
In this embodiment, FIG. 3 is a schematic diagram of the internal logic of the institutional entity extraction program based on multiple training targets according to an embodiment of this application. As shown in FIG. 3, the institutional entity extraction program 73 based on multiple training targets may also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to complete this application. A module referred to in this application is a series of computer program instruction segments capable of completing a specific function. FIG. 3 is a program module diagram of a preferred embodiment of the institutional entity extraction program 73 based on multiple training targets in FIG. 2; the program may be divided into a sample labeling module 74, a model training module 75, a model application module 76, and an institutional entity extraction module 77. The functions or operation steps implemented by the modules 74 to 77 are similar to those described above and are not detailed here; exemplarily:

the sample labeling module 74 is configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

the model training module 75 is configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

the model application module 76 is configured to obtain the text information to be detected and perform sequence labeling on the text information to be detected through the named entity model; and

the institutional entity extraction module 77 is configured to extract the relevant institutional entities in the text information to be detected according to the sequence labeling.
Embodiment 4
This application further provides a computer-readable storage medium. The computer-readable storage medium stores an institutional entity extraction program 73 based on multiple training targets; when the institutional entity extraction program 73 based on multiple training targets is executed by a processor, the following operations are implemented:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, where the named entity model includes a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from the input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is also used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

obtaining the text information to be detected, and performing sequence labeling on the text information to be detected through the named entity model; and

extracting the relevant institutional entities in the text information to be detected according to the sequence labeling.
The specific implementation of the computer-readable storage medium provided by this application is substantially the same as that of the above institutional entity extraction method based on multiple training targets and of the electronic device, and is not repeated here.
Optionally, the computer-readable storage medium may be non-volatile or volatile.
It should be noted that the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another by cryptographic methods; each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
It should be further noted that, herein, the terms "include", "comprise", or any of their variants are intended to cover non-exclusive inclusion, so that a process, apparatus, article, or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, apparatus, article, or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, apparatus, article, or method that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods of the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structural or equivalent process transformation made using the contents of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1. An institutional entity extraction method based on multiple training targets, applied to an electronic device, wherein the method comprises:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the obtained text information to be detected through the named entity model; and

extracting relevant institutional entities in the text information to be detected according to the sequence labeling.
2. The institutional entity extraction method based on multiple training targets according to claim 1, wherein the training sample set is stored in a blockchain; and, in the process of performing named entity labeling on each training sample in the training sample set, the BIO labeling scheme is used, wherein B is used to label the beginning of an institutional entity, I is used to label the institutional entity itself, and O is used to label information in the training sample that is unrelated to any institutional entity.
3. The institutional entity extraction method based on multiple training targets according to claim 2, wherein, in the process of performing named entity labeling on each training sample in the training sample set, the tag types used comprise Boundary-tag, End-tag, Type-tag, and unified-tag, wherein the Boundary-tag type is used to label the boundaries of the institutional entities in the training sample, the End-tag type is used to label the end positions of the institutional entities in the training sample, the Type-tag type is used to label the entity types of the institutional entities in the training sample, and the unified-tag type serves as the final target label.
4. The institutional entity extraction method based on multiple training targets according to claim 3, wherein, after the first backbone has extracted the first vector feature set, the first backbone branches into a first branch and a second branch, and after the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch, wherein:

a first predictive classifier is arranged in the first branch and is used to label the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and is used to label the end positions of the first vector feature set according to the End-tag tag type; and

a third predictive classifier is arranged in the third branch and is used to label the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and is used to produce the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
5. The institutional entity extraction method based on multiple training targets according to claim 4, wherein a first neural network model LSTM1 is arranged in the first backbone, and the first backbone extracts the first vector feature set of the input text information through the first neural network model LSTM1; and a second neural network model LSTM2 is arranged in the second backbone, and the second backbone extracts the second vector feature set of the input text information through the second neural network model LSTM2.
6. The institutional entity extraction method based on multiple training targets according to claim 5, wherein an activation function is arranged in each of the first predictive classifier, the second predictive classifier, and the third predictive classifier, and the first vector feature set or the second vector feature set is labeled after passing through the activation function, the computation of the activation function being as follows:

$z_t = W_1 \cdot h_t$

$\hat{y} = \mathrm{Softmax}(z_t)$

wherein $W_1$ is a parameter of the named entity model that needs to be trained and is associated with the tag type of the predictive classifier, $h_t$ denotes a feature vector of the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $z_t$ into the interval (0, 1):

$f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

wherein $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $z_t$, and $f(i)$ is the softmax value of the i-th element.
7. The institutional entity extraction method based on multiple training targets according to claim 6, wherein a first prediction component SC and a second prediction component BG are arranged in the total predictive classifier, wherein the first prediction component SC is used to optimize the second vector feature set so as to strengthen the connection between the current vector feature and the vector feature of the previous moment within the second vector feature set; and the second prediction component BG is used to produce the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag tag type.
8. An institutional entity extraction system based on multiple training targets, wherein the system comprises:

a sample labeling unit, configured to obtain a training sample set and perform named entity labeling on each training sample in the training sample set;

a model training unit, configured to train a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

a model application unit, configured to perform sequence labeling on the obtained text information to be detected through the named entity model; and

an institutional entity extraction unit, configured to extract relevant institutional entities in the text information to be detected according to the sequence labeling.
9. An electronic device, wherein the electronic device comprises a memory, a processor, and an institutional entity extraction program based on multiple training targets that is stored in the memory and executable on the processor, and the institutional entity extraction program based on multiple training targets, when executed by the processor, implements the following steps:

obtaining a training sample set, and performing named entity labeling on each training sample in the training sample set;

training a preset named entity model with the labeled training sample set so that the named entity model reaches a preset accuracy, wherein the named entity model comprises a first backbone and a second backbone, the first backbone is used to extract a first vector feature set from input text information, the second backbone is used to extract a second vector feature set from the input text information, and the second backbone is further used to perform sequence labeling on the input text information according to the first vector feature set and the second vector feature set;

performing sequence labeling on the obtained text information to be detected through the named entity model; and

extracting relevant institutional entities in the text information to be detected according to the sequence labeling.
10. The electronic device according to claim 9, wherein the training sample set is stored in a blockchain; in the process of performing named entity labeling on each training sample in the training sample set, the BIO labeling scheme is used, wherein B is used to label the beginning of an institutional entity, I is used to label the institutional entity itself, and O is used to label information in the training sample that is unrelated to any institutional entity; and wherein, in the process of performing named entity labeling on each training sample in the training sample set, the tag types used comprise Boundary-tag, End-tag, Type-tag, and unified-tag, wherein the Boundary-tag type is used to label the boundaries of the institutional entities in the training sample, the End-tag type is used to label the end positions of the institutional entities in the training sample, the Type-tag type is used to label the entity types of the institutional entities in the training sample, and the unified-tag type serves as the final target label.
11. The electronic device according to claim 10, wherein, after the first backbone has extracted the first vector feature set, the first backbone branches into a first branch and a second branch, and after the second backbone has extracted the second vector feature set, the second backbone branches into a third branch and a final output branch, wherein:

a first predictive classifier is arranged in the first branch and is used to label the entity boundaries of the first vector feature set according to the Boundary-tag tag type; a second predictive classifier is arranged in the second branch and is used to label the end positions of the first vector feature set according to the End-tag tag type; and

a third predictive classifier is arranged in the third branch and is used to label the entity types of the second vector feature set according to the Type-tag tag type; a total predictive classifier is arranged in the final output branch and is used to produce the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag tag type.
12. The electronic device according to claim 11, wherein a first neural network model LSTM1 is arranged in the first backbone, and the first backbone extracts the first vector feature set of the input text information through the first neural network model LSTM1; and a second neural network model LSTM2 is arranged in the second backbone, and the second backbone extracts the second vector feature set of the input text information through the second neural network model LSTM2.
13. The electronic device according to claim 12, wherein:
    An activation function is provided in each of the first prediction classifier, the second prediction classifier, and the third prediction classifier, and passing the first vector feature set or the second vector feature set through the activation function produces the labeling of that feature set; the activation function is computed as follows:

    $z = W_1 h$

    $\hat{y} = \mathrm{Softmax}(z)$

    where $W_1$ is a parameter of the named entity model to be trained, associated with the labeling type of the prediction classifier, $h$ denotes the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $W_1 h$ into the interval (0, 1):

    $f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

    where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $W_1 h$, and $f(i)$ is the softmax value of the i-th element.
14. The electronic device according to claim 13, wherein:
    A first prediction component SC and a second prediction component BG are provided in the overall prediction classifier; the first prediction component SC optimizes the second vector feature set so as to strengthen the link between the vector feature at the current moment and the vector feature at the previous moment within the second vector feature set;
    The second prediction component BG produces the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag labeling type.
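The claims name SC and BG but do not disclose their internals. One plausible reading of SC, offered strictly as an assumption, is a learned gate that mixes each timestep of the second feature set with its predecessor; BG would then concatenate the gated output with the first feature set and classify under the unified-tag scheme, as in the `unified_clf` head sketched earlier.

```python
import torch
import torch.nn as nn

class SC(nn.Module):
    """Hypothetical SC: gate each vector feature with the one from the previous moment."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h2):                             # h2: (batch, seq, dim)
        prev = torch.roll(h2, shifts=1, dims=1).clone()
        prev[:, 0] = 0.0                               # the first timestep has no predecessor
        g = torch.sigmoid(self.gate(torch.cat([h2, prev], dim=-1)))
        return g * h2 + (1 - g) * prev                 # the "optimized" second feature set
```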
15. A computer-readable storage medium, wherein the computer-readable storage medium stores an institutional entity extraction program based on multiple training targets, and when the program is executed by a processor, the following steps are implemented:
    Acquiring a training sample set, and performing named entity labeling on each training sample in the training sample set;
    Training a preset named entity model with the labeled training sample set until the named entity model reaches a preset accuracy, wherein the named entity model includes a first main path and a second main path, the first main path extracts a first vector feature set of the input text information, the second main path extracts a second vector feature set of the input text information, and the second main path further performs sequence labeling on the input text information according to the first vector feature set and the second vector feature set;
    Performing sequence labeling on the acquired text information to be detected through the named entity model;
    Extracting the relevant institutional entities from the text information to be detected according to the sequence labeling.
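The second step trains one model against several labeling targets at once. A common way to combine such targets, sketched here under the assumption of the `MultiTargetNER` model above and equal weighting across the four schemes (neither of which the claims specify), is to sum one cross-entropy term per tag scheme:

```python
import torch
import torch.nn.functional as F

def multi_target_loss(out, gold):
    """out: dict of per-token probability tensors, each (batch, seq, n_tags);
    gold: dict of integer (long) label tensors, each (batch, seq)."""
    total = 0.0
    for scheme in ("boundary", "end", "type", "unified"):
        log_probs = torch.log(out[scheme].clamp_min(1e-9))     # avoid log(0)
        total = total + F.nll_loss(log_probs.transpose(1, 2),  # (batch, n_tags, seq)
                                   gold[scheme])
    return total
```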
16. The computer-readable storage medium according to claim 15, wherein:
    The training sample set is stored in a blockchain, and the named entity labeling of each training sample in the training sample set uses the BIO scheme, in which:
    B marks the beginning of an institutional entity, I marks the body of an institutional entity, and O marks information in the training sample that is unrelated to any institutional entity;
    The labeling types used in labeling the training samples include Boundary-tag, End-tag, Type-tag, and unified-tag, wherein:
    The Boundary-tag type marks the boundaries of the institutional entities in the training sample, the End-tag type marks their end positions, the Type-tag type marks their entity types, and the unified-tag type serves as the final target label.
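To make the four schemes concrete, here is a single hypothetical training sample labeled under all of them; the sentence and the label strings are invented for illustration, since the claims name the schemes but not their exact label sets:

```python
# "平安科技在深圳成立" -- a made-up sentence containing one institutional entity.
tokens   = list("平安科技在深圳成立")
boundary = ["B", "I", "I", "I", "O", "O", "O", "O", "O"]          # Boundary-tag: entity span
end      = ["O", "O", "O", "E", "O", "O", "O", "O", "O"]          # End-tag: last token of the entity
etype    = ["ORG", "ORG", "ORG", "ORG", "O", "O", "O", "O", "O"]  # Type-tag: entity class
unified  = ["B-ORG", "I-ORG", "I-ORG", "I-ORG",
            "O", "O", "O", "O", "O"]                              # unified-tag: final target label
```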
17. The computer-readable storage medium according to claim 16, wherein, after the first main path has extracted the first vector feature set, the first main path branches into a first branch and a second branch, and after the second main path has extracted the second vector feature set, the second main path branches into a third branch and a final output branch; wherein,
    A first prediction classifier is provided in the first branch and marks the entity boundaries of the first vector feature set according to the Boundary-tag labeling type; a second prediction classifier is provided in the second branch and marks the end positions of the first vector feature set according to the End-tag labeling type;
    A third prediction classifier is provided in the third branch and marks the entity types of the first vector feature set according to the Type-tag labeling type; an overall prediction classifier is provided in the final output branch and produces the final labeling of the input text information according to the first vector feature set, the second vector feature set, and the unified-tag labeling type.
18. The computer-readable storage medium according to claim 17, wherein:
    A first neural network model, LSTM1, is provided in the first main path, and the first main path extracts the first vector feature set of the input text information through LSTM1;
    A second neural network model, LSTM2, is provided in the second main path, and the second main path extracts the second vector feature set of the input text information through LSTM2.
19. The computer-readable storage medium according to claim 18, wherein:
    An activation function is provided in each of the first prediction classifier, the second prediction classifier, and the third prediction classifier, and passing the first vector feature set or the second vector feature set through the activation function produces the labeling of that feature set; the activation function is computed as follows:

    $z = W_1 h$

    $\hat{y} = \mathrm{Softmax}(z)$

    where $W_1$ is a parameter of the named entity model to be trained, associated with the labeling type of the prediction classifier, $h$ denotes the first vector feature set or the second vector feature set, and $\hat{y}$ is the output labeling result; the Softmax function is a normalization function that maps the values of $W_1 h$ into the interval (0, 1):

    $f(i) = \dfrac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$

    where $[x_1, x_2, \ldots, x_i, \ldots, x_n]$ is an array denoting $W_1 h$, and $f(i)$ is the softmax value of the i-th element.
20. The computer-readable storage medium according to claim 19, wherein:
    A first prediction component SC and a second prediction component BG are provided in the overall prediction classifier; the first prediction component SC optimizes the second vector feature set so as to strengthen the link between the vector feature at the current moment and the vector feature at the previous moment within the second vector feature set;
    The second prediction component BG produces the final labeling of the input text information according to the optimized second vector feature set, the first vector feature set, and the unified-tag labeling type.
PCT/CN2020/118331 2020-07-28 2020-09-28 Mechanism entity extraction method, system and device based on multiple training targets WO2021139239A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010738252.X 2020-07-28
CN202010738252.XA CN111881692B (en) 2020-07-28 2020-07-28 Mechanism entity extraction method, system and device based on multiple training targets

Publications (1)

Publication Number Publication Date
WO2021139239A1

Family

ID=73201874

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118331 WO2021139239A1 (en) 2020-07-28 2020-09-28 Mechanism entity extraction method, system and device based on multiple training targets

Country Status (2)

Country Link
CN (1) CN111881692B (en)
WO (1) WO2021139239A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635280A (en) * 2018-11-22 2019-04-16 园宝科技(武汉)有限公司 A kind of event extraction method based on mark
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method
CN110162772A (en) * 2018-12-13 2019-08-23 北京三快在线科技有限公司 Name entity recognition method and device
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
CN110705294A (en) * 2019-09-11 2020-01-17 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and device
CN111428501A (en) * 2019-01-09 2020-07-17 北大方正集团有限公司 Named entity recognition method, recognition system and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075228B (en) * 2006-05-15 2012-05-23 松下电器产业株式会社 Method and apparatus for named entity recognition in natural language
US11593558B2 (en) * 2017-08-31 2023-02-28 Ebay Inc. Deep hybrid neural network for named entity recognition
KR102043353B1 (en) * 2017-12-04 2019-11-12 주식회사 솔루게이트 Apparatus and method for recognizing Korean named entity using deep-learning
CN110287479B (en) * 2019-05-20 2022-07-22 平安科技(深圳)有限公司 Named entity recognition method, electronic device and storage medium
CN110399616A (en) * 2019-07-31 2019-11-01 国信优易数据有限公司 Name entity detection method, device, electronic equipment and readable storage medium storing program for executing
CN110866115B (en) * 2019-10-16 2023-08-08 平安科技(深圳)有限公司 Sequence labeling method, system, computer equipment and computer readable storage medium


Also Published As

Publication number Publication date
CN111881692A (en) 2020-11-03
CN111881692B (en) 2023-01-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911398

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911398

Country of ref document: EP

Kind code of ref document: A1