CN113010638A - Entity recognition model generation method and device and entity extraction method and device - Google Patents

Entity recognition model generation method and device and entity extraction method and device Download PDF

Info

Publication number
CN113010638A
CN113010638A CN202110208364.9A CN202110208364A CN113010638A CN 113010638 A CN113010638 A CN 113010638A CN 202110208364 A CN202110208364 A CN 202110208364A CN 113010638 A CN113010638 A CN 113010638A
Authority
CN
China
Prior art keywords
sample
entity
target type
preset
statement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110208364.9A
Other languages
Chinese (zh)
Other versions
CN113010638B (en
Inventor
李凯
周晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Credit Service Co ltd
Original Assignee
Beijing Jindi Credit Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jindi Credit Service Co ltd filed Critical Beijing Jindi Credit Service Co ltd
Priority to CN202110208364.9A priority Critical patent/CN113010638B/en
Publication of CN113010638A publication Critical patent/CN113010638A/en
Application granted granted Critical
Publication of CN113010638B publication Critical patent/CN113010638B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide an entity recognition model generation method and apparatus, an entity extraction method and apparatus, a computer-readable storage medium, an electronic device, and a computer program. The method comprises the following steps: acquiring a first sample statement set; training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model; acquiring a second sample statement set; and training an initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set and the negative sample sentences and the corresponding negative sample marking information to obtain an entity classification model. The technical scheme of the method and the device can realize automatic extraction of the target type entity, and can accurately and comprehensively obtain the target type entity and the category of the target type entity in the text.

Description

Entity recognition model generation method and device and entity extraction method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating an entity identification model, a method and an apparatus for extracting an entity, an electronic device, a computer-readable storage medium, and a computer program.
Background
Named Entity Recognition (NER) refers to the process of identifying a specific object transaction name or symbol from text. The named entity recognition technology is an indispensable component in various natural language processing tasks such as information extraction, information retrieval, machine translation, question and answer systems and the like.
Currently, in many fields, it is necessary to extract specific types of entities from existing texts to provide various services for users. For example, parsing the public information of a bidding website may provide valuable information to businesses and users. The bidding field information is complex, and the extraction method widely adopted at present realizes information extraction by constructing a corresponding regular expression template. There are also some automated extraction techniques such as web page information extraction techniques based on web page structural features, web page information extraction techniques based on wrapper induction, and the like.
Disclosure of Invention
An object of the present disclosure is to provide a method and an apparatus for generating an entity identification model, a method and an apparatus for extracting an entity, an electronic device, a computer-readable storage medium, and a computer program, thereby solving the technical problems described in the background art at least to a certain extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a method of generating an entity recognition model, comprising: acquiring a first sample statement set, wherein sample statements in the first sample statement set comprise target type entities and have corresponding entity marking information for representing the target type entities; training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model; acquiring a second sample statement set, wherein the second sample statement set comprises positive sample statements and negative sample statements, the positive sample statements comprise preset-type target type entities and have corresponding positive sample marking information, and the negative sample statements comprise non-preset-type target type entities and have corresponding negative sample marking information; and training an initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set and the negative sample sentences and the corresponding negative sample marking information to obtain an entity classification model.
In an exemplary embodiment of the present disclosure, obtaining a first set of sample statements comprises: extracting a first sample statement set from a preset sample text, and determining a target type entity from sample statements included in the first sample statement set; and generating entity marking information corresponding to the sample sentences included in the sample sentence subset based on the positions of the target type entities in the sample sentences.
In an exemplary embodiment of the present disclosure, obtaining a second set of sample statements comprises: extracting an initial sample statement set from a preset sample text; determining sample sentences comprising preset categories of target type entities from the initial sample sentence set, and determining sample sentences comprising non-preset categories of target type entities; determining a statement pair consisting of a sample statement where a preset type of target type entity is located and the preset type of target type entity as a positive sample statement, and generating positive sample marking information representing the preset type of target type entity; determining a statement pair consisting of a sample statement where the target type entity of the non-preset category is located and the target type entity of the non-preset category as a negative sample statement, and generating negative sample marking information representing the target type entity of the non-preset category.
In an exemplary embodiment of the present disclosure, determining a sample statement including a target type entity of a non-preset category includes: determining a target type entity from the sample sentences in the initial sample sentence set by using a target type entity recognition model; comparing each determined target type entity with a preset type target type entity to obtain a non-preset type target type entity; and determining the sample statement where the target type entity of the non-preset type is located as the sample statement comprising the target type entity of the non-preset type.
In an exemplary embodiment of the present disclosure, determining a target type entity from sample statements comprised by a first set of sample statements comprises: determining a target type entity from the sample sentences included in the first sample sentence set by using at least one of the following modes: determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; and secondly, searching the target type entity from the sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
In an exemplary embodiment of the present disclosure, before extracting the first set of sample sentences from the preset sample text, the method further comprises: and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
According to a second aspect of the present disclosure, there is provided an entity extraction method, comprising: acquiring a text to be identified; inputting a text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by pre-training based on the method of the first aspect; determining sentences to be classified from texts to be recognized based on the target type entity; and inputting the statement to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity, wherein the entity classification model is obtained by training in advance based on the method of the first aspect.
In an exemplary embodiment of the present disclosure, acquiring a text to be recognized includes: acquiring an original text; and preprocessing the original text to obtain the text to be recognized which accords with the preset format.
In an exemplary embodiment of the present disclosure, determining a sentence to be classified from a text to be recognized based on a target type entity includes: and forming a statement pair by the statement where the target type entity is located and the target type entity, and determining the statement pair as the statement to be classified.
According to a third aspect of the present disclosure, there is provided an entity recognition model generation apparatus including: the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a first sample statement set, and sample statements in the first sample statement set comprise target type entities and have corresponding entity marking information for representing the target type entities; the first training module is used for training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model; the second obtaining module is used for obtaining a second sample statement set, wherein the second sample statement set comprises positive sample statements and negative sample statements, the positive sample statements comprise preset-type target type entities and have corresponding positive sample marking information, and the negative sample statements comprise non-preset-type target type entities and have corresponding negative sample marking information; and the second training module is used for training the initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set, and the negative sample sentences and the corresponding negative sample marking information to obtain the entity classification model.
In an exemplary embodiment of the present disclosure, the first obtaining module includes: the first extraction unit is used for extracting a first sample statement set from a preset sample text and determining a target type entity from sample statements included in the first sample statement set; and the generating unit is used for generating entity marking information corresponding to the sample sentences included in the sample sentence subset based on the positions of the target type entities in the sample sentences.
In an exemplary embodiment of the present disclosure, the second obtaining module includes: the second extraction unit is used for extracting an initial sample sentence set from a preset sample text; a first determining unit, configured to determine, from the initial sample statement set, a sample statement including a preset category of target type entities, and determine a sample statement including a non-preset category of target type entities; the second determining unit is used for determining a statement pair consisting of a sample statement where the preset type target type entity is located and the preset type target type entity as a positive sample statement and generating positive sample marking information representing the preset type target type entity; and the third determining unit is used for determining a statement pair consisting of the sample statement where the target type entity of the non-preset category is located and the target type entity of the non-preset category as a negative sample statement, and generating negative sample marking information representing the target type entity of the non-preset category.
In an exemplary embodiment of the present disclosure, the first determination unit includes: the first determining subunit is used for determining a target type entity from the sample sentences in the initial sample sentence set by using the target type entity recognition model; the comparison subunit is used for comparing each determined target type entity with a preset type target type entity to obtain a non-preset type target type entity; and the second determining subunit is used for determining the sample statement where the target type entity of the non-preset type is located as the sample statement including the target type entity of the non-preset type.
In an exemplary embodiment of the present disclosure, the first extraction unit is further configured to: determining a target type entity from the sample sentences included in the first sample sentence set by using at least one of the following modes: determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; and secondly, searching the target type entity from the sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
In an exemplary embodiment of the present disclosure, the apparatus further includes: and the preprocessing module is used for preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
According to a fourth aspect of the present disclosure, there is provided an entity extraction apparatus including: the acquisition module is used for acquiring a text to be recognized; the training module is used for inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by training in advance based on the method of the first aspect; the determining module is used for determining sentences to be classified from the texts to be recognized based on the target type entity; and the input module is used for inputting the statement to be classified into the pre-trained entity classification model to obtain entity class information representing the class of the target type entity, wherein the entity classification model is obtained by training in advance based on the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the above-described method via execution of the executable instructions.
According to a sixth aspect of the present disclosure, there is provided a computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method described above.
According to a seventh aspect of the present disclosure, there is provided a computer program comprising computer readable code which, when run on a device, executes instructions for implementing the steps of the above method.
As can be seen from the foregoing technical solutions, the entity identification model generation method and apparatus, the entity extraction method and apparatus, the electronic device, the computer-readable storage medium, and the computer program in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
in the embodiment of the disclosure, by obtaining a first sample statement set and a second sample statement set, training a target type entity recognition model based on the first sample statement set and training an entity classification model based on the second sample statement set by using a machine learning method, a model with high extraction accuracy and classification accuracy can be obtained. Compared with the existing regular expression-based entity extraction technology, the scheme provided by the embodiment of the disclosure has the advantages of low maintenance cost and good flexibility. Compared with the existing automatic entity extraction technology, the extraction granularity of the scheme provided by the embodiment of the disclosure is finer, and the accuracy is higher.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a system diagram to which the present disclosure is applicable;
FIG. 2 is a schematic flow chart diagram illustrating a method for generating an entity recognition model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic flow chart diagram of a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram illustrating a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 5 is a schematic flow chart diagram of a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating an entity extraction method according to another exemplary embodiment of the disclosure;
FIG. 7 is a schematic structural diagram of an entity recognition model generation apparatus according to an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of an entity recognition model generation apparatus according to another exemplary embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an entity extraction apparatus provided in an exemplary embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, apparatus, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. The symbol "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the present disclosure, unless otherwise expressly specified or limited, the terms "connected" and the like are to be construed broadly, e.g., as meaning electrically connected or in communication with each other; may be directly connected or indirectly connected through an intermediate. The specific meaning of the above terms in the present disclosure can be understood by those of ordinary skill in the art as appropriate.
Exemplary System
Fig. 1 is a schematic diagram illustrating a system architecture 100 to which the entity recognition model generation method and apparatus, and the entity extraction method and apparatus according to the embodiments of the present disclosure can be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a text processing application, a search-type application, a web browser application, a shopping-type application, an instant messaging tool, and the like.
The terminal devices 101, 102, 103 may be a variety of electronic devices including, but not limited to, smart phones, tablets, portable and desktop computers, digital cinema projectors, and the like.
The server 105 may be a server that provides various services. For example, the user sends various categories of text to the server 105 using the terminal device 103 (which may be the terminal device 101 or 102). The background text server can perform model training by using the obtained sample text, and can also extract a target type entity from the received text by using the trained model.
Exemplary method
Referring to fig. 2, a flowchart of an entity recognition model generation method provided in an exemplary embodiment of the present disclosure, where the present embodiment may be applied to an electronic device (such as the terminal devices 101, 102, 103 or the server 105 shown in fig. 1), includes the following steps:
s210, a first sample statement set is obtained.
S220, training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model.
And S230, acquiring a second sample statement set.
S240, training an initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set, and the negative sample sentences and the corresponding negative sample marking information to obtain an entity classification model.
In the entity recognition model generation method provided by the embodiment of the present disclosure, the first sample sentence set and the second sample sentence set are obtained, the machine learning method is used, the target type entity recognition model is trained based on the first sample sentence set, and the entity classification model is trained based on the second sample sentence set, so that a model with high extraction accuracy and classification accuracy can be obtained, and the trained target type entity recognition model and entity classification model are used to accurately extract the target type entity from the text and accurately classify the target type entity.
In S210, the electronic device may obtain the first set of sample statements, either locally or remotely. And the sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities.
The source of the first set of sample statements may include a variety, for example, statements extracted from a predetermined text, or statements extracted from a database, such as a MSRA data set.
The target type entity may be various types of entities, for example, the target type entity may be an organizational entity (e.g., a business name, etc.). The entity annotation information is used for indicating a target type entity in the sample statement.
In general, the entity tagging information may be information tagged with an existing BIO tagging method. As an example, a sample sentence is "customer of this bidding work is Beijing XX science and technology Co., Ltd, and bidding agent is Beijing YY science and technology Co., Ltd", and the corresponding entity label information is "O O O O O O O O O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I". Where O represents that the corresponding character is a non-target type entity (here, an organizational entity) character, B-ORG represents that the corresponding character is the first character of the organizational entity, and I-ORG represents that the corresponding character is a non-first character of the organizational entity.
In this embodiment, the entity tagging information may be automatically generated by the electronic device or may be manually set.
In S220, the electronic device may train the initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity tagging information, so as to obtain the target type entity recognition model.
Specifically, the initial target type entity recognition model may be constructed based on existing neural network models of various structures. For example, a model constructed by combining a Conditional Random Field (CRF) with an ernie (enhanced Language Representation with information entities) model and a bert (bidirectional Encoder Representation from transformations) model can be used. As another example, an initial target-type entity recognition model may be constructed based on RoBERTA, XLNET, and the like.
The electronic device may use a machine learning method to input the sample sentences in the acquired first sample sentence set, output entity tagging information corresponding to the input sample sentences as expected, train the initial target type entity recognition model, and obtain actual output for each training of the input sample sentences. And the actual output is data actually output by the initial target type entity recognition model and is used for representing entity marking information. Then, the electronic device may adjust parameters of the initial target type entity recognition model based on actual output and expected output by using a gradient descent method and a back propagation method, use the model obtained after each parameter adjustment as the initial target type entity recognition model for the next training, and end the training when a preset training end condition is met, thereby obtaining the target type entity recognition model through training.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the calculated loss values converge using a preset loss function (e.g., a cross-entropy loss function).
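The preset training-end check can be sketched as a simple predicate; the function name, thresholds, and window-based convergence test below are illustrative assumptions, not the disclosure's implementation:

```python
import time

def should_stop(start_time, max_seconds, step, max_steps, losses, window=5, tol=1e-4):
    """Return True when any preset training-end condition holds:
    elapsed training time, number of training steps, or convergence of
    recent loss values (max minus min over a trailing window below a
    tolerance). Hypothetical sketch for illustration only."""
    if time.monotonic() - start_time > max_seconds:
        return True
    if step > max_steps:
        return True
    recent = losses[-window:]
    if len(recent) == window and max(recent) - min(recent) < tol:
        return True
    return False

# e.g. a flat loss curve triggers the convergence condition
should_stop(time.monotonic(), 3600, 10, 100000, [0.31, 0.31, 0.31, 0.31, 0.31])
# → True
```

In practice the loss values would come from the preset loss function (e.g., cross-entropy) evaluated after each parameter adjustment.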
In S230, the electronic device may obtain the second sample statement set either locally or remotely. The second sample statement set includes positive sample statements and negative sample statements; the positive sample statements include target type entities of a preset category and have corresponding positive sample marking information, and the negative sample statements include target type entities of a non-preset category and have corresponding negative sample marking information.
The preset category may include at least one category. For example, when the target type entity is an organization entity, the preset category may include a customer and a supplier, and correspondingly, the non-preset category includes a non-customer and a non-supplier.
The positive sample marking information is used for indicating that the positive sample statement includes a target type entity of a preset category, and the negative sample marking information is used for indicating that the negative sample statement includes a target type entity of a non-preset category. As an example, when the preset categories include customer and supplier, the number 0 indicates that the category of the target type entity is customer, the number 1 indicates that the category is supplier, and the number 2 indicates that the category is neither customer nor supplier.
Optionally, the target type entity of the preset category included in the positive sample statement and the target type entity of the non-preset category included in the negative sample statement may be marked with a mark indicating the position of the target type entity in the sample statement, so that the electronic device may determine, from the sample statement, the position of the target type entity of the preset category or of the non-preset category.
Optionally, the positive sample statement and the negative sample statement may each be composed of a statement pair: the positive sample statement consists of a statement and a target type entity of a preset category, and the negative sample statement consists of a statement and a target type entity of a non-preset category. For example, a positive sample statement may include the following statement pair: "The customer of this bidding work is Beijing XX Technology Co., Ltd" - "Beijing XX Technology Co., Ltd".
In S240, the electronic device may train the initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set, and the negative sample sentences and the corresponding negative sample labeling information, to obtain an entity classification model.
Specifically, the initial entity classification model may be constructed based on existing neural network models of various structures. For example, a classification model for classifying the target type entity is constructed as an initial entity classification model using ERNIE. For another example, the initial entity classification model may include a model such as word2vec for determining feature vectors of a sentence, and may further include a classifier such as SVM, XGBoost, ensemble learning model (e.g., stacking), etc. for classifying the feature vectors.
It should be noted that the entity classification model may be a binary classification model or a multi-class classification model. A multi-class classification model may directly determine the category of the target type entity in the input sentence. The number of binary classification models may be at least one; for example, when there are multiple preset categories, a binary classification model corresponding to each preset category may be trained, each binary classification model being used to determine whether a target type entity in the input sentence belongs to the corresponding preset category.
The electronic device may use a machine learning method to train the initial entity classification model, taking the positive sample sentences in the obtained second sample sentence set as input with the corresponding positive sample labeling information as expected output, and taking the negative sample sentences as input with the corresponding negative sample labeling information as expected output. For each input sample sentence, an actual output can be obtained during training. The actual output is the data actually output by the initial entity classification model and is used for representing positive sample labeling information or negative sample labeling information. Then, the electronic device may adjust parameters of the initial entity classification model based on the actual output and the expected output by using a gradient descent method and a back propagation method, take the model obtained after each parameter adjustment as the initial entity classification model for the next training iteration, and end the training when a preset training end condition is met, thereby obtaining the entity classification model.
It should be noted that the preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds the preset time; the training times exceed the preset times; the calculated loss values converge using a preset loss function (e.g., a cross-entropy loss function).
In some alternative implementations, as shown in fig. 3, step S210 includes the following sub-steps:
s2101, a first sample statement set is extracted from a preset sample text, and a target type entity is determined from sample statements included in the first sample statement set.
Wherein the number of the preset sample texts may be at least one. The preset sample text may be various types of text, for example, the preset sample text may include a plurality of bidding documents acquired from a bidding website. As another example, the predetermined sample text may also include text obtained from a predetermined data set (e.g., MSRA data set). Generally, the electronic device can randomly extract a plurality of sentences from a preset sample text to form a first sample sentence set.
In this step, the method for determining the target type entity in the sample statement may include various methods.
In some alternative implementations, the electronic device can determine the target type entity from the sample statements included in the first set of sample statements using at least one of:
in a first mode, based on a preset regular expression, a target type entity is determined from sample sentences included in a first sample sentence set.
By way of example, for an organizational entity whose category is customer, a corresponding regular expression may be defined to match the sentence pattern in which customers appear. Similarly, using a plurality of regular expressions, organizational entities of each category may be determined from each sample statement included in the first sample statement set.
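A hedged sketch of the first manner; the pattern below is a hypothetical illustration written against the translated example sentence, not the regular expression used in the disclosure:

```python
import re

# Hypothetical pattern: capture the organization named as the customer.
CUSTOMER_PATTERN = re.compile(r"customer\b.*?\bis\s+(.+?Co\.,\s*Ltd)")

def extract_customers(sentence):
    """Return all organization names matched as customers."""
    return CUSTOMER_PATTERN.findall(sentence)

s = ("The customer of this bidding work is Beijing XX Technology Co., Ltd, "
     "and the bidding agent is Beijing YY Technology Co., Ltd")
extract_customers(s)
# → ["Beijing XX Technology Co., Ltd"]
```

One such pattern would be defined per preset category (customer, supplier, and so on), each capturing the entity from its characteristic sentence pattern.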
And secondly, searching the target type entity from the sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
The prefix dictionary tree may be constructed in advance by using a database in which the target type entities are stored, and the electronic device may search the target type entities included in the prefix dictionary tree from each sample statement included in the first sample statement set.
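The second manner can be sketched with a plain dictionary-based prefix trie; the data structure and longest-match scan below are illustrative assumptions (the disclosure does not specify how its prefix dictionary tree is implemented):

```python
END = "$"  # terminal marker storing the full entity string

def build_trie(entities):
    """Insert each entity, character by character, into a nested-dict trie."""
    root = {}
    for entity in entities:
        node = root
        for ch in entity:
            node = node.setdefault(ch, {})
        node[END] = entity
    return root

def find_entities(sentence, root):
    """Scan the sentence; at each position keep the longest entity match."""
    found = []
    for i in range(len(sentence)):
        node, match = root, None
        for ch in sentence[i:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                match = node[END]
        if match:
            found.append(match)
    return found

trie = build_trie(["Beijing XX Technology Co., Ltd", "Beijing YY Technology Co., Ltd"])
find_entities("The customer is Beijing XX Technology Co., Ltd today", trie)
# → ["Beijing XX Technology Co., Ltd"]
```

The trie would be built once from the database of stored target type entities and reused across all sample statements.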
It should be noted that the target type entity may be determined by using either the first manner or the second manner alone, or by using both manners at the same time and taking the union of the target type entities extracted by the two manners.
The two entity extraction manners provided by this implementation enable the electronic device to automatically extract the target type entities from the sample statements included in the first sample statement set, and combining the two manners makes the extracted target type entities more comprehensive.
S2102, based on the position of the target type entity in the sample statement, generating entity tagging information corresponding to the sample statements included in the first sample statement set.
The entity tagging information may indicate a position of the target type entity in the sample statement, and specifically refer to the example in step S210.
In the embodiment corresponding to fig. 3, the electronic device automatically extracts the first sample sentence set from the preset sample text, determines the target type entity from the sample sentence, and generates the entity tagging information, so that compared with a manual tagging method, the efficiency of tagging the sample sentence can be improved.
In some alternative implementations, as shown in fig. 4, the above S230 includes the following sub-steps to obtain the positive sample statements and the negative sample statements included in the second sample statement set:
s2301, extracting an initial sample sentence set from the preset sample text.
The preset sample text may be the same as or different from the preset sample text described in S2101. The number of preset sample texts of this step may be at least one. The preset sample text may be various types of text, for example, the preset sample text may include a plurality of bidding documents acquired from a bidding website. Generally, the electronic device may randomly extract a plurality of sentences from a preset sample text to form an initial sample sentence set.
S2302, determining a sample sentence including a preset category of target type entities from the initial sample sentence set, and determining a sample sentence including a non-preset category of target type entities.
As an example, the electronic device may determine the target type entities of the preset category from the sample sentences included in the initial sample sentence set based on a regular expression corresponding to the preset category, and may also determine the target type entities of the non-preset category from the sample sentences included in the initial sample sentence set based on a regular expression corresponding to the non-preset category.
And S2303, determining a statement pair consisting of a sample statement where the preset type target type entity is located and the preset type target type entity as a positive sample statement, and generating positive sample marking information representing the preset type target type entity.
As an example, when the preset categories include customer and supplier, the number 0 indicates that the category of the target type entity is customer, the number 1 indicates supplier, and the number 2 indicates neither customer nor supplier. A positive sample statement includes the statement pair: "The customer of this bidding job is Beijing XX Technology Co., Ltd, and the bidding agent is Beijing YY Technology Co., Ltd" - "Beijing XX Technology Co., Ltd". The positive sample marking information corresponding to this positive sample statement is the number 0.
S2304, determining a statement pair consisting of a sample statement where the target type entity of the non-preset category is located and the target type entity of the non-preset category as a negative sample statement, and generating negative sample labeling information representing the target type entity of the non-preset category.
By way of example, a negative sample statement includes the statement pair: "The customer of this bidding job is Beijing XX Technology Co., Ltd, and the bidding agent is Beijing YY Technology Co., Ltd" - "Beijing YY Technology Co., Ltd". The negative sample marking information corresponding to this negative sample statement is the number 2.
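Steps S2303 and S2304 can be sketched together as follows; the 0/1/2 label scheme follows the example above, while the function name and input structure are illustrative assumptions:

```python
def build_samples(sentence, customers, suppliers, all_entities):
    """Pair the sentence with each entity and attach 0 (customer),
    1 (supplier), or 2 (neither) as the sample labeling information."""
    samples = []
    for entity in all_entities:
        if entity in customers:
            label = 0
        elif entity in suppliers:
            label = 1
        else:
            label = 2
        samples.append(((sentence, entity), label))
    return samples

sentence = ("The customer of this bidding job is Beijing XX Technology Co., Ltd, "
            "and the bidding agent is Beijing YY Technology Co., Ltd")
build_samples(sentence,
              customers={"Beijing XX Technology Co., Ltd"},
              suppliers=set(),
              all_entities=["Beijing XX Technology Co., Ltd",
                            "Beijing YY Technology Co., Ltd"])
# the XX pair gets label 0 (a positive sample), the YY pair gets label 2 (a negative sample)
```

Samples labeled 0 or 1 correspond to positive sample statements, and those labeled 2 to negative sample statements.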
In the embodiment corresponding to fig. 4, the electronic device automatically extracts the initial sample sentence set from the preset sample text, determines the positive sample sentences and the negative sample sentences from each sample sentence, and generates the positive sample labeling information and the negative sample labeling information.
In some optional implementations, as shown in fig. 5, in S2302, the following sub-steps may be included to determine a sample statement including a target type entity of a non-preset category from the initial sample statement set:
s23021, determining the target type entity from the sample sentences in the initial sample sentence set by using the target type entity recognition model.
The target type entity recognition model is the model obtained by the training in step S220. Since the model is trained on a large number of training samples using a machine learning method, its recognition accuracy is high. The electronic device may sequentially input each statement in the initial sample statement set into the target type entity recognition model, thereby obtaining the target type entities included in each input sample statement.
S23022, comparing each determined target type entity with a preset type target type entity to obtain a non-preset type target type entity.
Specifically, a difference set may be taken between a set composed of each target type entity output by the model and a set composed of target type entities of the preset category determined in the embodiment corresponding to fig. 4, so as to obtain a target type entity of a non-preset type.
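The difference-set comparison in S23022 amounts to a single set operation; the function name below is an illustrative assumption:

```python
def non_preset_entities(model_entities, preset_entities):
    """Entities found by the recognition model minus those already
    matched to a preset category: the non-preset-category entities."""
    return set(model_entities) - set(preset_entities)

non_preset_entities(
    ["Beijing XX Technology Co., Ltd", "Beijing YY Technology Co., Ltd"],
    ["Beijing XX Technology Co., Ltd"])
# → {"Beijing YY Technology Co., Ltd"}
```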
S23023, determining the sample statement where a target type entity of the non-preset category is located as a sample statement including a target type entity of the non-preset category.
The method described in the embodiment of fig. 5 can comprehensively and accurately determine the target type entity from the sample sentences in the initial sample sentence set by using the trained target type entity recognition model, and thus can comprehensively and accurately generate the negative sample sentences.
In some optional implementations, before S210, the electronic device may further perform the following steps:
and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
In particular, the electronic device may replace or remove characters that interfere with model training. As an example, the preset initial text may be a text in html format, and the electronic device may replace an html tag included therein with a space or an empty character, replace carriage return with a space, remove an excess space character, replace an english punctuation mark with a chinese punctuation mark, and the like.
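A minimal sketch of the preprocessing described above, assuming the html-format input mentioned in the example; the regular expressions and punctuation map are illustrative, not the disclosure's exact rules:

```python
import re

# English punctuation -> Chinese punctuation (illustrative subset)
PUNCT_MAP = str.maketrans({",": "，", "?": "？", "!": "！", ";": "；", ":": "："})

def preprocess(html_text):
    """Normalize a preset initial text in html format into a preset sample text."""
    text = re.sub(r"<[^>]+>", " ", html_text)            # html tags -> spaces
    text = text.replace("\r", " ").replace("\n", " ")    # carriage returns -> spaces
    text = re.sub(r"\s{2,}", " ", text).strip()          # remove excess spaces
    return text.translate(PUNCT_MAP)                     # swap punctuation

preprocess("<p>Hello,\r\nworld</p>")
# → "Hello， world"
```

The same function can also be applied in the entity extraction method below, where the original text is preprocessed into the text to be recognized.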
According to this implementation, the preset initial text is preprocessed in advance to obtain the preset sample text, so that the format of the preset sample text meets the requirements of model training; this reduces the interference of unnecessary characters on model training and improves the accuracy of model recognition.
Continuing with reference to fig. 6, which shows a flowchart of an entity extraction method provided in an exemplary embodiment of the present disclosure, the method is applied to an electronic device (such as the terminal device 101, 102, 103 or the server 105 shown in fig. 1) and includes the following steps:
s610, acquiring a text to be recognized.
And S620, inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity.
S630, determining the sentence to be classified from the text to be recognized based on the target type entity.
And S640, inputting the statement to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity.
According to the entity extraction method provided by the embodiment of the present disclosure, the target type entity is accurately and efficiently extracted from the text to be recognized, and its category is determined, by using the target type entity recognition model and the entity classification model trained according to the embodiment corresponding to fig. 2.
In S610, the electronic device may obtain the text to be recognized locally or remotely. The text to be recognized may be various types of text. For example, the text to be recognized may be a bidding document obtained from a bidding website.
In S620, the electronic device may input the text to be recognized into a pre-trained target type entity recognition model, so as to obtain a target type entity.
The target type entity recognition model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2. Specifically, the electronic device may sequentially input the sentences included in the text to be recognized into the target type entity recognition model, so as to obtain the target type entities in each sentence. As an example, when the text to be recognized is a bidding document, the target type entity may be an organizational entity.
In S630, the electronic device may determine a sentence to be classified from the text to be recognized based on the target type entity.
Optionally, the electronic device may use the statement in which the target type entity is located as the statement to be classified and mark the target type entity in the statement to be classified.
In S640, the electronic device may input the sentence to be classified into the pre-trained entity classification model, so as to obtain entity class information representing the class of the target type entity.
The entity classification model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2. The electronic device may sequentially input each statement to be classified (e.g., the statement pair) into the entity classification model, so as to obtain entity category information corresponding to each target type entity. As an example, when the text to be recognized is a bidding document and the target type entity is an organizational entity, the entity category information corresponding to a certain target type entity may represent that the category of the target type entity is customer, supplier, or neither customer nor supplier. When the method is applied to the bidding field, accurate customer and supplier information can be automatically extracted from bidding documents, improving user experience.
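Steps S610 to S640 can be sketched end to end as follows. The recognize() and classify() functions are hypothetical stubs standing in for the two trained models; only the control flow reflects the method above:

```python
LABELS = {0: "customer", 1: "supplier", 2: "neither"}

KNOWN = ("Beijing XX Technology Co., Ltd", "Beijing YY Technology Co., Ltd")

def recognize(sentence):
    """Stub for the target type entity recognition model (S620)."""
    return [e for e in KNOWN if e in sentence]

def classify(sentence, entity):
    """Stub for the entity classification model (S640): here a crude
    keyword heuristic; the real model consumes the statement pair."""
    c, a = sentence.find("customer"), sentence.find("agent")
    return 0 if c != -1 and c < sentence.find(entity) < a else 2

def extract(sentence):
    """S610 to S640: recognize entities, pair each with its sentence, classify."""
    return {entity: LABELS[classify(sentence, entity)]
            for entity in recognize(sentence)}

s = ("The customer of this bidding job is Beijing XX Technology Co., Ltd, "
     "and the bidding agent is Beijing YY Technology Co., Ltd")
extract(s)
# → {"Beijing XX Technology Co., Ltd": "customer",
#    "Beijing YY Technology Co., Ltd": "neither"}
```

In the disclosed method, both stubs would be replaced by the models trained in the embodiment corresponding to fig. 2.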
In some optional implementations, the step S610 may include the following sub-steps:
first, the original text is obtained.
The original text may be various types of text, among others. As an example, the original text may be html formatted text obtained from a bidding website.
And then, preprocessing the original text to obtain the text to be recognized which accords with the preset format.
In general, the method of preprocessing may be the same as described above with respect to the alternative implementation of preprocessing in the embodiment corresponding to fig. 2. For example, the electronic device may replace an html tag included in the original text in html format with a space or an empty character, replace carriage return with a space, remove an excess space, replace an english punctuation mark with a chinese punctuation mark, and the like.
In the embodiment, the original text is preprocessed in advance to obtain the text to be recognized, so that the format of the text to be recognized can meet the requirement of model recognition, the interference of some unnecessary characters on the model recognition is reduced, and the precision of entity classification is improved.
In some alternative implementations, the above S630 may be performed as follows:
and forming a statement pair by the statement where the target type entity is located and the target type entity, and determining the statement pair as the statement to be classified.
For example, for the sentence "The customer of the bidding job is Beijing XX Technology Co., Ltd, and the bidding agent is Beijing YY Technology Co., Ltd", two statement pairs are obtained, namely "The customer of the bidding job is Beijing XX Technology Co., Ltd, and the bidding agent is Beijing YY Technology Co., Ltd" - "Beijing XX Technology Co., Ltd" and "The customer of the bidding job is Beijing XX Technology Co., Ltd, and the bidding agent is Beijing YY Technology Co., Ltd" - "Beijing YY Technology Co., Ltd". Both statement pairs are statements to be classified.
This implementation takes the statement pair including the statement and the target type entity as the statement to be classified, so that when the model performs entity classification, the statement including the target type entity can be analyzed in a more targeted manner, which can improve the accuracy of entity classification.
Exemplary devices
Fig. 7 schematically shows a structural diagram of a recognition model generation apparatus according to an embodiment of the present disclosure. The recognition model generation apparatus provided in the embodiment of the present disclosure may be disposed on a terminal device, or may be disposed on a server, or may be partially disposed on a terminal device and partially disposed on a server; for example, it may be disposed on the server 105 in fig. 1 (adjusted according to the actual deployment), but the present disclosure is not limited thereto.
The recognition model generation device provided by the embodiment of the present disclosure may include: a first obtaining module 710, configured to obtain a first sample statement set, where a sample statement in the first sample statement set includes a target type entity and has corresponding entity tagging information representing the target type entity; a first training module 720, configured to train an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity tagging information, to obtain a target type entity recognition model; a second obtaining module 730, configured to obtain a second sample statement set, where the second sample statement set includes a positive sample statement and a negative sample statement, the positive sample statement includes a preset type of target type entity and has corresponding positive sample labeling information, and the negative sample statement includes a non-preset type of target type entity and has corresponding negative sample labeling information; the second training module 740 is configured to train the initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set, and the negative sample sentences and the corresponding negative sample labeling information, to obtain an entity classification model.
In this embodiment, the first obtaining module 710 may obtain the first sample statement set locally or remotely. And the sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities.
The source of the first set of sample statements may include a variety, for example, statements extracted from a predetermined text, or statements extracted from a database, such as a MSRA data set.
The target type entity may be various types of entities, for example, the target type entity may be an organizational entity (e.g., a business name, etc.). The entity annotation information is used for indicating a target type entity in the sample statement.
In this embodiment, the entity tagging information may be automatically generated by the apparatus, or may be manually set.
In this embodiment, the first training module 720 may train the initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity tagging information, so as to obtain the target type entity recognition model.
Specifically, the initial target type entity recognition model may be constructed based on existing neural network models of various structures. For example, a model may be constructed by combining a Conditional Random Field (CRF) with an ERNIE (Enhanced Language Representation with Informative Entities) model or a BERT (Bidirectional Encoder Representations from Transformers) model. As another example, an initial target type entity recognition model may be constructed based on RoBERTa, XLNet, and the like.
In this embodiment, the second obtaining module 730 may obtain the second sample statement set locally or remotely. The second sample statement set includes positive sample statements and negative sample statements; the positive sample statements include target type entities of a preset category and have corresponding positive sample marking information, and the negative sample statements include target type entities of a non-preset category and have corresponding negative sample marking information.
The preset category may include at least one category. For example, when the target type entity is an organization entity, the preset category may include a customer and a supplier, and correspondingly, the non-preset category includes a non-customer and a non-supplier.
The positive sample marking information is used for indicating that the positive sample statement includes a target type entity of a preset category, and the negative sample marking information is used for indicating that the negative sample statement includes a target type entity of a non-preset category. As an example, when the preset categories include customer and supplier, the number 0 indicates that the category of the target type entity is customer, the number 1 indicates that the category is supplier, and the number 2 indicates that the category is neither customer nor supplier.
In this embodiment, the second training module 740 may be constructed based on existing neural network models of various structures. For example, a classification model for classifying the target type entity is constructed as an initial entity classification model using ERNIE. For another example, the initial entity classification model may include a model such as word2vec for determining feature vectors of a sentence, and may further include a classifier such as SVM, XGBoost, ensemble learning model (e.g., stacking), etc. for classifying the feature vectors.
It should be noted that the entity classification model may be a binary classification model or a multi-class classification model. A multi-class classification model may directly determine the category of the target type entity in the input sentence. The number of binary classification models may be at least one; for example, when there are multiple preset categories, a binary classification model corresponding to each preset category may be trained, each binary classification model being used to determine whether a target type entity in the input sentence belongs to the corresponding preset category.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a recognition model generation apparatus according to another exemplary embodiment of the present disclosure.
In some optional implementations, the first obtaining module 710 includes: the first extraction unit 7101 is used for extracting a first sample statement set from a preset sample text and determining a target type entity from sample statements included in the first sample statement set; a generating unit 7102, configured to generate entity tagging information corresponding to the sample statement included in the sample statement subset, based on a position of the target type entity in the sample statement.
In some optional implementations, the second obtaining module 730 includes: a second extracting unit 7301, configured to extract an initial sample sentence set from a preset sample text; a first determining unit 7302, configured to determine, from the initial sample sentence set, sample sentences including target type entities of a preset category and sample sentences including target type entities of a non-preset category; a second determining unit 7303, configured to determine a statement pair composed of the sample statement where a target type entity of the preset category is located and that target type entity as a positive sample statement, and generate positive sample labeling information representing the target type entity of the preset category; a third determining unit 7304, configured to determine a statement pair composed of the sample statement where a target type entity of the non-preset category is located and that target type entity as a negative sample statement, and generate negative sample labeling information representing the target type entity of the non-preset category.
In some alternative implementations, the first determining unit 7302 includes: a first determining subunit 73021, configured to determine target type entities from the sample sentences in the initial sample sentence set by using the target type entity recognition model; a comparison subunit 73022, configured to compare each determined target type entity with the target type entities of the preset category to obtain target type entities of the non-preset category; a second determining subunit 73023, configured to determine the sample statement where a target type entity of the non-preset category is located as a sample statement including a target type entity of the non-preset category.
In some optional implementations, the first extraction unit 7101 is further configured to: determining a target type entity from the sample sentences included in the first sample sentence set by using at least one of the following modes: determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; and secondly, searching the target type entity from the sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
In some optional implementations, the apparatus further comprises: the preprocessing module 750 is configured to preprocess the preset initial text to obtain a preset sample text conforming to a preset format.
The entity recognition model generation apparatus provided by the embodiment of the present disclosure obtains the first sample sentence set and the second sample sentence set, trains the target type entity recognition model based on the first sample sentence set, and trains the entity classification model based on the second sample sentence set by using a machine learning method. Models with high extraction accuracy and classification accuracy can thereby be obtained, and the trained target type entity recognition model and entity classification model can be used to accurately extract target type entities from text and accurately classify them.
The specific implementation of each module, unit and subunit in the entity identification model generation apparatus provided in the embodiment of the present disclosure may refer to the content in the entity identification model generation method, and is not described herein again.
Fig. 9 schematically shows a structural diagram of an entity extraction apparatus according to an embodiment of the present disclosure. The entity extraction apparatus provided in the embodiment of the present disclosure may be disposed on a terminal device, or may be disposed on a server, or may be partially disposed on a terminal device and partially disposed on a server; for example, it may be disposed on the server 105 in fig. 1 (adjusted according to the actual deployment), but the present disclosure is not limited thereto.
The entity extraction apparatus provided in the embodiments of the present disclosure may include: a third obtaining module 910, configured to obtain a text to be recognized; a recognition module 920, configured to input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, where the target type entity recognition model is trained in advance based on the method of the first aspect; a determining module 930, configured to determine a sentence to be classified from the text to be recognized based on the target type entity; and a classifying module 940, configured to input the sentence to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity, where the entity classification model is trained in advance based on the method of the first aspect.
In this embodiment, the third obtaining module 910 may obtain the text to be recognized locally or remotely. The text to be recognized may be various types of text. For example, the text to be recognized may be a bidding document obtained from a bidding website.
In this embodiment, the recognition module 920 may input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity.
The target type entity recognition model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2. Specifically, the recognition module 920 may sequentially input the sentences included in the text to be recognized into the target type entity recognition model, so as to obtain the target type entities in each sentence. As an example, when the text to be recognized is a bidding document, the target type entity may be an organization entity.
In this embodiment, the determining module 930 may determine the sentence to be classified from the text to be recognized based on the target type entity.
Optionally, the determining module 930 may take the statement in which the target type entity is located as the statement to be classified and mark the target type entity in the statement to be classified.
In this embodiment, the classification module 940 may input the sentence to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity.
The entity classification model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2. The classifying module 940 may sequentially input each statement to be classified (e.g., the statement pair) into the entity classification model, so as to obtain the entity class information corresponding to each target type entity. As an example, when the text to be recognized is a bidding document and the target type entity is an organization entity, the entity class information corresponding to a certain target type entity may indicate that its class is a client, a supplier, or neither a client nor a supplier. When applied to the bidding field, the method can automatically extract accurate client and supplier information from bidding documents, improving the user experience.
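The recognition, pair-forming, and classification steps described above can be composed into one pipeline. The sketch below is an assumption about control flow only: `recognize` and `classify` stand in for the trained models, the naive period-based sentence split is for illustration, and the category labels mirror the bidding example.

```python
def extract_entities(text, recognize, classify):
    """End-to-end sketch: recognize target-type entities per sentence,
    form (sentence, entity) statement pairs, and classify each pair.
    `recognize` and `classify` are hypothetical model interfaces."""
    results = []
    for sentence in text.split("."):   # naive sentence split, illustration only
        sentence = sentence.strip()
        if not sentence:
            continue
        for entity in recognize(sentence):
            pair = (sentence, entity)  # the statement pair fed to the classifier
            results.append({"entity": entity,
                            "sentence": sentence,
                            "category": classify(pair)})
    return results
```

With stub models, a two-sentence bidding document yields one classified record per recognized organization entity.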
In some optional implementations, the third obtaining module 910 is further configured to: acquiring an original text; and preprocessing the original text to obtain the text to be recognized which accords with the preset format.
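The preprocessing step above can be sketched as follows. The patent does not define the preset format, so stripping markup remnants and collapsing whitespace are assumptions chosen for illustration.

```python
import re

def preprocess(raw_text):
    """Illustrative preprocessing of an original text into a 'preset format'
    (the concrete format is an assumption, not specified by the patent)."""
    text = re.sub(r"<[^>]+>", " ", raw_text)  # drop HTML/markup remnants
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()
```

For example, a scraped bidding page with tags and irregular spacing is reduced to plain normalized text before recognition.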
In some optional implementations, the determining module 930 is further configured to: form a statement pair from the statement in which the target type entity is located and the target type entity, and determine the statement pair as the statement to be classified.
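One common way to encode such a statement pair as a single classifier input is the BERT-style `[CLS]`/`[SEP]` layout; this layout is an assumption for the sketch, since the patent only states that the statement and the entity together form the pair.

```python
def make_statement_pair(sentence, entity, cls="[CLS]", sep="[SEP]"):
    """Encode the (statement, entity) pair as one classifier input string.
    The [CLS]/[SEP] convention is an assumption, not the patent's format."""
    return f"{cls} {sentence} {sep} {entity} {sep}"
```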
The entity extraction device provided by the embodiment of the disclosure uses the target type entity recognition model and the entity classification model trained as in the embodiment corresponding to fig. 2 to accurately and efficiently extract the target type entity from the text to be recognized and determine its category.
The specific implementation of each module, unit, and sub-unit in the entity extraction apparatus provided in the embodiment of the present disclosure may refer to the content in the entity extraction method, and will not be described herein again.
It should be noted that although several modules, units and sub-units of the apparatus for action execution are mentioned in the above detailed description, such division is not mandatory. Indeed, the features and functionality of two or more modules, units and sub-units described above may be embodied in one module, unit or sub-unit, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module, unit or sub-unit described above may be further divided into and embodied by a plurality of modules, units and sub-units.
Exemplary electronic device
As shown in FIG. 10, the example electronic device 100 includes a processor 1001 for executing software routines. Although a single processor is shown for clarity, the electronic device 100 may include a multi-processor system. The processor 1001 is connected to a communication infrastructure 1002 for communicating with other components of the electronic device 100. The communication infrastructure 1002 may include, for example, a communication bus, a crossbar, or a network.
Electronic device 100 also includes memory, such as Random Access Memory (RAM), which may include a main memory 1003 and a secondary memory 1010. The secondary memory 1010 may include, for example, a hard disk drive 1011 and/or a removable storage drive 1012, and the removable storage drive 1012 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, etc. The removable storage drive 1012 reads from and/or writes to a removable storage unit 1013 in a conventional manner. Removable storage unit 1013 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 1012. As will be appreciated by one skilled in the relevant art, the removable storage unit 1013 includes a computer-readable storage medium having stored thereon computer-executable program code instructions and/or data.
In an alternative embodiment, secondary memory 1010 may additionally or alternatively include other similar means for allowing computer programs or other instructions to be loaded into electronic device 100. Such means may include, for example, a removable storage unit 1021 and an interface 1020. Examples of the removable storage unit 1021 and interface 1020 include: a program cartridge and cartridge interface (such as that found in video game console devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units 1021 and interfaces 1020 that allow software and data to be transferred from the removable storage unit 1021 to the electronic device 100.
Electronic device 100 also includes at least one communication interface 1040. Communications interface 1040 allows software and data to be transferred between electronic device 100 and external devices via communications path 1041. In various embodiments of the present disclosure, communication interface 1040 allows data to be transferred between electronic device 100 and a data communication network, such as a public data or private data communication network. The communication interface 1040 may be used to exchange data between different electronic devices 100, which electronic devices 100 form part of an interconnected computer network. Examples of communication interface 1040 may include a modem, a network interface (such as an ethernet card), a communication port, an antenna with associated circuitry, and so forth. Communication interface 1040 may be wired or may be wireless. Software and data transferred via communications interface 1040 are in the form of signals which may be electronic, magnetic, optical or other signals capable of being received by communications interface 1040. These signals are provided to a communications interface via communications path 1041.
As shown in fig. 10, the electronic device 100 also includes a display interface 1031 and an audio interface 1032, the display interface 1031 performing operations for rendering images to an associated display 1030, and the audio interface 1032 for performing operations for playing audio content through an associated speaker 1033.
In this document, the term "computer program product" may refer, in part, to: the removable storage unit 1013, the removable storage unit 1021, a hard disk installed in the hard disk drive 1011, or a carrier wave carrying software over the communications path 1041 (wireless link or cable) to the communications interface 1040. Computer-readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the electronic device 100 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROMs, DVDs, Blu-ray™ discs, hard disk drives, ROMs, integrated circuits, USB memory, magneto-optical disks, and computer-readable cards such as PCMCIA cards, whether internal or external to the electronic device 100. Transitory or non-tangible computer-readable transmission media may also participate in providing software, applications, instructions, and/or data to the electronic device 100; examples of such transmission media include radio or infrared transmission channels, network connections to another computer or another networked device, and the Internet or intranets, including e-mail transmissions and information recorded on websites and the like.
Computer programs (also called computer program code) are stored in the main memory 1003 and/or the secondary memory 1010. Computer programs may also be received via communications interface 1040. Such computer programs, when executed, enable electronic device 100 to perform one or more features of embodiments discussed herein. In various embodiments, the computer programs, when executed, enable the processor 1001 to perform the features of the embodiments described above. Accordingly, such computer programs represent controllers of the electronic device 100.
The software may be stored in a computer program product and loaded into electronic device 100 using removable storage drive 1012, hard drive 1011, or interface 1020. Alternatively, the computer program product may be downloaded to electronic device 100 via communications path 1041. The software, when executed by the processor 1001, causes the electronic device 100 to perform the functions of the embodiments described herein.
It should be understood that the embodiment of fig. 10 is given by way of example only. Thus, in some embodiments, one or more features of electronic device 100 may be omitted. Also, in some embodiments, one or more features of electronic device 100 may be combined together. Additionally, in some embodiments, one or more features of electronic device 100 may be separated into one or more components.
It will be appreciated that the elements shown in fig. 10 serve to provide a means for performing the various functions and operations of the server described in the above embodiments.
In one embodiment, a server may be generally described as a physical device including at least one processor and at least one memory including computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the physical device to perform necessary operations.
Exemplary computer readable storage Medium
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the functions of the method shown in fig. 2-6.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by an electronic device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
Exemplary computer program
The embodiments of the present disclosure also provide a computer program product for storing computer readable instructions, which when executed, cause a computer to execute the entity identification model generation method or the entity extraction method in any possible implementation manner.
The computer program product may be embodied in hardware, software or a combination thereof. In one alternative, the computer program product is embodied in a computer storage medium, and in another alternative, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure and its various embodiments with various modifications as are suited to the particular use contemplated.

Claims (14)

1. A method for generating an entity recognition model, comprising:
obtaining a first sample statement set, wherein sample statements in the first sample statement set comprise target type entities and have corresponding entity labeling information representing the target type entities;
training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model;
acquiring a second sample statement set, wherein the second sample statement set comprises positive sample statements and negative sample statements, the positive sample statements comprise preset-class target type entities and have corresponding positive sample marking information, and the negative sample statements comprise non-preset-class target type entities and have corresponding negative sample marking information;
and training an initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set and the negative sample sentences and the corresponding negative sample marking information to obtain an entity classification model.
2. The method of claim 1, wherein obtaining the first set of sample statements comprises:
extracting a first sample statement set from a preset sample text, and determining a target type entity from sample statements included in the first sample statement set;
and generating entity labeling information corresponding to the sample sentences included in the first sample sentence set based on the positions of the target type entities in the sample sentences.
3. The method of claim 1, wherein obtaining the second set of sample statements comprises:
extracting an initial sample statement set from a preset sample text;
determining sample sentences comprising preset categories of target type entities from the initial sample sentence set, and determining sample sentences comprising non-preset categories of target type entities;
determining a statement pair consisting of a sample statement where the preset type of target type entity is located and the preset type of target type entity as a positive sample statement, and generating positive sample marking information representing the preset type of target type entity;
determining a statement pair consisting of a sample statement where the target type entity of the non-preset category is located and the target type entity of the non-preset category as a negative sample statement, and generating negative sample labeling information representing the target type entity of the non-preset category.
4. The method of claim 3, wherein determining the sample statement that includes the target type entity of the non-preset category comprises:
determining a target type entity from the sample sentences in the initial sample sentence set by using the target type entity recognition model;
comparing each determined target type entity with the target type entities of the preset type to obtain target type entities of non-preset types;
and determining the sample statement where the target type entity of the non-preset type is located as the sample statement comprising the target type entity of the non-preset type.
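The comparison step of claim 4 amounts to a set difference over recognized entities. The following sketch is an assumption about that logic only; `recognize` is a hypothetical interface to the trained target type entity recognition model, and the pairing mirrors claim 3's statement pairs.

```python
def build_negative_samples(initial_sentences, recognize, preset_entities):
    """Run the recognition model over the initial sample sentences, compare
    each recognized entity against the preset-category entities, and pair
    non-preset entities with their sentences as negative samples."""
    negatives = []
    for sentence in initial_sentences:
        for entity in recognize(sentence):
            if entity not in preset_entities:         # non-preset category
                negatives.append((sentence, entity))  # negative statement pair
    return negatives
```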
5. The method of claim 2, wherein the determining a target type entity from the sample statements included in the first set of sample statements comprises:
determining a target type entity from the sample statements included in the first sample statement set by using at least one of the following ways:
determining a target type entity from the sample sentences included in the first sample sentence set based on a preset regular expression;
and secondly, searching the target type entity from the sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
6. The method according to one of claims 2-5, wherein prior to said extracting the first set of sample sentences from the preset sample text, the method further comprises:
and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
7. An entity extraction method, comprising:
acquiring a text to be identified;
inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by pre-training based on the method of one of claims 1 to 6;
determining sentences to be classified from the texts to be recognized based on the target type entities;
inputting the statement to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity, wherein the entity classification model is obtained by pre-training based on the method of one of claims 1 to 6.
8. The method of claim 7, wherein the obtaining the text to be recognized comprises:
acquiring an original text;
and preprocessing the original text to obtain the text to be recognized which accords with a preset format.
9. The method of claim 7, wherein the determining a sentence to be classified from the text to be recognized based on the target type entity comprises:
and forming a statement pair by the statement where the target type entity is located and the target type entity, and determining the statement pair as the statement to be classified.
10. An entity recognition model generation apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a first sample statement set, and sample statements in the first sample statement set comprise target type entities and have corresponding entity marking information for representing the target type entities;
the first training module is used for training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity marking information to obtain a target type entity recognition model;
the second obtaining module is used for obtaining a second sample statement set, wherein the second sample statement set comprises positive sample statements and negative sample statements, the positive sample statements comprise preset-class target type entities and have corresponding positive sample marking information, and the negative sample statements comprise non-preset-class target type entities and have corresponding negative sample marking information;
and the second training module is used for training an initial entity classification model based on the positive sample sentences and the corresponding positive sample marking information in the second sample sentence set, and the negative sample sentences and the corresponding negative sample marking information to obtain an entity classification model.
11. An entity extraction apparatus, comprising:
the third acquisition module is used for acquiring the text to be recognized;
the recognition module is used for inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by pre-training based on the method of one of claims 1 to 6;
the determining module is used for determining sentences to be classified from the texts to be recognized based on the target type entity;
a classification module, configured to input the sentence to be classified into a pre-trained entity classification model, so as to obtain entity class information representing a class of the target type entity, where the entity classification model is obtained by being trained in advance based on the method according to any one of claims 1 to 6.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any one of claims 1-9 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-9.
14. A computer program comprising computer readable code for, when run on a device, executing instructions for implementing the steps of the method according to any one of claims 1 to 9 by a processor in the device.
CN202110208364.9A 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device Active CN113010638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110208364.9A CN113010638B (en) 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110208364.9A CN113010638B (en) 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device

Publications (2)

Publication Number Publication Date
CN113010638A true CN113010638A (en) 2021-06-22
CN113010638B CN113010638B (en) 2024-02-09

Family

ID=76385896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110208364.9A Active CN113010638B (en) 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device

Country Status (1)

Country Link
CN (1) CN113010638B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626592A (en) * 2021-07-08 2021-11-09 中汽创智科技有限公司 Corpus-based classification method and device, electronic equipment and storage medium
CN114254109A (en) * 2021-12-15 2022-03-29 北京金堤科技有限公司 Method and device for determining industry category
CN114611497A (en) * 2022-05-10 2022-06-10 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN115329756A (en) * 2021-10-21 2022-11-11 盐城金堤科技有限公司 Execution subject extraction method and device, storage medium and electronic equipment
WO2023280106A1 (en) * 2021-07-06 2023-01-12 北京有竹居网络技术有限公司 Information acquisition method and apparatus, device, and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170300565A1 (en) * 2016-04-14 2017-10-19 Xerox Corporation System and method for entity extraction from semi-structured text documents
US20180173694A1 (en) * 2016-12-21 2018-06-21 Industrial Technology Research Institute Methods and computer systems for named entity verification, named entity verification model training, and phrase expansion
CN109145153A (en) * 2018-07-02 2019-01-04 北京奇艺世纪科技有限公司 It is intended to recognition methods and the device of classification
US20190197176A1 (en) * 2017-12-21 2019-06-27 Microsoft Technology Licensing, Llc Identifying relationships between entities using machine learning
CN110263338A (en) * 2019-06-18 2019-09-20 北京明略软件系统有限公司 Replace entity name method, apparatus, storage medium and electronic device
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
US20200073933A1 (en) * 2018-08-29 2020-03-05 National University Of Defense Technology Multi-triplet extraction method based on entity-relation joint extraction model
CN111739520A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Speech recognition model training method, speech recognition method and device
CN111813954A (en) * 2020-06-28 2020-10-23 北京邮电大学 Method and device for determining relationship between two entities in text statement and electronic equipment
US20200364406A1 (en) * 2019-05-17 2020-11-19 Baidu Online Network Technology (Beijing) Co., Ltd Entity relationship processing method, apparatus, device and computer readable storage medium
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
N. KANYA et al.: "Modelings and Techniques in Named Entity Recognition - an Information Extraction Task", IET Chennai 3rd International on Sustainable Energy and Intelligent Systems (SEISCON 2012), pages 1 - 5 *
FU HAO: "Chinese Named Entity Recognition Based on BERT, ReZero and CRF", Computer Programming Skills & Maintenance, no. 06, pages 12 - 13 *
JIANG WEI; WANG XIAOLONG; GUAN YI; ZHAO JIAN: "A Chinese Lexical Analysis System Based on Multiple Knowledge Sources", Chinese Journal of Computers, no. 01, pages 139 - 147 *
SUN YUQIAO: "Research and Application of Named Entity Recognition Methods for Bidding Data", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138 - 2140 *
ZUMURETIGULI KUERBAN: "Entity Recognition in Electronic Medical Records Based on Neural Networks", China Master's Theses Full-text Database, Medicine and Health Sciences, no. 1, pages 053 - 323 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023280106A1 (en) * 2021-07-06 2023-01-12 北京有竹居网络技术有限公司 Information acquisition method and apparatus, device, and medium
CN113626592A (en) * 2021-07-08 2021-11-09 中汽创智科技有限公司 Corpus-based classification method and device, electronic equipment and storage medium
CN115329756A (en) * 2021-10-21 2022-11-11 盐城金堤科技有限公司 Execution subject extraction method and device, storage medium and electronic equipment
CN114254109A (en) * 2021-12-15 2022-03-29 北京金堤科技有限公司 Method and device for determining industry category
CN114254109B (en) * 2021-12-15 2023-09-19 北京金堤科技有限公司 Method and device for determining industry category
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114611497A (en) * 2022-05-10 2022-06-10 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment

Also Published As

Publication number Publication date
CN113010638B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN113010638B (en) Entity recognition model generation method and device and entity extraction method and device
CN109635103B (en) Abstract generation method and device
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN107943911A (en) Data extraction method and apparatus, computer equipment, and readable storage medium
CN107301163B (en) Formula-containing text semantic parsing method and device
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN113961685A (en) Information extraction method and device
CN110222168B (en) Data processing method and related device
CN113360699A (en) Model training method and device, image question answering method and device
CN115544240B (en) Text sensitive information identification method and device, electronic equipment and storage medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN113469298A (en) Model training method and resource recommendation method
CN114254077A (en) Method for evaluating integrity of manuscript based on natural language
CN113609390A (en) Information analysis method and device, electronic equipment and computer readable storage medium
CN115659969B (en) Document labeling method, device, electronic equipment and storage medium
CN115130437B (en) Intelligent document filling method and device and storage medium
CN111199151A (en) Data processing method and data processing device
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN114218381B (en) Method, device, equipment and medium for identifying position
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN110889289B (en) Information accuracy evaluation method, device, equipment and computer readable storage medium
CN113887244A (en) Text processing method and device
CN113591467B (en) Event main body recognition method and device, electronic equipment and medium
CN113609391B (en) Event recognition method and device, electronic equipment, medium and program
CN118070209A (en) Multi-mode data processing method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant