CN113010638B - Entity recognition model generation method and device and entity extraction method and device - Google Patents


Info

Publication number
CN113010638B
Authority
CN
China
Prior art keywords
sample
target type
entity
preset
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110208364.9A
Other languages
Chinese (zh)
Other versions
CN113010638A (en)
Inventor
李凯
周晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Credit Service Co ltd
Original Assignee
Beijing Jindi Credit Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jindi Credit Service Co ltd
Priority to CN202110208364.9A
Publication of CN113010638A
Application granted
Publication of CN113010638B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide an entity recognition model generation method and apparatus, an entity extraction method and apparatus, a computer-readable storage medium, an electronic device, and a computer program. The method includes the following steps: acquiring a first sample sentence set; training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information to obtain a target type entity recognition model; acquiring a second sample sentence set; and training an initial entity classification model based on the positive sample sentences and corresponding positive sample labeling information, and the negative sample sentences and corresponding negative sample labeling information, in the second sample sentence set to obtain an entity classification model. This technical solution enables automatic extraction of target type entities and accurately and comprehensively obtains both the target type entities in a text and their categories.

Description

Entity recognition model generation method and device and entity extraction method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for generating an entity recognition model, a method and apparatus for extracting an entity, an electronic device, a computer readable storage medium, and a computer program.
Background
Named entity recognition (Named Entity Recognition, NER) refers to the process of identifying, in text, names or symbols that denote specific objects, such as persons, organizations, and locations. Named entity recognition technology is an indispensable component of many natural language processing tasks, such as information extraction, information retrieval, machine translation, and question-answering systems.
Currently, in many fields it is necessary to extract entities of a specific type from existing text in order to provide various services to users. For example, parsing the public information on bidding websites can yield valuable information for businesses and users. Information in the bidding field is complex, and the extraction method widely adopted at present builds a corresponding regular expression template for each kind of information. Other automated extraction techniques also exist, such as web page information extraction based on web page structural features and web page information extraction based on wrapper induction.
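As a concrete illustration of the template-based approach the background describes, a regular expression for one field of a bidding announcement might look like the following sketch. The pattern, field name, and helper function are hypothetical examples, not templates from this disclosure:

```python
import re

# Illustrative hand-written template in the style the passage describes:
# capture the "client" field of a bidding announcement. Hypothetical pattern.
BIDDING_PATTERN = re.compile(
    r"(?:client|purchaser)\s*[:：]\s*(?P<client>[^,，。;]+)"
)

def extract_client(text: str):
    """Return the client field if the template matches, else None."""
    m = BIDDING_PATTERN.search(text)
    return m.group("client").strip() if m else None
```

Each new field or phrasing variant requires another hand-maintained pattern, which is the maintenance burden the disclosed model-based approach is meant to avoid.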
Disclosure of Invention
The present disclosure provides an entity recognition model generation method and apparatus, an entity extraction method and apparatus, an electronic device, a computer-readable storage medium, and a computer program, thereby overcoming, at least to some extent, the technical problems described in the Background section.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to a first aspect of the present disclosure, there is provided an entity recognition model generation method, including: acquiring a first sample sentence set, wherein sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities; training an initial target type entity recognition model based on sample sentences in the first sample sentence set and corresponding entity labeling information to obtain a target type entity recognition model; acquiring a second sample sentence set, wherein the second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information; based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set and the negative sample sentences and the corresponding negative sample labeling information, training an initial entity classification model to obtain an entity classification model.
In an exemplary embodiment of the present disclosure, obtaining the first sample sentence set includes: extracting the first sample sentence set from a preset sample text, and determining target type entities from the sample sentences included in the first sample sentence set; and generating entity labeling information corresponding to the sample sentences included in the first sample sentence set based on the positions of the target type entities in the sample sentences.
In an exemplary embodiment of the present disclosure, obtaining a second set of sample statements includes: extracting an initial sample sentence set from a preset sample text; determining sample sentences including target type entities of preset categories from an initial sample sentence set, and determining sample sentences including target type entities of non-preset categories; determining statement pairs consisting of sample statements in which target type entities of a preset category are located and the target type entities of the preset category as positive sample statements, and generating positive sample labeling information representing the target type entities of the preset category; and determining statement pairs consisting of sample statements in which target type entities of non-preset categories are located and the target type entities of the non-preset categories as negative sample statements, and generating negative sample labeling information representing the target type entities of the non-preset categories.
In an exemplary embodiment of the present disclosure, determining sample sentences including target type entities of a non-preset category includes: determining target type entities from the sample sentences in the initial sample sentence set by using the target type entity recognition model; comparing each determined target type entity with the target type entities of the preset category to obtain target type entities of a non-preset category; and determining the sample sentences in which the target type entities of the non-preset category are located as the sample sentences including target type entities of a non-preset category.
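The comparison step in this embodiment amounts to a set difference over the entities the trained recognition model returns, paired back with their sentences. A minimal sketch, with illustrative function and argument names not taken from the disclosure:

```python
def build_negative_samples(sentences, recognized, preset_entities):
    """Form (sentence, entity) pairs for entities the NER model found
    that are NOT in the preset-category set.

    recognized: mapping sentence -> entities the trained target type
    entity recognition model extracted from that sentence.
    preset_entities: entities known to belong to a preset category
    (e.g. client, supplier).
    """
    negatives = []
    for sent in sentences:
        for ent in recognized.get(sent, []):
            if ent not in preset_entities:       # comparison step
                negatives.append((sent, ent))    # sentence-entity pair
    return negatives
```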
In an exemplary embodiment of the present disclosure, determining a target type entity from sample statements included in a first set of sample statements includes: determining a target type entity from sample statements included in the first set of sample statements using at least one of: in a first mode, determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; in a second mode, based on a preset prefix dictionary tree constructed by the target type entity, the target type entity is searched from sample sentences included in the first sample sentence set.
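The second lookup mode can be sketched with a small prefix dictionary tree (trie) over known entity names. This is an illustrative implementation, not the data structure actually used in the disclosure:

```python
class Trie:
    """Minimal prefix dictionary tree over known entity names,
    in the spirit of the second lookup mode described above."""

    def __init__(self, entities):
        self.root = {}
        for word in entities:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["#"] = True  # end-of-entity marker

    def find_in(self, sentence):
        """Return every stored entity that occurs in the sentence."""
        hits = set()
        for i in range(len(sentence)):
            node = self.root
            for j in range(i, len(sentence)):
                node = node.get(sentence[j])
                if node is None:
                    break
                if "#" in node:
                    hits.add(sentence[i:j + 1])
        return hits
```

A dictionary-tree scan like this finds every known entity in a sentence in a single left-to-right pass, which is why it complements the regular-expression mode for entities that match no simple pattern.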
In an exemplary embodiment of the present disclosure, before extracting the first set of sample sentences from the preset sample text, the method further comprises: and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
According to a second aspect of the present disclosure, there is provided an entity extraction method, including: acquiring a text to be recognized; inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is trained in advance based on the method of the first aspect; determining a sentence to be classified from the text to be recognized based on the target type entity; and inputting the sentence to be classified into a pre-trained entity classification model to obtain entity category information representing the category of the target type entity, wherein the entity classification model is trained in advance based on the method of the first aspect.
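Assuming the two trained models are available as callables, the steps of the second aspect can be sketched as follows. The names and the simple period-based sentence split are illustrative stand-ins for real model inference and sentence segmentation:

```python
def extract_entities(text, ner_model, cls_model):
    """Sketch of the extraction flow: ner_model(text) -> entity list,
    cls_model((sentence, entity)) -> category label, mirroring the
    sentence-pair input described for the classification model."""
    results = []
    for ent in ner_model(text):                       # recognition step
        # sentence to be classified: the sentence containing the entity
        sentence = next((s for s in text.split(".") if ent in s), text)
        results.append((ent, cls_model((sentence, ent))))  # classification
    return results
```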
In an exemplary embodiment of the present disclosure, obtaining text to be recognized includes: acquiring an original text; and preprocessing the original text to obtain the text to be recognized which accords with the preset format.
In an exemplary embodiment of the present disclosure, determining a sentence to be classified from text to be recognized based on a target type entity includes: and forming statement pairs by the statement in which the target type entity is located and the target type entity, and determining the statement pairs as statements to be classified.
According to a third aspect of the present disclosure, there is provided an entity recognition model generating apparatus including: the first acquisition module is used for acquiring a first sample sentence set, wherein sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities; the first training module is used for training an initial target type entity identification model based on sample sentences in the first sample sentence set and corresponding entity labeling information to obtain a target type entity identification model; the second acquisition module is used for acquiring a second sample sentence set, wherein the second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information; the second training module is used for training the initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set and the negative sample sentences and the corresponding negative sample labeling information to obtain an entity classification model.
In an exemplary embodiment of the present disclosure, the first acquisition module includes: the first extraction unit is used for extracting a first sample sentence set from a preset sample text and determining a target type entity from sample sentences included in the first sample sentence set; the generating unit is used for generating entity marking information corresponding to the sample sentences included in the sample sentence subsets based on the positions of the target type entities in the sample sentences.
In an exemplary embodiment of the present disclosure, the second acquisition module includes: the second extraction unit is used for extracting an initial sample sentence set from a preset sample text; a first determining unit configured to determine, from an initial sample sentence set, a sample sentence including a target type entity of a preset category, and determine a sample sentence including a target type entity of a non-preset category; the second determining unit is used for determining statement pairs consisting of sample statements in which target type entities of a preset category are located and the target type entities of the preset category as positive sample statements and generating positive sample labeling information representing the target type entities of the preset category; and the third determining unit is used for determining statement pairs consisting of sample statements where target type entities of non-preset types are located and target type entities of non-preset types as negative sample statements and generating negative sample labeling information representing the target type entities of the non-preset types.
In an exemplary embodiment of the present disclosure, the first determining unit includes: a first determining subunit, configured to determine target type entities from the sample sentences in the initial sample sentence set by using the target type entity recognition model; a comparison subunit, configured to compare each determined target type entity with the target type entities of the preset category to obtain target type entities of a non-preset category; and a second determining subunit, configured to determine the sample sentences in which the target type entities of the non-preset category are located as the sample sentences including target type entities of a non-preset category.
In an exemplary embodiment of the present disclosure, the first extraction unit is further configured to: determining a target type entity from sample statements included in the first set of sample statements using at least one of: in a first mode, determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; in a second mode, based on a preset prefix dictionary tree constructed by the target type entity, the target type entity is searched from sample sentences included in the first sample sentence set.
In an exemplary embodiment of the present disclosure, the apparatus further includes: the preprocessing module is used for preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
According to a fourth aspect of the present disclosure, there is provided an entity extraction apparatus, including: an acquisition module, configured to acquire a text to be recognized; a recognition module, configured to input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is trained in advance based on the method of the first aspect; a determining module, configured to determine a sentence to be classified from the text to be recognized based on the target type entity; and an input module, configured to input the sentence to be classified into a pre-trained entity classification model to obtain entity category information representing the category of the target type entity, wherein the entity classification model is trained in advance based on the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-described method via execution of the executable instructions.
According to a sixth aspect of the present disclosure, there is provided a computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the above-mentioned method.
According to a seventh aspect of the present disclosure, there is provided a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for carrying out the steps of the above method.
As can be seen from the above technical solutions, the entity recognition model generating method and apparatus, the entity extracting method and apparatus, the electronic device, the computer readable storage medium and the computer program in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
In the entity recognition model generation method and apparatus, entity extraction method and apparatus, electronic device, computer-readable storage medium, and computer program of the embodiments of the present disclosure, a first sample sentence set and a second sample sentence set are acquired, and a machine learning method is used to train a target type entity recognition model based on the first sample sentence set and an entity classification model based on the second sample sentence set. Models with high extraction accuracy and high classification accuracy are thereby obtained. When extracting target type entities from a text, the target type entity recognition model and the entity classification model together automatically extract the target type entities and accurately and comprehensively obtain both the target type entities in the text and their categories. Compared with existing entity extraction techniques based on regular expressions, the scheme provided by the embodiments of the present disclosure has low maintenance cost and good flexibility. Compared with existing automatic entity extraction techniques, the scheme provided by the embodiments of the present disclosure has finer extraction granularity and higher accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a system diagram to which the present disclosure is applicable;
FIG. 2 is a flow chart of a method for generating an entity recognition model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart of a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for generating an entity recognition model according to another exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of an entity extraction method provided by another exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an entity recognition model generating apparatus according to an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a structure of an entity recognition model generating apparatus according to another exemplary embodiment of the present disclosure;
fig. 9 is a schematic structural view of an entity extraction apparatus according to an exemplary embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, apparatus, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present disclosure, "a plurality" means at least two, such as two or three, unless explicitly specified otherwise. The symbol "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the present disclosure, unless explicitly specified and limited otherwise, terms such as "connected" and the like are to be construed broadly and, for example, may be electrically connected or may communicate with each other; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
Exemplary System
Fig. 1 shows a schematic diagram of a system architecture 100 to which the entity recognition model generation method and apparatus, and the entity extraction method and apparatus of the embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a text processing application, a search class application, a web browser application, a shopping class application, an instant messaging tool, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, digital cinema projectors, and the like.
The server 105 may be a server providing various services. For example, a user transmits various types of text to the server 105 using the terminal device 103 (or the terminal device 101 or 102). A background text server can use the acquired sample text for model training, and can also use the trained models to extract target type entities from received text.
Exemplary method
Referring to fig. 2, which is a flowchart of an entity recognition model generation method according to an exemplary embodiment of the present disclosure, the present embodiment may be applied to an electronic device (such as the terminal devices 101, 102, 103 or the server 105 shown in fig. 1), and the method includes the following steps:
s210, acquiring a first sample sentence set.
S220, training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information to obtain a target type entity recognition model.
S230, acquiring a second sample sentence set.
S240, training an initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set and the negative sample sentences and the corresponding negative sample labeling information to obtain an entity classification model.
According to the entity recognition model generation method, the first sample sentence set and the second sample sentence set are obtained, the target type entity recognition model is trained based on the first sample sentence set by using the machine learning method, and the entity classification model is trained based on the second sample sentence set, so that a model with high extraction accuracy and classification accuracy can be obtained, and the trained target type entity recognition model and entity classification model can be used for accurately extracting target type entities from texts and accurately classifying the target type entities.
In S210, the electronic device may obtain the first set of sample sentences locally or remotely. The sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities.
The source of the first sample sentence set may include a plurality of sentences, for example, sentences extracted from a preset text, and sentences extracted from a database such as an MSRA data set.
The target type entity may be various types of entities, for example, the target type entity may be an organization entity (e.g., business name, utility name, etc.). The entity labeling information is used for indicating a target type entity in the sample sentence.
In general, the entity labeling information may be produced using the existing BIO labeling scheme. As an example, for the sample sentence "the client of the present bidding task is Beijing XX Technology Co., Ltd., and the bidding agency is Beijing YY Technology Co., Ltd.", the corresponding entity labeling information is "O O O O O O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG O O O O O O O B-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG I-ORG". Here, O indicates that the corresponding character is not part of a target type entity (here, an organization entity), B-ORG indicates that the corresponding character is the first character of an organization entity, and I-ORG indicates that the corresponding character is a non-first character of an organization entity.
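The mapping from entity occurrences to BIO labels can be sketched with a character-level toy implementation; the function name and tag handling are illustrative, not code from the disclosure:

```python
def bio_tags(sentence, entities, tag="ORG"):
    """Generate per-character BIO labels: B-ORG for an entity's first
    character, I-ORG for the rest, O elsewhere. `entities` are surface
    strings known to occur in the sentence."""
    labels = ["O"] * len(sentence)
    for ent in entities:
        start = sentence.find(ent)
        while start != -1:                      # label every occurrence
            labels[start] = f"B-{tag}"
            for k in range(start + 1, start + len(ent)):
                labels[k] = f"I-{tag}"
            start = sentence.find(ent, start + len(ent))
    return labels
```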
In this embodiment, the entity labeling information may be automatically generated by the electronic device, or may be manually set.
In S220, the electronic device may train the initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information, to obtain the target type entity recognition model.
Specifically, the initial target type entity recognition model may be built on existing neural network models of various structures. For example, a model constructed from an ERNIE (Enhanced Language Representation with Informative Entities) model or a BERT (Bidirectional Encoder Representations from Transformers) model combined with a conditional random field (Conditional Random Field, CRF) may be used. As another example, the initial target type entity recognition model may be constructed based on RoBERTa, XLNet, and the like.
The electronic device may use a machine learning method, take a sample sentence in the obtained first sample sentence set as input, take entity labeling information corresponding to the input sample sentence as expected output, train the initial target type entity recognition model, and obtain actual output for each training of the input sample sentence. The actual output is data actually output by the entity identification model of the initial target type and is used for representing entity labeling information. Then, the electronic device may adjust parameters of the initial target type entity recognition model based on the actual output and the expected output by using a gradient descent method and a back propagation method, and use the model obtained after each adjustment of the parameters as the initial target type entity recognition model for the next training, and end the training when a preset training end condition is met, so as to train to obtain the target type entity recognition model.
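The per-iteration pattern described above (forward pass, comparison of actual output with expected output, gradient-based parameter adjustment) can be illustrated on a toy one-parameter model. This is only a stand-in: the disclosed method applies the same loop to a neural sequence model via backpropagation:

```python
def train_toy(samples, lr=0.1, epochs=200):
    """Toy stand-in for the training loop above: a one-weight linear
    model y = w * x trained by gradient descent on squared error."""
    w = 0.0  # initial model parameter
    for _ in range(epochs):
        for x, expected in samples:
            actual = w * x                       # forward pass
            grad = 2 * (actual - expected) * x   # d(loss)/dw
            w -= lr * grad                       # parameter update
    return w
```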
It should be noted that the preset training end conditions may include, but are not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset count; the loss value calculated using a predetermined loss function, such as a cross entropy loss function, converges.
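The training procedure and its three end conditions might be sketched as follows; the `model`, `loss_fn`, and `optimizer` interfaces here are placeholders standing in for the actual implementation:

```python
import time

def train(model, batches, loss_fn, optimizer,
          max_seconds=3600.0, max_steps=10000, eps=1e-4):
    """Generic training loop illustrating the three preset end conditions:
    elapsed time, iteration count, and loss convergence. Interfaces are
    hypothetical stand-ins, not the patent's implementation."""
    start, prev_loss, step = time.time(), float("inf"), 0
    for step, (x, y_expected) in enumerate(batches, 1):
        y_actual = model(x)                      # actual output
        loss = loss_fn(y_actual, y_expected)     # compare with expected output
        optimizer.step(loss)                     # gradient descent + backprop
        if time.time() - start > max_seconds:    # condition 1: duration
            return "timeout", step
        if step >= max_steps:                    # condition 2: iteration count
            return "max_steps", step
        if abs(prev_loss - loss) < eps:          # condition 3: loss converged
            return "converged", step
        prev_loss = loss
    return "exhausted", step
```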
In S230, the electronic device may obtain the second set of sample sentences locally or remotely. The second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information.
The preset categories may include at least one category. For example, when the target type entity is an organization entity, the preset categories may include two types of clients and suppliers, and the corresponding non-preset categories are non-clients and non-suppliers.
The positive sample annotation information is used for indicating that the positive sample sentence comprises a target type entity of a preset category, and the negative sample annotation information is used for indicating that the negative sample sentence comprises a target type entity of a non-preset category. As an example, when the preset categories include two types of clients and suppliers, the number 0 indicates that the category of the target type entity is a client, the number 1 indicates that the category of the target type entity is a supplier, and the number 2 indicates that the category of the target type entity is a non-client and a non-supplier.
Optionally, the target type entity of the preset category included in the positive sample sentence and the target type entity of the preset category included in the negative sample sentence may be marked by a mark indicating a position of the target type entity in the sample sentence, so that the electronic device may determine the position of the target type entity of the preset category or the target type entity of the non-preset category from the sample sentence.
Alternatively, positive sample sentences and negative sample sentences may each be composed of a sentence pair: a positive sample sentence consists of a sentence and a target type entity of a preset category, and a negative sample sentence consists of a sentence and a target type entity of a non-preset category. For example, a positive sample sentence may include the following sentence pair: "the client of the present bidding task is Beijing XX Technology Co., Ltd." and "Beijing XX Technology Co., Ltd.".
In S240, the electronic device may train the initial entity classification model based on the positive sample sentence and the corresponding positive sample labeling information in the second sample sentence set, and the negative sample sentence and the corresponding negative sample labeling information, to obtain the entity classification model.
Specifically, the initial entity classification model may be built based on existing neural network models of various structures. For example, ERNIE may be used as the initial entity classification model to construct a classification model for classifying the target type entities. For another example, the initial entity classification model may include a word2vec model for determining feature vectors of sentences and a classifier for classifying the feature vectors, such as an SVM, XGBoost, or an ensemble learning model (e.g., stacking).
It should be noted that the entity classification model may be a binary classification model or a multi-class classification model. A multi-class classification model may directly determine the category of the target type entity in the input sentence. There may be at least one binary classification model; for example, when there are multiple preset categories, a binary classification model corresponding to each preset category may be trained, each used to determine whether the target type entity in the input sentence belongs to its corresponding preset category.
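The per-category binary classification variant could be sketched like this, with trivial stand-in models; all names and the keyword-matching "models" are illustrative only:

```python
def classify(entity_sentence_pair, binary_models):
    """Run one binary classifier per preset category (e.g. "client",
    "supplier"); return the first matching category, or None if the
    entity belongs to no preset category. Hypothetical interface."""
    for category, model in binary_models.items():
        if model(entity_sentence_pair):   # True if entity is this category
            return category
    return None

# Toy stand-ins for trained binary classifiers: keyword checks on the sentence.
models = {"client": lambda pair: "client" in pair[0],
          "supplier": lambda pair: "supplier" in pair[0]}
```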
Using a machine learning method, the electronic device may train the initial entity classification model by taking positive sample sentences in the obtained second sample sentence set as input with the corresponding positive sample labeling information as the expected output, and taking negative sample sentences as input with the corresponding negative sample labeling information as the expected output. An actual output can be obtained for each input sample sentence during training. The actual output is the data actually produced by the initial entity classification model and is used to represent positive sample labeling information or negative sample labeling information. The electronic device may then adjust the parameters of the initial entity classification model based on the actual output and the expected output using gradient descent and back propagation, use the model obtained after each parameter adjustment as the initial entity classification model for the next round of training, and end training when a preset training end condition is met, thereby obtaining the trained entity classification model.
It should be noted that the preset training end conditions may include, but are not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset count; the loss value calculated using a predetermined loss function, such as a cross entropy loss function, converges.
In some alternative implementations, as shown in fig. 3, step S210 includes the following sub-steps:
S2101, extracting a first sample sentence set from a preset sample text, and determining a target type entity from the sample sentences included in the first sample sentence set.
The number of the preset sample texts may be at least one. The preset sample text may be various types of text, for example, the preset sample text may include a plurality of bidding documents obtained from a bidding website. As another example, the preset sample text may also include text obtained from a preset dataset (e.g., an MSRA dataset). Typically, the electronic device may randomly extract a plurality of sentences from the preset sample text to form a first sample sentence set.
In this step, the method of determining the target type entity in the sample sentence may include various kinds.
In some alternative implementations, the electronic device may determine the target type entity from sample statements included in the first set of sample statements using at least one of:
In one mode, a target type entity is determined from sample sentences included in a first sample sentence set based on a preset regular expression.
As an example, for an organization entity whose category is client, a corresponding regular expression may be constructed (the specific expression is not reproduced here). Similarly, using multiple regular expressions, organization entities of each category may be determined from the sample sentences included in the first sample sentence set.
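A hedged illustration of mode one follows; since the actual regular expression is not reproduced above, the pattern below is an invented English-language stand-in for demonstration only:

```python
import re

# Hypothetical pattern for client organizations; NOT the patent's actual
# expression. It looks for "client ... is <Org> Co., Ltd." within one clause.
CLIENT_PATTERN = re.compile(r"client[^,.;]*?\bis\s+(?P<ent>[A-Za-z ]+?Co\., Ltd\.)")

def find_clients(sentence):
    """Extract client-category organization entities via the regex above."""
    return [m.group("ent") for m in CLIENT_PATTERN.finditer(sentence)]
```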
In a second mode, based on a preset prefix dictionary tree constructed by the target type entity, the target type entity is searched from sample sentences included in the first sample sentence set.
The prefix dictionary tree may be pre-constructed by using a database storing target type entities, and the electronic device may search the target type entities included in the prefix dictionary tree from each sample sentence included in the first sample sentence set.
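A minimal prefix dictionary tree (trie) lookup of the kind described might look like this; the dict-based node representation and the longest-match policy are assumptions:

```python
def build_trie(entities):
    """Prefix dictionary tree over known entity strings; '$' marks an end."""
    root = {}
    for ent in entities:
        node = root
        for ch in ent:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def find_entities(sentence, trie):
    """Scan `sentence`, reporting the longest trie match at each position."""
    found = []
    for i in range(len(sentence)):
        node, longest = trie, None
        for j in range(i, len(sentence)):
            if sentence[j] not in node:
                break
            node = node[sentence[j]]
            if "$" in node:
                longest = sentence[i:j + 1]
        if longest:
            found.append(longest)
    return found
```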
It should be noted that the target type entity may be determined using either the first mode or the second mode alone, or using both modes together, in which case the union of the target type entities extracted by the two modes may be taken.
These two entity extraction modes enable the electronic device to automatically extract target type entities from the sample sentences included in the first sample sentence set, and combining the two modes makes the extracted target type entities more comprehensive.
S2102, generating, based on the position of the target type entity in the sample sentence, entity labeling information corresponding to the sample sentences included in the first sample sentence set.
The entity labeling information may indicate a location of the target type entity in the sample sentence, and specifically reference may be made to the example in step S210.
According to the embodiment corresponding to fig. 3, the electronic device automatically completes the extraction of the first sample sentence set from the preset sample text, determines the target type entity from the sample sentences and generates the entity labeling information, so that the efficiency of labeling the sample sentences can be improved compared with a manual labeling method.
In some alternative implementations, as shown in fig. 4, the step S230 includes the following substeps to obtain the positive sample sentence and the negative sample sentence included in the second sample sentence set:
S2301, extracting an initial sample sentence set from a preset sample text.
The preset sample text may be the same as or different from the preset sample text described in S2101. The number of preset sample texts of this step may be at least one. The preset sample text may be various types of text, for example, the preset sample text may include a plurality of bidding documents obtained from a bidding website. Typically, the electronic device may randomly extract a plurality of sentences from the preset sample text to form an initial sample sentence set.
S2302, determining a sample sentence including a target type entity of a preset category from the initial sample sentence set, and determining a sample sentence including a target type entity of a non-preset category.
As an example, the electronic device may determine, from the sample sentences included in the initial sample sentence set, a target type entity of a preset category based on a regular expression corresponding to that preset category, and may determine a target type entity of a non-preset category based on a regular expression corresponding to that non-preset category.
S2303, determining statement pairs consisting of sample statements in which target type entities of a preset category are located and target type entities of the preset category as positive sample statements, and generating positive sample labeling information representing the target type entities of the preset category.
As an example, when the preset categories include the two categories of client and supplier, the number 0 indicates that the category of the target type entity is client, the number 1 indicates supplier, and the number 2 indicates non-client and non-supplier. A positive sample sentence may include the following sentence pair: "the client of the present bidding task is Beijing XX Technology Co., Ltd., and the bidding agency is Beijing YY Technology Co., Ltd." and "Beijing XX Technology Co., Ltd."; the positive sample labeling information corresponding to this positive sample sentence is the number 0.
S2304, determining statement pairs consisting of sample statements where target type entities of non-preset categories are located and target type entities of non-preset categories as negative sample statements, and generating negative sample labeling information representing the target type entities of the non-preset categories.
As an example, a sentence pair included in a negative sample sentence may be: "the client of the present bidding task is Beijing XX Technology Co., Ltd., and the bidding agency is Beijing YY Technology Co., Ltd." and "Beijing YY Technology Co., Ltd."; the negative sample labeling information corresponding to this negative sample sentence is the number 2.
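Putting S2303 and S2304 together, sample construction could be sketched as below, using the 0/1/2 label scheme from the examples; the helper name and argument layout are hypothetical:

```python
# Label scheme from the examples: 0 = client, 1 = supplier, 2 = neither.
CLIENT, SUPPLIER, OTHER = 0, 1, 2

def make_samples(sentence, client_ents, supplier_ents, other_ents):
    """Compose (sentence, entity) pairs with their category labels.
    Pairs over preset-category entities are positive samples; pairs over
    non-preset-category entities are negative samples."""
    samples = []
    for ent in client_ents:
        samples.append(((sentence, ent), CLIENT))    # positive sample
    for ent in supplier_ents:
        samples.append(((sentence, ent), SUPPLIER))  # positive sample
    for ent in other_ents:
        samples.append(((sentence, ent), OTHER))     # negative sample
    return samples
```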
According to the embodiment corresponding to fig. 4, the electronic equipment automatically completes the extraction of the initial sample sentence set from the preset sample text, determines the positive sample sentence and the negative sample sentence from each sample sentence, and generates the positive sample labeling information and the negative sample labeling information, so that the labeling efficiency of the sample sentences can be improved compared with a manual labeling method.
In some alternative implementations, as shown in fig. 5, in S2302, the following sub-steps may be included to determine sample statements including target type entities of non-preset categories from the initial set of sample statements:
S23021, determining the target type entities from the sample sentences in the initial sample sentence set by using the target type entity recognition model.
The target type entity recognition model here is the model trained in step S220. It is obtained by training on a large number of training samples using a machine learning method and has high recognition accuracy. The electronic device may sequentially input each sentence in the initial sample sentence set into the target type entity recognition model to obtain the target type entities included in each input sample sentence.
S23022, comparing each determined target type entity with the target type entities of the preset categories to obtain target type entities of non-preset categories.
Specifically, the difference set between the set of target type entities output by the model and the set of preset-category target type entities determined in the embodiment corresponding to fig. 4 may be taken, so as to obtain the target type entities of non-preset categories.
S23023, determining the sample sentences in which the target type entities of non-preset categories are located as the sample sentences including target type entities of non-preset categories.
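The difference-set step admits a one-line sketch (the helper name is hypothetical, and sorting is added only to make the result deterministic):

```python
def non_preset_entities(recognized, preset):
    """Entities found by the recognition model minus those already matched
    to a preset category, i.e. the set difference described above."""
    return sorted(set(recognized) - set(preset))
```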
The method described in the corresponding embodiment of fig. 5 can comprehensively and accurately determine the target type entity from the sample sentences in the initial sample sentence set by using the trained target type entity identification model, and further can comprehensively and accurately generate the negative sample sentence.
In some alternative implementations, prior to S210, the electronic device may further perform the steps of:
and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
In particular, the electronic device may replace or remove characters that interfere with model training. As an example, the preset initial text may be an html format text, and the electronic device may replace an html tag included therein with a space or a blank character, replace a carriage return with a space, remove an unnecessary space character, replace an english punctuation mark with a chinese punctuation mark, and the like.
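A rough version of this cleanup, with illustrative rather than exhaustive rules:

```python
import re

def preprocess(raw_html):
    """Sketch of the described cleanup: strip html tags, normalize line
    breaks and spaces, and map some English punctuation to Chinese
    full-width forms. The rule set is illustrative, not exhaustive."""
    text = re.sub(r"<[^>]+>", " ", raw_html)           # html tags -> space
    text = text.replace("\r", " ").replace("\n", " ")  # line breaks -> space
    text = re.sub(r"\s{2,}", " ", text).strip()        # collapse extra spaces
    for en, zh in ((",", "，"), (";", "；"), ("?", "？"), ("!", "！")):
        text = text.replace(en, zh)                    # EN -> CN punctuation
    return text
```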
According to the embodiment, the preset initial text is preprocessed in advance to obtain the preset sample text, so that the format of the preset sample text meets the requirement of model training, interference of some unnecessary characters on model training is reduced, and the accuracy of model identification is improved.
With continued reference to fig. 6, which is a flowchart of an entity extraction method according to an exemplary embodiment of the present disclosure, the method may be applied to an electronic device (such as the terminal devices 101, 102, 103 or the server 105 shown in fig. 1) and includes the following steps:
s610, acquiring a text to be recognized.
S620, inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity.
S630, determining sentences to be classified from the texts to be recognized based on the target type entity.
S640, inputting the sentences to be classified into a pre-trained entity classification model to obtain entity category information representing the categories of the target type entities.
According to the entity extraction method provided by the embodiment of the disclosure, the target type entity can be accurately and efficiently extracted from the text to be identified and the category of the target type entity can be determined by using the target type entity identification model and the entity classification model which are obtained through training in the corresponding embodiment of the fig. 2.
In S610, the electronic device may obtain text to be recognized locally or remotely. Wherein the text to be recognized may be various types of text. For example, the text to be identified may be a bidding document obtained from a bidding website.
In S620, the electronic device may input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity.
The target type entity recognition model is trained in advance based on the method described in the embodiment corresponding to fig. 2. Specifically, the electronic device may sequentially input the sentences included in the text to be recognized into the target type entity recognition model to obtain the target type entities in each sentence. As an example, when the text to be identified is a bidding document, the target type entity may be an organization entity.
In S630, the electronic device may determine a sentence to be classified from the text to be recognized based on the target type entity.
Optionally, the electronic device may use the sentence in which the target type entity is located as the sentence to be classified and mark the target type entity in the sentence to be classified.
In S640, the electronic device may input the sentence to be classified into a pre-trained entity classification model, to obtain entity class information characterizing a class of the target type entity.
The entity classification model is trained in advance based on the method described in the embodiment corresponding to fig. 2. The electronic device may sequentially input each sentence to be classified (for example, the sentence pair described above) into the entity classification model to obtain entity category information corresponding to each target type entity. As an example, when the text to be identified is a bidding document and the target type entity is an organization entity, the entity category information corresponding to a certain target type entity may characterize the category of that entity as client, supplier, or non-client and non-supplier. When the method is applied to the bidding field, accurate client and supplier information can be automatically extracted from bidding documents, improving user experience.
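The overall S610 to S640 flow can be sketched end to end; here `recognize` and `classify` stand in for the two trained models, and the full-width-period sentence split is a simplification:

```python
def extract_entities(text, recognize, classify):
    """End-to-end sketch of steps S610-S640. `recognize` and `classify`
    are placeholders for the trained recognition and classification
    models; any real implementation would differ."""
    results = []
    for sentence in text.split("。"):                  # naive segmentation
        for entity in recognize(sentence):            # S620: find entities
            pair = (sentence, entity)                 # S630: sentence pair
            results.append((entity, classify(pair)))  # S640: classify pair
    return results
```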
In some alternative implementations, the step S610 may include the following substeps:
first, an original text is acquired.
Wherein the original text may be various types of text. As an example, the original text may be text in html format obtained from a bidding website.
And then preprocessing the original text to obtain the text to be recognized which accords with the preset format.
In general, the preprocessing method may be the same as in the alternative preprocessing implementation described above for the embodiment corresponding to fig. 2. For example, the electronic device may replace html tags included in the original text in html format with spaces or empty strings, replace carriage returns with spaces, remove redundant space characters, replace English punctuation marks with Chinese punctuation marks, and so on.
According to the method and the device, the original text is preprocessed in advance to obtain the text to be recognized, so that the format of the text to be recognized meets the requirement of model recognition, interference of unnecessary characters on model recognition is reduced, and the precision of entity classification is improved.
In some alternative implementations, S630 may be performed as follows:
and forming statement pairs by the statement in which the target type entity is located and the target type entity, and determining the statement pairs as statements to be classified.
For example, for the sentence "the client of the present bidding task is Beijing XX Technology Co., Ltd., and the bidding agency is Beijing YY Technology Co., Ltd.", two sentence pairs can be obtained: the sentence paired with "Beijing XX Technology Co., Ltd.", and the same sentence paired with "Beijing YY Technology Co., Ltd.". Both sentence pairs are sentences to be classified.
According to the method, the sentence pairs containing the sentences and the target type entities are used as sentences to be classified, so that the sentences containing the target type entities can be analyzed more pertinently when the model classifies the entities, and the accuracy of entity classification can be improved.
Exemplary apparatus
Fig. 7 schematically illustrates a structural diagram of an identification model generating apparatus according to an embodiment of the present disclosure. The identification model generating apparatus provided in the embodiment of the present disclosure may be provided on a terminal device, may be provided on a server, or may be provided partially on a terminal device and partially on a server; for example, it may be provided on the server 105 in fig. 1, but the present disclosure is not limited thereto.
The identification model generating device provided by the embodiment of the disclosure may include: a first obtaining module 710, configured to obtain a first sample sentence set, where a sample sentence in the first sample sentence set includes a target type entity and has corresponding entity labeling information for characterizing the target type entity; the first training module 720 is configured to train the initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information, to obtain a target type entity recognition model; a second obtaining module 730, configured to obtain a second sample sentence set, where the second sample sentence set includes a positive sample sentence and a negative sample sentence, the positive sample sentence includes a target type entity of a preset category and has corresponding positive sample labeling information, and the negative sample sentence includes a target type entity of a non-preset category and has corresponding negative sample labeling information; the second training module 740 is configured to train the initial entity classification model based on the positive sample sentence and the corresponding positive sample labeling information in the second sample sentence set, and the negative sample sentence and the corresponding negative sample labeling information, to obtain an entity classification model.
In this embodiment, the first obtaining module 710 may obtain the first set of sample sentences locally or remotely. The sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities.
The source of the first sample sentence set may include a plurality of sentences, for example, sentences extracted from a preset text, and sentences extracted from a database such as an MSRA data set.
The target type entity may be various types of entities, for example, the target type entity may be an organization entity (e.g., business name, utility name, etc.). The entity labeling information is used for indicating a target type entity in the sample sentence.
In this embodiment, the entity labeling information may be automatically generated by the device, or may be manually set.
In this embodiment, the first training module 720 may train the initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information to obtain the target type entity recognition model.
Specifically, the initial target type entity recognition model may be built based on existing neural network models of various structures. For example, it may be a model constructed from an ERNIE (Enhanced Language Representation with Informative Entities) model or a BERT (Bidirectional Encoder Representations from Transformers) model combined with a conditional random field (CRF). For another example, the initial target type entity recognition model may be constructed based on RoBERTa, XLNet, etc.
In this embodiment, the second obtaining module 730 may obtain the second sample statement set locally or remotely. The second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information.
The preset categories may include at least one category. For example, when the target type entity is an organization entity, the preset categories may include two types of clients and suppliers, and the corresponding non-preset categories are non-clients and non-suppliers.
The positive sample annotation information is used for indicating that the positive sample sentence comprises a target type entity of a preset category, and the negative sample annotation information is used for indicating that the negative sample sentence comprises a target type entity of a non-preset category. As an example, when the preset categories include two types of clients and suppliers, the number 0 indicates that the category of the target type entity is a client, the number 1 indicates that the category of the target type entity is a supplier, and the number 2 indicates that the category of the target type entity is a non-client and a non-supplier.
In this embodiment, the second training module 740 may train an initial entity classification model built based on existing neural network models of various structures. For example, ERNIE may be used as the initial entity classification model to construct a classification model for classifying the target type entities. For another example, the initial entity classification model may include a word2vec model for determining feature vectors of sentences and a classifier for classifying the feature vectors, such as an SVM, XGBoost, or an ensemble learning model (e.g., stacking).
It should be noted that the entity classification model may be a binary classification model or a multi-class classification model. A multi-class classification model may directly determine the category of the target type entity in the input sentence. There may be at least one binary classification model; for example, when there are multiple preset categories, a binary classification model corresponding to each preset category may be trained, each used to determine whether the target type entity in the input sentence belongs to its corresponding preset category.
Referring to fig. 8, fig. 8 is a schematic structural view of an identification model generating apparatus provided in another exemplary embodiment of the present disclosure.
In some alternative implementations, the first acquisition module 710 includes: a first extraction unit 7101, configured to extract a first sample sentence set from a preset sample text and determine a target type entity from the sample sentences included in the first sample sentence set; a generating unit 7102, configured to generate, based on the position of the target type entity in the sample sentence, entity labeling information corresponding to the sample sentences included in the first sample sentence set.
In some alternative implementations, the second acquisition module 730 includes: a second extracting unit 7301, configured to extract an initial sample sentence set from a preset sample text; a first determining unit 7302, configured to determine, from the initial sample sentence set, sample sentences including target type entities of preset categories and sample sentences including target type entities of non-preset categories; a second determining unit 7303, configured to determine, as a positive sample sentence, a sentence pair formed by a sample sentence in which a target type entity of a preset category is located and that target type entity, and to generate positive sample labeling information representing the target type entity of the preset category; a third determining unit 7304, configured to determine, as a negative sample sentence, a sentence pair formed by a sample sentence in which a target type entity of a non-preset category is located and that target type entity, and to generate negative sample labeling information representing the target type entity of the non-preset category.
In some alternative implementations, the first determining unit 7302 includes: a first determining subunit 73021, configured to determine, using the target type entity recognition model, target type entities from the sample sentences in the initial sample sentence set; a comparison subunit 73022, configured to compare each determined target type entity with the target type entities of the preset categories to obtain target type entities of non-preset categories; a second determining subunit 73023, configured to determine the sample sentences in which the target type entities of non-preset categories are located as the sample sentences including target type entities of non-preset categories.
In some alternative implementations, the first extraction unit 7101 is further to: determining a target type entity from sample statements included in the first set of sample statements using at least one of: in a first mode, determining a target type entity from sample sentences included in a first sample sentence set based on a preset regular expression; in a second mode, based on a preset prefix dictionary tree constructed by the target type entity, the target type entity is searched from sample sentences included in the first sample sentence set.
In some alternative implementations, the apparatus further includes: the preprocessing module 750 is configured to preprocess the preset initial text to obtain a preset sample text that accords with a preset format.
According to the entity recognition model generation device provided by the embodiment of the disclosure, the first sample sentence set and the second sample sentence set are obtained, the target type entity recognition model is trained based on the first sample sentence set by using a machine learning method, and the entity classification model is trained based on the second sample sentence set, so that a model with high extraction accuracy and classification accuracy can be obtained, and the trained target type entity recognition model and entity classification model are used for accurately extracting target type entities from texts and accurately classifying the target type entities.
Specific implementations of each module, unit and subunit in the entity recognition model generating apparatus provided in the embodiments of the present disclosure may refer to the content in the entity recognition model generating method, which is not described herein again.
Fig. 9 schematically illustrates a structural diagram of an entity extraction apparatus according to an embodiment of the present disclosure. The entity extraction apparatus provided in the embodiment of the present disclosure may be disposed on a terminal device, may be disposed on a server, or may be disposed partially on a terminal device and partially on a server; for example, it may be disposed on the server 105 in fig. 1, but the present disclosure is not limited thereto.
The entity extraction device provided by the embodiment of the disclosure may include: a third obtaining module 910, configured to obtain a text to be identified; a recognition module 920, configured to input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, where the target type entity recognition model is pre-trained based on the method of the first aspect; a determining module 930, configured to determine, based on the target type entity, a sentence to be classified from the text to be recognized; and a classification module 940, configured to input the sentence to be classified into a pre-trained entity classification model to obtain entity class information characterizing the class of the target type entity, where the entity classification model is pre-trained based on the method of the first aspect.
In this embodiment, the third obtaining module 910 may obtain the text to be recognized locally or remotely. Wherein the text to be recognized may be various types of text. For example, the text to be identified may be a bidding document obtained from a bidding website.
In this embodiment, the recognition module 920 may input the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity.
The target type entity recognition model is trained in advance based on the method described in the corresponding embodiment of fig. 2. Specifically, the recognition module 920 may sequentially input the sentences included in the text to be recognized into the target type entity recognition model to obtain the target type entities in each sentence. As an example, when the text to be identified is a bidding document, the target type entity may be an organization entity.
In this embodiment, the determining module 930 may determine the sentence to be classified from the text to be recognized based on the target type entity.
Alternatively, the determining module 930 may take the sentence in which the target type entity is located as the sentence to be classified and mark the target type entity in the sentence to be classified.
In this embodiment, the classification module 940 may input the sentence to be classified into a pre-trained entity classification model to obtain entity class information characterizing the class of the target type entity.
The entity classification model is trained in advance based on the method described in the corresponding embodiment of fig. 2. The classification module 940 may sequentially input each sentence to be classified (for example, each sentence pair) into the entity classification model to obtain the entity category information corresponding to each target type entity. As an example, when the text to be identified is a bidding document and the target type entity is an organization entity, the entity category information corresponding to a given target type entity may characterize that entity as a customer, a vendor, or neither a customer nor a vendor. When the method is applied to the bidding field, accurate customer and vendor information can be automatically extracted from bidding documents, improving user experience.
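Taken together, modules 920-940 form a simple pipeline: recognize entities per sentence, pair each entity with its sentence, and classify the pair. A minimal sketch, assuming hypothetical callables for the two pre-trained models and a sentence splitter (none of these interfaces are specified by the patent):

```python
def extract_entities(text, ner_model, classifier, split_sentences):
    """Return (entity, category) pairs extracted from the text to be recognized."""
    results = []
    for sentence in split_sentences(text):      # iterate sentences of the text
        for entity in ner_model(sentence):      # recognition module 920
            pair = (sentence, entity)           # determining module 930: sentence pair
            category = classifier(pair)         # classification module 940
            results.append((entity, category))
    return results
```

With a bidding document as input, `category` would be, for example, "customer", "vendor", or neither.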
In some alternative implementations, the third acquisition module 910 is further configured to: acquire an original text; and preprocess the original text to obtain a text to be recognized that conforms to the preset format.
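A minimal preprocessing sketch; the patent does not define the "preset format", so this assumes it means plain text with markup removed and whitespace normalized:

```python
import re

def preprocess(raw_text):
    """Normalize an original text (e.g. a scraped bidding page) into the text to be recognized."""
    text = re.sub(r"<[^>]+>", " ", raw_text)  # drop any HTML remnants
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()
```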
In some alternative implementations, the determining module 930 is further configured to: form a statement pair from the statement in which the target type entity is located and the target type entity, and determine the statement pair as the statement to be classified.
According to the entity extraction device provided by the embodiment of the disclosure, the target type entity can be accurately and efficiently extracted from the text to be identified and the category of the target type entity can be determined by using the target type entity identification model and the entity classification model which are obtained through training in the corresponding embodiment of fig. 2.
The specific implementation of each module, unit and subunit in the entity extraction apparatus provided in the embodiments of the present disclosure may refer to the content in the entity extraction method, which is not described herein again.
It should be noted that although in the above detailed description several modules, units and sub-units of the apparatus for action execution are mentioned, this division is not mandatory. Indeed, the features and functions of two or more modules, units, and sub-units described above may be embodied in one module, unit, and sub-unit, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module, unit, and sub-unit described above may be further divided into ones that are embodied by a plurality of modules, units, and sub-units.
Exemplary electronic device
As shown in fig. 10, the example electronic device 100 includes a processor 1001 for executing software routines. Although a single processor is shown for clarity, the electronic device 100 may also include a multi-processor system. The processor 1001 is connected to a communication infrastructure 1002 for communicating with other components of the electronic device 100. The communication infrastructure 1002 may include, for example, a communication bus, a crossbar switch, or a network.
The electronic device 100 also includes memory, such as random access memory (Random Access Memory, RAM), which may include a main memory 1003 and a secondary memory 1010. The secondary memory 1010 may include, for example, a hard disk drive 1011 and/or a removable storage drive 1012, and the removable storage drive 1012 may include a floppy disk drive, a magnetic tape drive, an optical disk drive, and the like. The removable storage drive 1012 reads from and/or writes to a removable storage unit 1013 in a conventional manner. The removable storage unit 1013 may comprise a floppy disk, magnetic tape, optical disk, etc. which is read from and written to by removable storage drive 1012. One skilled in the relevant art will appreciate that the removable storage unit 1013 includes a computer-readable storage medium having stored thereon computer-executable program code instructions and/or data.
In an alternative embodiment, secondary memory 1010 may additionally or alternatively include other similar means for allowing computer programs or other instructions to be loaded into electronic device 100. Such means may include, for example, a removable storage unit 1021 and an interface 1020. Examples of the removable storage unit 1021 and interface 1020 include: a program cartridge and cartridge interface (such as those found in video game console devices), a removable memory chip (such as an EPROM or PROM) and associated socket, and other removable storage units 1021 and interfaces 1020 that allow software and data to be transferred from the removable storage unit 1021 to the electronic device 100.
The electronic device 100 also includes at least one communication interface 1040. Communication interface 1040 allows software and data to be transferred between electronic device 100 and external devices via communication path 1041. In various embodiments of the present disclosure, communication interface 1040 allows data to be transferred between electronic device 100 and a data communication network, such as a public data or private data communication network. Communication interface 1040 may be used to exchange data between different electronic devices 100, such electronic devices 100 forming part of an interconnected computer network. Examples of communication interface 1040 may include a modem, a network interface (such as an ethernet card), a communication port, an antenna with associated circuitry, and so forth. Communication interface 1040 may be wired or may be wireless. Software and data transferred via communication interface 1040 are in the form of signals which may be electronic, magnetic, optical or other signals capable of being received by communication interface 1040. These signals are provided to a communication interface via a communication path 1041.
As shown in fig. 10, the electronic device 100 further includes a display interface 1031 and an audio interface 1032, the display interface 1031 performing operations for rendering images to an associated display 1030, the audio interface 1032 performing operations for playing audio content through an associated speaker 1033.
In this document, the term "computer program product" may refer, in part, to: a removable storage unit 1013, a removable storage unit 1021, a hard disk installed in the hard disk drive 1011, or a carrier wave carrying software through a communication path 1041 (wireless link or cable) to the communication interface 1040. Computer-readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to electronic device 100 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROMs, DVDs, blu-ray (TM) optical disks, hard disk drives, ROMs or integrated circuits, USB memory, magneto-optical disks, or computer-readable cards such as PCMCIA cards, etc., whether internal or external to electronic device 100. Transitory or non-tangible computer readable transmission media may also participate in providing software, applications, instructions, and/or data to the electronic device 100, examples of such transmission media include radio or infrared transmission channels, network connections to another computer or another networked device, and the internet or intranets including email transmissions and information recorded on websites, and the like.
Computer programs (also called computer program code) are stored in main memory 1003 and/or secondary memory 1010. Computer programs may also be received via communications interface 1040. Such computer programs, when executed, enable the electronic device 100 to perform one or more features of the embodiments discussed herein. In various embodiments, the computer programs, when executed, enable the processor 1001 to perform the features of the embodiments described above. Such computer programs thus represent controllers of the electronic device 100.
The software may be stored in a computer program product and loaded into the electronic device 100 using the removable storage drive 1012, hard drive 1011, or interface 1020. Alternatively, the computer program product may be downloaded to the electronic device 100 via the communication path 1041. The software, when executed by the processor 1001, causes the electronic device 100 to perform the functions of the embodiments described herein.
It should be understood that the embodiment of fig. 10 is given by way of example only. Accordingly, in some embodiments, one or more features of the electronic device 100 may be omitted. Moreover, in some embodiments, one or more features of the electronic device 100 may be combined together. Additionally, in some embodiments, one or more features of the electronic device 100 may be separated into one or more components.
It will be appreciated that the elements shown in fig. 10 serve to provide a way to perform the various functions and operations of the servers described in the above embodiments.
In one embodiment, a server may be generally described as a physical device comprising at least one processor and at least one memory including computer program code. The at least one memory and the computer program code are configured to, with the at least one processor, cause the physical device to perform the necessary operations.
Exemplary computer-readable storage Medium
Embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the functions of the methods shown in fig. 2-6.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by an electronic device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in essence or what contributes to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.
Exemplary computer program
The disclosed embodiments also provide a computer program product for storing computer readable instructions that, when executed, cause a computer to perform the entity recognition model generation method or the entity extraction method in any of the possible implementations described above.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In one alternative example, the computer program product is embodied as a computer storage medium, and in another alternative example, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with program instructions.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (13)

1. An entity recognition model generation method, characterized by comprising the following steps:
acquiring a first sample sentence set, wherein sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities;
training an initial target type entity recognition model based on sample sentences in the first sample sentence set and corresponding entity labeling information to obtain a target type entity recognition model;
obtaining a second sample sentence set, wherein the second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information;
and training an initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set and the negative sample sentences and the corresponding negative sample labeling information to obtain an entity classification model.
2. The method of claim 1, wherein the obtaining the first set of sample statements comprises:
extracting a first sample sentence set from a preset sample text, and determining a target type entity from sample sentences included in the first sample sentence set;
and generating entity labeling information corresponding to the sample sentences included in the first sample sentence set based on the position of the target type entity in the sample sentences.
3. The method of claim 1, wherein the obtaining a second set of sample statements comprises:
extracting an initial sample sentence set from a preset sample text;
determining sample sentences including target type entities of preset categories from the initial sample sentence set, and determining sample sentences including target type entities of non-preset categories;
determining statement pairs consisting of sample statements in which the target type entities of the preset category are located and the target type entities of the preset category as positive sample statements, and generating positive sample labeling information representing the target type entities of the preset category;
and determining statement pairs consisting of sample statements where the target type entities of the non-preset category are located and the target type entities of the non-preset category as negative sample statements, and generating negative sample labeling information representing the target type entities of the non-preset category.
4. A method according to claim 3, wherein said determining sample sentences including target type entities of non-preset categories comprises:
determining a target type entity from sample sentences in the initial sample sentence set by using the target type entity identification model;
comparing each determined target type entity with the target type entities of the preset category to obtain target type entities of a non-preset category;
and determining the sample statements in which the target type entities of the non-preset category are located as the sample statements including target type entities of the non-preset category.
5. The method of claim 2, wherein the determining a target type entity from the sample statements included in the first set of sample statements comprises:
determining a target type entity from sample statements included in the first set of sample statements using at least one of:
in a first mode, determining a target type entity from sample sentences included in the first sample sentence set based on a preset regular expression;
and in a second mode, searching the target type entity from sample sentences included in the first sample sentence set based on a preset prefix dictionary tree constructed by the target type entity.
6. The method according to one of claims 2-5, characterized in that before said extracting the first set of sample sentences from the preset sample text, the method further comprises:
and preprocessing the preset initial text to obtain a preset sample text conforming to a preset format.
7. An entity extraction method, comprising:
acquiring a text to be identified;
inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by training in advance based on the method of one of claims 1 to 6;
determining sentences to be classified from the texts to be identified based on the target type entity;
inputting the sentence to be classified into a pre-trained entity classification model to obtain entity class information representing the class of the target type entity, wherein the entity classification model is pre-trained based on the method of one of claims 1-6.
8. The method of claim 7, wherein the obtaining text to be recognized comprises:
acquiring an original text;
and preprocessing the original text to obtain a text to be recognized which accords with a preset format.
9. The method of claim 7, wherein the determining a sentence to be classified from the text to be recognized based on the target type entity comprises:
and forming statement pairs by the statement in which the target type entity is located and the target type entity, and determining the statement pairs as statements to be classified.
10. An entity recognition model generation device, comprising:
the first acquisition module is used for acquiring a first sample sentence set, wherein sample sentences in the first sample sentence set comprise target type entities and have corresponding entity labeling information for representing the target type entities;
the first training module is used for training an initial target type entity recognition model based on the sample sentences in the first sample sentence set and the corresponding entity labeling information to obtain a target type entity recognition model;
the second acquisition module is used for acquiring a second sample sentence set, wherein the second sample sentence set comprises positive sample sentences and negative sample sentences, the positive sample sentences comprise target type entities of preset categories and have corresponding positive sample labeling information, and the negative sample sentences comprise target type entities of non-preset categories and have corresponding negative sample labeling information;
and the second training module is used for training an initial entity classification model based on the positive sample sentences and the corresponding positive sample labeling information in the second sample sentence set and the negative sample sentences and the corresponding negative sample labeling information to obtain an entity classification model.
11. An entity extraction device, comprising:
the third acquisition module is used for acquiring the text to be identified;
the recognition module is used for inputting the text to be recognized into a pre-trained target type entity recognition model to obtain a target type entity, wherein the target type entity recognition model is obtained by training in advance based on the method of one of claims 1 to 6;
the determining module is used for determining sentences to be classified from the texts to be identified based on the target type entity;
the classification module is configured to input the sentence to be classified into a pre-trained entity classification model, to obtain entity class information that characterizes a class of the target type entity, where the entity classification model is pre-trained based on the method according to one of claims 1-6.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-9 via execution of the executable instructions.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-9.
CN202110208364.9A 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device Active CN113010638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110208364.9A CN113010638B (en) 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device


Publications (2)

Publication Number Publication Date
CN113010638A CN113010638A (en) 2021-06-22
CN113010638B true CN113010638B (en) 2024-02-09

Family

ID=76385896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110208364.9A Active CN113010638B (en) 2021-02-25 2021-02-25 Entity recognition model generation method and device and entity extraction method and device

Country Status (1)

Country Link
CN (1) CN113010638B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468330B (en) * 2021-07-06 2023-04-28 北京有竹居网络技术有限公司 Information acquisition method, device, equipment and medium
CN113626592A (en) * 2021-07-08 2021-11-09 中汽创智科技有限公司 Corpus-based classification method and device, electronic equipment and storage medium
CN114254109B (en) * 2021-12-15 2023-09-19 北京金堤科技有限公司 Method and device for determining industry category
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN114611497B (en) * 2022-05-10 2022-08-16 北京世纪好未来教育科技有限公司 Training method of language diagnosis model, language diagnosis method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145153A (en) * 2018-07-02 2019-01-04 北京奇艺世纪科技有限公司 It is intended to recognition methods and the device of classification
CN110263338A (en) * 2019-06-18 2019-09-20 北京明略软件系统有限公司 Replace entity name method, apparatus, storage medium and electronic device
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN111739520A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Speech recognition model training method, speech recognition method and device
CN111813954A (en) * 2020-06-28 2020-10-23 北京邮电大学 Method and device for determining relationship between two entities in text statement and electronic equipment
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489439B2 (en) * 2016-04-14 2019-11-26 Xerox Corporation System and method for entity extraction from semi-structured text documents
TWI645303B (en) * 2016-12-21 2018-12-21 財團法人工業技術研究院 Method for verifying string, method for expanding string and method for training verification model
US20190197176A1 (en) * 2017-12-21 2019-06-27 Microsoft Technology Licensing, Llc Identifying relationships between entities using machine learning
CN109165385B (en) * 2018-08-29 2022-08-09 中国人民解放军国防科技大学 Multi-triple extraction method based on entity relationship joint extraction model
CN111950279B (en) * 2019-05-17 2023-06-23 百度在线网络技术(北京)有限公司 Entity relationship processing method, device, equipment and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145153A (en) * 2018-07-02 2019-01-04 北京奇艺世纪科技有限公司 It is intended to recognition methods and the device of classification
CN110263338A (en) * 2019-06-18 2019-09-20 北京明略软件系统有限公司 Replace entity name method, apparatus, storage medium and electronic device
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN111813954A (en) * 2020-06-28 2020-10-23 北京邮电大学 Method and device for determining relationship between two entities in text statement and electronic equipment
CN111739520A (en) * 2020-08-10 2020-10-02 腾讯科技(深圳)有限公司 Speech recognition model training method, speech recognition method and device
CN112069312A (en) * 2020-08-12 2020-12-11 中国科学院信息工程研究所 Text classification method based on entity recognition and electronic device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Modelings and techniques in Named Entity Recognition-an Information Extraction task;N. Kanya等;IET Chennai 3rd International on Sustainable Energy and Intelligent Systems (SEISCON 2012);1-5 *
Chinese Named Entity Recognition Based on BERT, ReZero and CRF; Fu Hao; Computer Programming Skills & Maintenance; 2020, (06), 12-13+28 *
A Chinese Lexical Analysis System Based on Multiple Knowledge Sources; Jiang Wei, Wang Xiaolong, Guan Yi, Zhao Jian; Chinese Journal of Computers (01); 139-147 *
Entity Recognition in Electronic Medical Records Based on Neural Networks; Zumuretiguli Kuerban; China Master's Theses Full-text Database, Medicine & Health Sciences (No. 1); E053-323 *
Research and Application of Named Entity Recognition Methods for Bidding Data; Sun Yuqiao; China Master's Theses Full-text Database, Information Science & Technology (No. 1); I138-2140 *

Also Published As

Publication number Publication date
CN113010638A (en) 2021-06-22

Similar Documents

Publication Publication Date Title
CN113010638B (en) Entity recognition model generation method and device and entity extraction method and device
CN109635103B (en) Abstract generation method and device
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN104735468B (en) A kind of method and system that image is synthesized to new video based on semantic analysis
CN112487149B (en) Text auditing method, model, equipment and storage medium
CN108334489B (en) Text core word recognition method and device
CN113961685A (en) Information extraction method and device
CN112395420A (en) Video content retrieval method and device, computer equipment and storage medium
CN103577989A (en) Method and system for information classification based on product identification
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN113360699A (en) Model training method and device, image question answering method and device
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN113469298A (en) Model training method and resource recommendation method
CN116932730B (en) Document question-answering method and related equipment based on multi-way tree and large-scale language model
CN108052686B (en) Abstract extraction method and related equipment
CN110489740B (en) Semantic analysis method and related product
CN111199151A (en) Data processing method and data processing device
CN116229313A (en) Label construction model generation method and device, electronic equipment and storage medium
CN116108181A (en) Client information processing method and device and electronic equipment
CN114218381B (en) Method, device, equipment and medium for identifying position
CN115115432A (en) Artificial intelligence based product information recommendation method and device
CN114492390A (en) Data expansion method, device, equipment and medium based on keyword recognition
CN113342932A (en) Method and device for determining target word vector, storage medium and electronic device
CN113591467B (en) Event main body recognition method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant