CN117313727A - Model training and entity recognition method - Google Patents


Info

Publication number
CN117313727A
CN117313727A (application CN202311207574.1A)
Authority
CN
China
Prior art keywords
entity
abbreviation
text
target
determining
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202311207574.1A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (assumed)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202311207574.1A
Publication of CN117313727A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a model training and entity recognition method. A reference index is determined for each entity abbreviation contained in a specified text according to the abbreviation's first frequency in the specified text and its second frequency in a general corpus; a target abbreviation corresponding to the specified text is then determined from among the entity abbreviations. The specified text is taken as a training sample, the target abbreviation as its label, and an entity recognition model is trained on the training samples and their labels. Determining the labels of training samples from the first frequency of an entity abbreviation in the specified text and its second frequency in the general corpus mines abbreviations automatically, requires no manual annotation, and reduces labor cost. It also improves labeling accuracy, and thereby the recognition accuracy of the entity recognition model.

Description

Model training and entity recognition method
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method for model training and entity identification.
Background
With the development of artificial intelligence technology, natural language processing (Natural Language Processing, NLP) has received extensive attention. Entity recognition is an important component of NLP; an entity can be a person name, an organization, or a place, and more broadly includes numbers, dates, currencies, addresses, and the like.
At present, an entity recognition model can be trained based on texts and labels of entity words in the texts, and then the texts are recognized according to the trained entity recognition model.
However, in some application scenarios the full name of an entity may contain many words. For such entities, the full name can be replaced with an abbreviation. For example, in equipment maintenance records, a user may use an equipment abbreviation in place of the full equipment name to record faster, writing "low-pressure heater inlet electric door flange oil seepage" as "low-add inlet electric door flange oil seepage", where "low-add" is the abbreviation of the full name "low-pressure heater". Because the training samples of current entity recognition models contain few labels for entity abbreviations, these models recognize text containing entity abbreviations with low accuracy.
Disclosure of Invention
The present application provides a method for model training and entity recognition to partially solve the above-mentioned problems in the prior art.
The application adopts the following technical scheme:
the application provides a model training method, which comprises the following steps:
acquiring a specified text, and determining each entity abbreviation contained in the specified text;
determining a reference index for each entity abbreviation according to its first frequency in the specified text and its second frequency in a general corpus;
determining a target abbreviation corresponding to the specified text from among the entity abbreviations according to their reference indices;
taking the specified text as a training sample, and determining the label of the training sample according to the target abbreviation corresponding to the specified text;
and training an entity recognition model according to the training sample and its label.
Optionally, determining each entity abbreviation contained in the specified text specifically includes:
obtaining a target tree, wherein the target tree is constructed from a plurality of entity full names and the reference abbreviations corresponding to each entity full name;
and matching the specified text against the target tree to obtain each entity abbreviation contained in the specified text.
Optionally, the construction of the target tree specifically includes:
determining the connection order of nodes according to the order of the characters in each entity full name and the order of the characters in each reference abbreviation, each node corresponding to a character of an entity full name or of a reference abbreviation;
determining a target node from among the nodes, and determining the parent node and the child nodes of the target node according to the connection order of the nodes;
and constructing the target tree from the target node, its parent node, and its child nodes.
Optionally, matching the specified text against the target tree to obtain each entity abbreviation contained in the specified text specifically includes:
sequentially matching each character of the specified text against the nodes of the target tree, and determining the target path hit by the specified text in the target tree;
and determining each entity abbreviation contained in the specified text according to the characters corresponding to the nodes on the target path.
Optionally, determining the target path hit by the specified text in the target tree specifically includes:
sequentially matching each character of the specified text against the nodes of the target tree, and determining each path hit by the specified text in the target tree;
determining the number of nodes contained in each hit path;
and determining the target path from among the paths whose node count exceeds a preset threshold.
Optionally, determining the reference index of each entity abbreviation according to its first frequency in the specified text and its second frequency in the general corpus specifically includes:
determining a first weight and at least one second weight for each entity abbreviation, the first weight corresponding to the abbreviation's first frequency in the specified text and each second weight corresponding to its second frequency in the general corpus;
and, according to the first weight and the second weight, computing a weighted sum of the abbreviation's first frequency in the specified text and its second frequency in the general corpus to obtain the abbreviation's reference index.
The application provides an entity identification method, which comprises the following steps:
acquiring an entity recognition request, the entity recognition request corresponding to a text to be recognized;
obtaining the entity abbreviations corresponding to the text to be recognized using an entity recognition model, the entity recognition model being trained by the model training method described above.
Optionally, the method further comprises:
in response to the entity recognition request, matching the text to be recognized against a correspondence, the correspondence representing the relation between entity full names in specified texts and their target abbreviations;
in response to a successful match, taking the matched target abbreviation as an entity abbreviation contained in the text to be recognized;
the obtaining of the entity abbreviations corresponding to the text to be recognized using the entity recognition model includes:
in response to a failed match, inputting the text to be recognized into the trained entity recognition model to obtain the entity abbreviations, output by the model, that are contained in the text.
The application provides a model training device, include:
a first acquisition module, configured to acquire a specified text and determine each entity abbreviation contained in the specified text;
a reference index determining module, configured to determine the reference index of each entity abbreviation according to its first frequency in the specified text and its second frequency in a general corpus;
a target abbreviation determining module, configured to determine the target abbreviation corresponding to the specified text from among the entity abbreviations according to their reference indices;
a label determining module, configured to take the specified text as a training sample and determine the label of the training sample according to the target abbreviation corresponding to the specified text;
and a training module, configured to train an entity recognition model according to the training sample and its label.
The application provides an entity identification device, comprising:
a request acquisition module, configured to acquire an entity recognition request, the request corresponding to a text to be recognized;
and an entity abbreviation determining module, configured to obtain the entity abbreviations corresponding to the text to be recognized using an entity recognition model, the model being trained by the model training method described above.
The present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the model training and entity recognition method described above.
The application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the model training and entity identification method when executing the program.
At least one of the technical solutions adopted in the present application can achieve the following beneficial effects:
according to the model training and entity recognition method, according to the first frequency of entity abbreviations contained in the appointed text and the second frequency of the entity abbreviations in the general corpus, the reference index of each entity abbreviation is determined, according to the reference index of each entity abbreviation, the target abbreviation corresponding to the appointed text is determined from each entity abbreviation, further the appointed text is used as a training sample, the label of the training sample is determined according to the target abbreviation of the appointed text, and the entity recognition model is trained based on the training sample and the label thereof. The purpose of automatically excavating the abbreviations can be achieved through the scheme, manual labeling is not needed, and labor cost is reduced. And the labeling mode of the training sample is determined based on the first frequency of the entity abbreviation in the appointed text and the second frequency of the entity abbreviation in the general corpus, so that the labeling accuracy can be improved, and the recognition accuracy of the entity recognition model is further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic flow chart of a model training method in the present application;
FIG. 2 is a schematic flow chart of a model training method in the present application;
FIG. 3 is a schematic diagram of a target tree in the present application;
FIG. 4 is a flow chart of a model training method of the present application;
FIG. 5 is a flow chart of a model training method of the present application;
FIG. 6 is a flow chart of an entity identification method in the present application;
FIG. 7 is a schematic diagram of a model training apparatus provided herein;
FIG. 8 is a schematic diagram of an entity identification device provided in the present application;
fig. 9 is a schematic diagram of an electronic device corresponding to fig. 1 or fig. 6 provided in the present application.
Detailed Description
To clarify the purposes, technical solutions, and advantages of the present application, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without creative effort fall within the scope of protection of the present disclosure.
In addition, it should be noted that all actions for acquiring signals, information or data are performed under the condition of conforming to the corresponding data protection rule policy of the location and obtaining the authorization given by the owner of the corresponding device.
With the development of artificial intelligence technology, the NLP field is receiving a great deal of attention. Among them, entity identification (Named Entity Recognition, NER) is widely studied as a fundamental and important task in NLP. Currently, entity recognition mainly adopts a method based on rules and dictionaries, and a method based on machine learning.
The rule-and-dictionary approach typically uses expert experience to construct rule templates, often in the form of regular expressions; the templates are matched against the text to be recognized, and the words that hit a template are taken as entity words. The machine-learning approach takes text containing entity words as training samples and labels the entity words in the text to obtain sample labels; the samples are input into the entity recognition model to be trained, the model outputs an identity for each word in the sample, and the model is trained with the objective of minimizing the difference between the model's outputs and the sample labels.
Both schemes have drawbacks. The rule-and-dictionary approach often depends on the specific language, field, and text style: formulating rule templates is time-consuming, the templates can hardly cover the language of every field, portability across fields is poor, a single template does not suit entity recognition scenarios in different fields, templates must be built one by one from expert experience, construction cycles are long, and recall is low. In the machine-learning approach, model accuracy depends on the quantity and accuracy of training-sample labels; but in specialized fields labeled data is scarce and the entity words are field-specific, so automatic labeling has low accuracy and only manual labeling can be relied on, which consumes labor and is inefficient.
Further, in some application scenarios the full name of an entity may contain many words, and such full names are replaced by abbreviations. For example, in equipment maintenance records, a user may use an equipment abbreviation in place of the full equipment name to record faster, writing "low-pressure heater inlet electric door flange oil seepage" as "low-add inlet electric door flange oil seepage", where "low-add" is the abbreviation of the full name "low-pressure heater". For text containing entity abbreviations, constructing rule templates and labeling training samples are both harder, so rule templates and sample labels are both scarce.
On this basis, the application provides a model training and entity recognition method in which the labels of training samples are determined from the first frequency of an entity abbreviation in the specified text and its second frequency in a general corpus. This mines abbreviations automatically, needs no manual annotation, and safeguards the recognition accuracy of the entity recognition model.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a model training method provided in the present application.
S100: acquiring a specified text, and determining each entity abbreviation contained in the specified text.
The embodiment of the application provides a model training method, which can be executed by an electronic device such as a server used for model training. After training of the entity recognition model is completed, the electronic device that executes the entity recognition method provided herein based on the trained model may be the same as or different from the device that executes the model training method; the application does not limit this.
In practical application, recognizing entity words in text can avoid word segmentation errors and assist semantic analysis. Entity words are items in text with a specific meaning or strong referential force, typically including person names, place names, organization names, dates and times, proper nouns, and the like. Entity recognition is the extraction of entities of these types from unstructured input text. Further entity types, such as product names, models, and prices, can also be recognized according to business requirements.
Entity recognition also has a special application scenario. When the full name of an entity contains many words, users tend to express the entity with its abbreviation, to record or communicate more efficiently. An abbreviation is a word obtained by shortening and omitting parts of a longer word; in linguistics it is a fully simplified form of expression, also called a "shortened form". Because abbreviations are concise and refined, they are used heavily in daily life.
When entity abbreviations appear in text, a rule-and-dictionary method requires building rule templates for the language field in which the abbreviations occur. That method depends entirely on the number of entity abbreviations contained in the dictionary, and the rule templates are built manually, which is difficult and labor-intensive. With a machine-learning method, the training samples of current entity recognition models contain few labels for entity abbreviations, so the models recognize text containing abbreviations with low accuracy.
In view of this, the embodiment of the application automatically acquires relatively accurate labels for entity abbreviations, which lowers the difficulty of obtaining the labels required by the entity recognition model while improving label accuracy.
To this end, a specified text is first acquired in this step. The specified text may be unstructured and may contain multiple words and characters. It may come from a specific field or from a general corpus; the application does not limit this. However, so that the trained entity recognition model can recognize entity abbreviations in field-specific text, the application describes the technical scheme using field-specific text as an example. A specified text may contain only abbreviations, only full names without abbreviations, both, or neither; the application does not limit this, and multiple expressions of the same entity (its full name or differently abbreviated forms) need not appear in the same specified text.
Next, each entity abbreviation contained in the specified text is determined. In practice, when an entity's full name contains many words, an entity abbreviation can represent the corresponding entity, and one full name may correspond to several abbreviations of different types: for the full name "low-pressure heater", for example, the abbreviations may take forms such as "low-pressure", "low-pressure heating", or "pressure heating". The entity abbreviations contained in the specified text may therefore be determined one by one, either by manual labeling or by matching against a pre-constructed target tree containing the entity abbreviations; the application does not limit this.
In addition, even if the entity abbreviations in the specified text are obtained by manual labeling in this step, the labeling need not be highly accurate, which lowers the skill required of the labeling personnel. This is because the subsequent steps S102 to S104 evaluate how reliably and accurately each entity abbreviation represents its entity, producing a reference index, and then filter the abbreviations by that index. So even with manual labeling here, the method consumes less labor and labels more efficiently than today's practice of directly obtaining accurate entity-word labels by manual annotation.
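The target-tree matching mentioned above can be pictured as a character trie lookup. The following is a simplified sketch, not the patent's actual construction (the claims build the tree node by node from full names and reference abbreviations); the terms used are hypothetical English stand-ins for the Chinese examples.

```python
# Simplified sketch of the "target tree": a character trie built from
# entity full names and candidate abbreviations, matched against a
# specified text to collect the abbreviations it contains.
# The terms below are hypothetical stand-ins, not from the patent.

class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.word = None     # term ending at this node, if any

def build_target_tree(terms):
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.word = term
    return root

def match_text(root, text):
    """At each start position, walk the trie as far as it matches."""
    hits = []
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.get(ch)
            if node is None:
                break
            if node.word is not None:
                hits.append(node.word)
    return hits

tree = build_target_tree(["lph", "low pressure heater"])
print(match_text(tree, "lph inlet valve"))  # ['lph']
```

A real implementation would also record, per node, which full name each abbreviation refers to, so that a hit path identifies both the abbreviation and its entity.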
S102: determining a reference index for each entity abbreviation according to its first frequency in the specified text and its second frequency in the general corpus.
Step S100 determines each entity abbreviation contained in the specified text. In practice, however, a user who abbreviates an entity uses the abbreviation that fits the user's own expression habits, so when different users abbreviate differently, the same entity can correspond to several differently abbreviated forms, even when the specified texts all come from the same language field.
Therefore, this step determines a reference index for each entity abbreviation, indicating how reliably and accurately the abbreviation characterizes its corresponding entity.
The reliability and accuracy of an entity abbreviation is the probability that the abbreviation refers to its corresponding entity. The more users use a given abbreviation, the higher the probability that it refers to the corresponding entity, and the more reliable and accurate the abbreviation is. The embodiment of the application therefore determines the reliability and accuracy of each entity abbreviation from how frequently it occurs in text.
Specifically, the first frequency of each entity abbreviation in the specified text and its second frequency in the general corpus are determined. The first frequency characterizes the abbreviation's reliability and accuracy in the specific language field of the specified text; the second frequency characterizes them in the general language field. That is, the method determines the probability of each entity abbreviation being used in the specific field and in the general field.
Further, the reference index of each entity abbreviation is determined from its first and second frequencies. The reference index is positively correlated with both: the larger the first frequency and the second frequency, the larger the reference index.
Optionally, the reference index of each entity abbreviation may be a weighted sum of its first and second frequencies; the weights may come from prior experience or be preset manually, which the application does not limit. For example, if the abbreviation "low-add" occurs 5 times in the specified texts and 3 times in the general corpus, its first frequency is 5 and its second frequency is 3; with both weights set to 0.5, the reference index of "low-add" is 4.
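The weighted sum in the worked example above can be sketched directly; the weight values are the assumed 0.5/0.5 from the example, not fixed by the patent.

```python
# Reference index as a weighted sum of the two frequencies, mirroring
# the "low-add" example: first frequency 5, second frequency 3,
# both weights 0.5 (assumed values, not mandated by the patent).

def reference_index(first_freq, second_freq, w1=0.5, w2=0.5):
    return w1 * first_freq + w2 * second_freq

print(reference_index(5, 3))  # 4.0
```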
Optionally, when determining an entity abbreviation's first frequency, the specified texts can be searched with the abbreviation as the query to recall the specified texts containing it; the number of recalled specified texts then determines the first frequency. Correspondingly, for the second frequency, the general corpus is searched with the abbreviation to recall the corpus texts containing it, and the number of recalled corpus texts determines the second frequency.
S104: determining the target abbreviation corresponding to the specified text from among the entity abbreviations according to their reference indices.
In practical application, because users' language habits differ, or because an abbreviation is ambiguous or abbreviated in different ways, the differently abbreviated forms of the same entity have different probabilities of being used in text, and hence different reference indices. Generally, the larger the reference index, the greater the probability the abbreviation is used in text, that is, the more users use it, and the more reliably and accurately it refers to its entity. Therefore, for each specified text, several entity abbreviations with the largest reference indices among those contained in the text are taken as the target abbreviations corresponding to the text. There may be one target abbreviation or several; the application does not limit this.
In addition, a specified text may contain no entity abbreviation at all, in which case it has no corresponding target abbreviation.
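The selection in step S104 amounts to ranking by reference index. A minimal sketch, where `top_k` is an assumed parameter (the patent leaves the number of target abbreviations open):

```python
# Hypothetical selection step: keep the abbreviations with the largest
# reference indices as the target abbreviations for a specified text.
# top_k is an assumed parameter; the patent does not fix the count.

def select_target_abbreviations(ref_indices, top_k=1):
    ranked = sorted(ref_indices.items(), key=lambda kv: kv[1], reverse=True)
    return [abbr for abbr, _ in ranked[:top_k]]

print(select_target_abbreviations({"low-add": 4.0, "press-heat": 1.5}))
# ['low-add']
```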
S106: taking the specified text as a training sample, and determining the label of the training sample according to the target abbreviation corresponding to the specified text.

Further, the method adopts machine learning: the specified text serves as a training sample, and the label of the training sample is determined from the target abbreviation corresponding to that text. Specifically, any existing entity labeling scheme may be used to derive the label from the target abbreviation, such as sequence labeling with the BIO three-tag, BMES four-tag, or BIOES five-tag schemes, which is not limited in this application.

For example, for the specified text "low-add steam-inlet electric door flange oil seepage", where "low-add" is the target abbreviation of the specified text, the label of the specified text may be "BEOOOOOOOOO": "B" and "E" mark the characters of the abbreviation, and "O" marks each remaining character.
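Label generation like the example above can be sketched with the BIO scheme (BMES/BIOES labelling is analogous); the function name is illustrative, not from the patent.

```python
def bio_labels(text, target_abbrs):
    """Character-level BIO labelling: B at an abbreviation start, I inside, O elsewhere."""
    labels = ["O"] * len(text)
    for abbr in target_abbrs:
        start = text.find(abbr)
        while start != -1:
            labels[start] = "B"
            for i in range(start + 1, start + len(abbr)):
                labels[i] = "I"
            start = text.find(abbr, start + len(abbr))  # find later occurrences too
    return labels

# "lp" is the target abbreviation of this toy specified text.
sample = bio_labels("lp leak", ["lp"])   # ["B", "I", "O", "O", "O", "O", "O"]
```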
S108: training an entity recognition model according to the training samples and their labels.

Specifically, the entity recognition model is trained in a supervised manner: a training sample is input into the entity recognition model to be trained to obtain the predicted label output by the model for that sample, and the model is trained with the objective of minimizing the difference between the predicted label and the label of the training sample. The model structure may be any existing machine learning structure applicable to entity recognition tasks, such as Lattice LSTM, CAN-NER, or FLAT, which is not limited in this application.
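The neural architectures named above are out of scope for a short sketch, so the toy "model" below — a most-frequent-tag-per-character lookup — stands in only to show the supervised pairing of training samples with their labels; it is an assumption for illustration, not the patent's model.

```python
from collections import Counter, defaultdict

def train_tagger(samples):
    """Toy supervised 'training': learn the most frequent gold label per character.
    A real implementation would instead fit e.g. a Lattice-LSTM by minimising
    the difference between predicted labels and the samples' labels."""
    counts = defaultdict(Counter)
    for text, labels in samples:
        for ch, lab in zip(text, labels):
            counts[ch][lab] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}

def predict(model, text):
    """Emit the learned label per character, 'O' for unseen characters."""
    return [model.get(ch, "O") for ch in text]

model = train_tagger([("lp ok", ["B", "I", "O", "O", "O"])])
```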
According to the model training method provided in this specification, the reference index of each entity abbreviation is determined from its first frequency in the specified texts and its second frequency in the general corpus; the target abbreviation corresponding to a specified text is then determined from the entity abbreviations according to their reference indexes; the specified text is taken as a training sample whose label is determined by its target abbreviation; and the entity recognition model is trained on the training samples and their labels. This scheme mines abbreviations automatically, requires no manual labeling, and thus reduces labor cost. Moreover, because the labeling of training samples is grounded in the first frequency in the specified texts and the second frequency in the general corpus, labeling accuracy improves, and with it the recognition accuracy of the entity recognition model.

In one or more embodiments of the present application, when determining the entity abbreviations contained in the specified text in step S100 of fig. 1, besides extracting them by manual labeling, the entity abbreviations may also be extracted by matching against a target tree, as shown in fig. 2:
S200: obtaining a target tree, where the target tree is constructed based on a plurality of entity full terms and the reference abbreviations corresponding to each entity full term.
Specifically, when extracting entity abbreviations from the specified text as in step S100 of fig. 1, the specified text may belong to a specific language domain, so the entity abbreviations it contains may be uncommon in the general language domain, raising the bar for annotators. Manual labeling may also miss abbreviations or introduce errors. To reduce labor and improve the accuracy of the entity abbreviations extracted from the specified text, the entity abbreviations contained in the specified text can be obtained by target-tree matching, where the target tree is constructed from known entity abbreviations, and known entity abbreviations can be enumerated from known entity full terms.

On this basis, the target tree obtained in this step may be constructed from a plurality of entity full terms and the reference abbreviations corresponding to each of them. The entity full terms and the specified text belong to the same specific language domain; of course, the entity full terms may also come from a general corpus, which is not limited in this application.

Optionally, in the embodiments of the present application, at least one entity abbreviation is obtained by abbreviating an entity full term. In the Chinese domain, an abbreviation generally needs two or more characters to refer unambiguously to an entity, and a full term generally has more characters than its abbreviation. Therefore, an entity abbreviation may contain two or more characters, and an entity full term more characters than its abbreviation, such as three or more; the specific character counts of entity full terms and entity abbreviations are not limited in this application.
Further, because the constructed target tree is matched against the specified text to extract the entity abbreviations it contains, the target tree can be built from the abbreviations corresponding to the entity full terms. Thus, before constructing the target tree, the reference abbreviations corresponding to each entity full term need to be determined from the full terms.

The reference abbreviations of an entity full term may be determined by incrementally enumerating subsets of its characters, or by manual exhaustive enumeration, which is not limited in this application.
For example, for the entity full term "low-pressure heater", its reference abbreviations may include "low-pressure", "low-heat", "pressure-heat", "low-pressure-heat", and the like.
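The enumeration-subset approach mentioned above can be sketched as follows: every in-order character subset of the full term that is shorter than the term itself becomes a candidate reference abbreviation. The function name and `min_len` parameter are illustrative assumptions.

```python
from itertools import combinations

def reference_abbreviations(full_term, min_len=2):
    """Enumerate candidate abbreviations as in-order character subsets of the
    full term, each shorter than the full term itself."""
    n = len(full_term)
    abbrs = set()
    for size in range(min_len, n):
        for idx in combinations(range(n), size):
            abbrs.add("".join(full_term[i] for i in idx))
    return abbrs

candidates = reference_abbreviations("abc")   # {"ab", "ac", "bc"}
```

For a Chinese full term such as 低压加热器 the same enumeration would produce candidates like 低压 and 低热, mirroring the example in the text.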
It should be noted that the reference abbreviations determined in this step are not necessarily common or exact abbreviation forms of the entity full term. For example, the reference abbreviation "low-pressure" may abbreviate "low-pressure heater" or another entity full term such as "low-pressure cylinder". The same reference abbreviation may therefore correspond to different entity full terms, so an individual reference abbreviation obtained in this step may refer to a given entity full term with limited reliability and accuracy.

Optionally, after the reference abbreviations of each entity full term are determined, a correspondence between each entity full term and its reference abbreviations may be constructed and stored.
Specifically, the target tree is composed of a plurality of nodes with connection relationships between them, extending downward from the root node until reaching the leaf nodes. Generally, for each node in the target tree, the adjacent node ordered before it is its parent node, and an adjacent node ordered after it is a child node. Thus, in a target tree constructed from entity full terms and reference abbreviations, a node may correspond to one character or to a word of several characters; for each node, the character or word of its parent node precedes the node's own character or word within the entity full terms and/or reference abbreviations, and the character or word of each child node correspondingly follows it.

In one or more embodiments of the present application, the tree structure of the target tree may be chosen for the specific application scenario and may be any existing structure such as a binary tree, a multi-way tree, or a threaded tree, which is not limited in this application.

Optionally, the target tree may be a threaded tree: on top of a binary or multi-way tree, each node carries a thread, i.e. pointers to the node's predecessor and successor. On this basis, for each node of the target tree, the node's thread is determined from the parent and child nodes connected to it and is stored, so that when the target tree is later matched against a specified text, the path hit by the specified text can be determined from the threads of the nodes.
In addition, in one or more embodiments of the present application, the number of nodes included in each level of the target tree and the depth of the target tree are not limited.
For example, still taking the above "low-pressure heater" as an example, the target tree constructed based on the entity full term and its reference acronyms may be as shown in fig. 3, where each node corresponds to a word in fig. 3, and the "low" node is taken as a root node, and extends downward in sequence according to the arrangement order of the words included in the entity full term and the reference acronyms until reaching the last word.
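A target tree like the one in fig. 3 is essentially a character trie over the full terms and their reference abbreviations; a minimal sketch follows, with illustrative names ("lph" stands in for a full term, "lp"/"lh" for its reference abbreviations) and a `"$"` key as an assumed end-of-word marker.

```python
def build_target_tree(full_terms):
    """Build a character trie over each entity full term and its reference
    abbreviations; `full_terms` maps a full term to its abbreviation set.
    A '$' key marks a node where a complete term or abbreviation ends."""
    root = {}
    for full_term, abbrs in full_terms.items():
        for word in [full_term, *abbrs]:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = word
    return root

tree = build_target_tree({"lph": {"lp", "lh"}})
```

Shared prefixes collapse into shared branches, matching the figure's description of a single root extending downward through the characters of the full term and its abbreviations.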
S202: matching the specified text against the target tree to obtain the entity abbreviations contained in the specified text.

When matching the target tree against the specified text, if the characters of the specified text hit a path in the target tree, the entity full term or entity abbreviation contained in the specified text can be determined from the characters or words of the nodes on the hit path. The matching may be based on an AC automaton.

For example, with the target tree of fig. 3, when the specified text is "low-add steam-inlet electric door flange oil seepage", each character of the specified text is matched in turn against the root node of the target tree. When a character matching the root node is found, a path is searched on the target tree; with the tree of fig. 3, the path hit by the specified text runs from the root node to the node of the "add" character, and the entity abbreviation contained in the specified text is then determined to be "low-add".
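The matching step can be sketched with a naive trie walk from every start position — a stand-in for the AC-automaton matching mentioned above (an AC automaton adds failure links so the text is scanned only once). Names and the toy vocabulary are illustrative.

```python
def match_abbreviations(text, vocabulary):
    """Collect every occurrence in `text` of any word in `vocabulary`
    (full terms plus reference abbreviations) by walking a character trie."""
    root = {}
    for word in vocabulary:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    hits = []
    for i in range(len(text)):          # try a trie walk from each position
        node = root
        for ch in text[i:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:             # a complete term/abbreviation ends here
                hits.append(node["$"])
    return hits
```

Note that nested entities yield multiple hits from one start position (e.g. both "lp" and "lph" below), which is exactly the multi-path situation the later steps resolve.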
Based on the method shown in fig. 2, the matching is performed according to the specified text and the target tree, and the entity abbreviation is automatically extracted from the specified text. Through the scheme, the purpose of automatically determining entity abbreviations in the appointed text can be achieved, manual labeling is not needed, and labor consumption is reduced.
In one or more embodiments of the present application, the target tree of step S200 in fig. 2, constructed from a plurality of entity full terms and their reference abbreviations, may contain a plurality of nodes, each corresponding to a character of an entity full term or reference abbreviation. On this basis, the target tree may be constructed as follows, as shown in fig. 4:
S300: determining the connection order of the nodes according to the arrangement order of the characters contained in the entity full terms and the arrangement order of the characters contained in the reference abbreviations, where the nodes are determined by those characters.

The target tree contains a plurality of nodes, and different nodes may correspond to different characters, so different paths in the target tree correspond to terms composed of different characters. To construct the target tree, each character of each entity full term and each character of each reference abbreviation can be taken as a node.
Specifically, for the entity full term or the reference acronym, there is an arrangement sequence among the words that form the term, that is, the words need to be arranged in a certain arrangement sequence in order to be combined into the entity full term or the reference acronym. Therefore, in the target tree, the connection order of the nodes can be determined according to the arrangement order of the words contained in the full-term words of the entities and the arrangement order of the words contained in the reference abbreviations.
For each character of an entity full term, the character adjacent to it and arranged before it in the full term's character order is determined as its previous character, and the character adjacent to it and arranged after it is determined as its next character. The same applies to the characters of each reference abbreviation.

Since each character of the entity full terms and of the reference abbreviations serves as a node in step S300, the connection order between the nodes can be determined from these adjacency relationships between the characters.
S302: and determining a target node from the nodes, and determining a father node of the target node and a child node of the target node according to the connection sequence of the nodes.
In this step, a target node is determined from the nodes, where the target node may be any one or more of the nodes, or may refer to each of the nodes. In practical application, the parent node and the child node of each node can be determined in turn according to the connection sequence of each node.
Specifically, for the target node, the parent node connected to it is determined from the previous character of the target node's character within the entity full terms and reference abbreviations, and the child nodes connected to it are determined from the next character of the target node's character within the entity full terms and reference abbreviations.
For a target node in each node, there may be multiple parent nodes and/or multiple child nodes for the target node, which is not limited in this application.
S304: and constructing a target tree according to the target node, the father node of the target node and the child node of the target node.
Specifically, a parent node of the target node, the target node and a child node of the target node are sequentially connected to form at least one branch of the target tree. Traversing a plurality of target nodes, forming a plurality of branches according to each target node and father nodes and child nodes thereof, and constructing a target tree according to each branch.
When a node has no parent node, its character is a first character within the entity full terms and reference abbreviations, and the node can serve as the root node of the target tree. When a node has no child node, its character is a final character within the entity full terms and reference abbreviations, and the node serves as a leaf node of the target tree.
In an optional embodiment of the present application, in step S202 in fig. 2, the specified text and the target tree are matched to obtain each entity abbreviation included in the specified text, and since the node included in the target tree corresponds to the word included in the entity full term and the reference abbreviation, the entity abbreviation included in the specified text may be determined according to the node hit by the specified text on the target tree, which specifically includes the following scheme:
the first step: and sequentially matching each word contained in the specified text with each node of the target tree, and determining a target path hit by the specified text in the target tree.
In the embodiments of the present application, the specified text may be a single text containing words formed of several characters. On this basis, when matching the specified text against the target tree, each character of the specified text can be matched in turn against the nodes of the target tree. Since each node of the target tree may correspond to a single character or to a word, and the terms formed by the tree's nodes correspond to the entity full terms and reference abbreviations of steps S200 and S202 in fig. 2, a match between the characters or words of the specified text and those of the tree's nodes indicates that the specified text contains an entity abbreviation.
Specifically, when a word or a word included in the specified text can be matched with a word or a word corresponding to a node of the target tree, a plurality of words included in the specified text hit a plurality of nodes in the target tree, the plurality of nodes in the target tree can form a path, and a path formed by connecting the plurality of nodes hit by the specified text on the target tree is used as a target path hit by the specified text in the target tree. Of course, the specified text may further include a plurality of entity abbreviations, and when the specified text matches with the target tree, a plurality of target paths in the target tree may be hit.
Second: determining the entity abbreviations contained in the specified text according to the characters corresponding to the nodes on the target path.

By connecting the characters of the nodes on the target path in the connection order the path indicates, the entity abbreviations contained in the specified text can be determined.
In an alternative embodiment of the present application, when each word included in the specified text is matched with each node of the target tree in the first step, there may be a case that the specified text hits multiple paths on the target tree, which may be caused by that a plurality of entity words that are not nested with each other are included in the specified text, or may be caused by that entity nesting exists in the entity words included in the specified text. Based on the above, a target path for which the specified text hits in the target tree may be determined based on the following steps:
first, each word included in the specified text is matched with each node of the target tree in sequence, and each path hit by the specified text in the target tree is determined.
In this step, each path of the target tree hit by the specified text is determined, which is similar to the first step of the above step, and will not be described herein.
Next, the number of nodes contained in each path hit by the specified text in the target tree is determined.
When the specified text contains a plurality of entity words which are not nested with each other, or the specified text contains entity words which are nested with each other, the specified text hits a plurality of paths in the target tree, and the number of nodes contained in each path may be different, or may be the same. Specifically, the number of nodes included in each of the paths on the target tree hit by the plurality of entity words that are not nested may be the same or may be different; for entity words in which entity nesting exists, the number of nodes included in each of the paths is generally different.
And then, determining a target path hit by the specified text in the target tree according to the paths with the node number larger than a preset threshold value.
In this step, the paths whose node count exceeds a preset threshold are taken as the target paths hit by the specified text, so that in the subsequent step the contained entity abbreviations are determined from the characters or words of the nodes on those target paths. Keeping the paths with more nodes avoids extracting abbreviations with too few characters: an abbreviation that is too short may fail to represent the meaning of the entity full term and cause ambiguity. Even when the exact abbreviation of some full term happens to contain few characters, retaining the entity abbreviation with more characters is generally more accurate than retaining the one with fewer.
The preset threshold may be set manually in advance, or may be determined based on a statistical value of the number of nodes included in each path of the specified text hit, which is not limited in this application.
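The threshold filtering just described can be sketched as below, assuming each node holds one character so that a path's node count equals its term's character count; the names and threshold value are illustrative.

```python
def keep_long_hits(hits, threshold=1):
    """Keep only hit paths whose node (character) count exceeds the preset
    threshold, preferring longer abbreviations as argued above."""
    return [h for h in hits if len(h) > threshold]

# Nested hits "lp"/"lph" plus a one-character hit: the threshold drops "l".
filtered = keep_long_hits(["lp", "lph", "l"], threshold=1)   # ["lp", "lph"]
```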
In an optional embodiment of the present application, before the reference index of each entity abbreviation is determined in step S102 of fig. 1, it is further verified that none of the entity abbreviations determined in step S100 for the specified text is itself an entity full term.

Specifically, because the target tree may be constructed from entity full terms as well as their reference abbreviations, a path on the target tree may correspond to a full term. A word extracted by matching the target tree against the specified text may therefore be either an entity full term or an entity abbreviation, and the full terms need to be filtered out.
In one or more embodiments of the present application, when the reference index of each entity abbreviation is determined from its first frequency in the specified texts and its second frequency in the general corpus as in step S102 of fig. 1, the first and second frequencies may contribute differently to the reference index. On this basis, step S102 may further be implemented as follows, as shown in fig. 5:

S400: determining a first weight and at least one second weight for each entity abbreviation, where the first weight is the weight corresponding to the entity abbreviation's first frequency in the corresponding specified texts, and each second weight is the weight corresponding to its second frequency in a corresponding general corpus.
Specifically, according to the above step S102, the first frequency of the entity abbreviation in the specified text may represent the reliability and accuracy of the entity abbreviation in the specific language domain corresponding to the specified text, and the second frequency of the entity abbreviation in the generic corpus may represent the reliability and accuracy of the entity abbreviation in the generic language domain. The reference index of an entity abbreviation is used to indicate the probability that the entity abbreviation refers to the corresponding entity. Therefore, the reference index of the entity abbreviation may be determined according to the first frequency and the second frequency, and when the reference index is determined based on the first frequency and the second frequency, the contribution degree of the first frequency to the determined reference index and the contribution degree of the second frequency to the determined reference index may be different, i.e. the first frequency and the second frequency respectively correspond to different weights.
In practical applications, several general corpora with different emphases can be obtained. The texts they contain are all general-language texts, but the abbreviated forms of entity terms may differ across corpora with different emphases. Determining second frequencies over several different general corpora can therefore further improve the accuracy of the probability that the determined entity abbreviation refers to its entity.

Thus, in the present application there may be one or more general corpora. With a single general corpus, one second weight is determined for the second frequency of the entity abbreviation in that corpus. With multiple general corpora, the second frequencies of an entity abbreviation are the frequencies of that same abbreviation in each of the different corpora, and a second weight is determined for each of them. In this step, then, a first weight corresponding to the entity abbreviation's first frequency in the specified texts is determined, along with at least one second weight, each corresponding to its second frequency in a general corpus. The magnitudes of the first weight and of each second weight can be set for the specific application scenario, which is not limited in this application.

In practice, if the probability that the entity abbreviation refers to the corresponding entity depends more on how frequently the abbreviation is used in the specified texts of the specific language domain, the first weight may be set higher than the second weights. Correspondingly, if that probability depends more on how frequently the abbreviation is used in general texts of the general language domain, the first weight may be set lower than the second weights. And if the two dependences are comparable, the first weight may be set equal to the second weights.
Alternatively, the first and second weights may be determined as follows. A first reference abbreviation and a second reference abbreviation of a reference entity full term are obtained in advance, together with the known ordering between the reference index with which the first abbreviation refers to that full term and the reference index with which the second does; this ordering serves as the reference relation. The first and second weights are initialized. The first frequency of the first reference abbreviation in the first corpus and its second frequency in the second corpus are determined and weighted-summed with the initialized weights to obtain the prediction index of the first reference abbreviation; the first and second frequencies of the second reference abbreviation are determined and weighted-summed in the same way to obtain its prediction index. The weights are then optimized so as to minimize the difference between the predicted ordering of the two prediction indexes and the reference relation, yielding the optimized first and second weights.
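The weight optimization described above can be sketched, under simplifying assumptions, as a coarse grid search over a single first weight (with the second weight fixed to its complement so the weights sum to 1); the function name and the toy frequency pairs are illustrative, not the patent's procedure.

```python
def fit_first_weight(freqs_a, freqs_b, a_greater_than_b):
    """Find a first weight w (second weight = 1 - w) whose weighted sums
    reproduce the known ordering of two reference abbreviations.
    freqs_* are (first_frequency, second_frequency) pairs."""
    for step in range(101):
        w = step / 100
        pred_a = w * freqs_a[0] + (1 - w) * freqs_a[1]
        pred_b = w * freqs_b[0] + (1 - w) * freqs_b[1]
        if (pred_a > pred_b) == a_greater_than_b:
            return w                    # predicted ordering matches the reference relation
    return None

# Abbreviation A dominates in the specified texts, B in the general corpus;
# the reference relation says A's index should be the larger one.
w = fit_first_weight((0.9, 0.1), (0.1, 0.9), True)   # 0.51
```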
S402: performing, according to the first weight and the second weights, a weighted summation of the entity abbreviation's first frequency in the specified texts and its second frequencies in the general corpora, to obtain the reference index of the entity abbreviation.
Let W n denote the reference index of the entity abbreviation, S n its first frequency in the corresponding specified texts, and T ni its second frequency in the i-th general corpus. In the present application, the reference index may optionally be determined as:

W n =aS n +bT n1 +cT n2 +…+mT nm

where a is the first weight corresponding to the first frequency, and b, c, …, m are the second weights corresponding to the respective second frequencies. The subscript i indexes the general corpora and ranges over [1, m], and a+b+c+…+m=1.
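The weighted sum above translates directly into code; the function name and the toy values are illustrative.

```python
def reference_index(first_freq, second_freqs, first_weight, second_weights):
    """W_n = a*S_n + b*T_n1 + c*T_n2 + ..., with all weights summing to 1."""
    total = first_weight + sum(second_weights)
    assert abs(total - 1.0) < 1e-9, "weights must sum to 1"
    return first_weight * first_freq + sum(
        w * t for w, t in zip(second_weights, second_freqs))

# S_n = 0.5; T_n1 = 0.2, T_n2 = 0.1; a = 0.6, (b, c) = (0.3, 0.1).
w_n = reference_index(0.5, [0.2, 0.1], 0.6, [0.3, 0.1])   # ≈ 0.37
```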
In this specification, fig. 6 is a schematic flow chart of an entity identification method provided in the present application. The entity recognition model used for executing the entity recognition method may be trained based on the training method shown in any of the diagrams in fig. 1 to 5. The following describes in detail the technical scheme of the entity identification method provided in the present application with reference to the accompanying drawings.
S500: acquiring an entity identification request; wherein the entity recognition request corresponds to text to be recognized.
After obtaining the trained entity recognition model based on the steps S100 to S108, the trained entity recognition model may be deployed in an electronic device executing the entity recognition method, and when the electronic device receives the entity recognition request, a text to be recognized may be obtained from the entity recognition request, where the text to be recognized is a text that is required to be recognized by the entity recognition request. The present specification does not limit whether the electronic device performing the entity recognition method and the electronic device performing the model training method are the same.
The text to be recognized may belong to the same language domain as the specified texts, or may be general-language text; its language domain, character count, and whether it actually contains an entity abbreviation are not limited.
S502: obtaining entity abbreviations corresponding to the text to be identified by using an entity identification model; the entity recognition model is obtained based on a model training method.
The text to be recognized is input into the trained entity recognition model, the predicted label of each word contained in the text to be recognized is determined from the model output, and the entity abbreviations contained in the text to be recognized are determined based on these predicted labels.
In one or more embodiments of the present application, the above step S502 may be further implemented according to the following scheme:
firstly, in response to the entity recognition request, the text to be recognized is matched against the correspondence, where the correspondence is used to represent the relation between an entity full term of the specified text and the corresponding target abbreviation.
After one or more target abbreviations corresponding to the specified text are determined in step S104 of fig. 1, a correspondence may be established between the entity full term in the specified text and the one or more target abbreviations, and the correspondence may be stored. When the electronic device performs entity recognition on the text to be recognized in response to the entity recognition request, the text to be recognized may be matched against the correspondence before the text is input into the entity recognition model.
The correspondence may refer to a binding relation between an entity full term and one or more corresponding target abbreviations. The electronic device executing the entity recognition method may store a plurality of correspondences, where different correspondences indicate binding relations between different entity full terms and their corresponding target abbreviations. Based on a correspondence, the one or more corresponding target abbreviations can be found through the entity full term. Therefore, the text to be recognized is matched against the correspondences, and if the text to be recognized hits an entity full term in a correspondence, the one or more target abbreviations corresponding to that entity full term can be used as the entity abbreviations of the text to be recognized based on the correspondence.
The text to be recognized may be matched against the correspondences by determining, in the order of the words contained in the text to be recognized, whether each entity full term in the correspondences is identical to the text to be recognized.
Next, in response to a successful match, the matched target abbreviation is taken as an entity abbreviation contained in the text to be recognized.
If the matching succeeds, it indicates that an entity full term matching the text to be recognized exists among the plurality of correspondences stored in the electronic device. In this case, one or more target abbreviations can be determined as the matched target abbreviations based on the matched entity full term and the correspondence, and these target abbreviations are taken as the entity abbreviations corresponding to the text to be recognized. Since the entity recognition model does not need to be invoked, the efficiency of entity recognition is improved.
Then, in response to a match failure, the text to be recognized is input into the trained entity recognition model to obtain the entity abbreviations contained in the text to be recognized as output by the entity recognition model.
If no entity full term matching the text to be recognized exists among the plurality of correspondences stored in the electronic device, i.e., when the matching fails, entity recognition needs to be performed on the text to be recognized based on the trained entity recognition model: the entity recognition model outputs the predicted label of each word contained in the text to be recognized, and the entity abbreviations contained in the text to be recognized can be obtained based on these predicted labels.
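The lookup-then-fallback flow described above can be sketched as follows; the function and variable names (`recognize_abbreviations`, `model_predict`, the toy correspondence table) are our own illustrative assumptions, not identifiers from the patent:

```python
# Hypothetical sketch of the flow above: try the stored correspondences
# first, and fall back to the trained model only on a miss.
def recognize_abbreviations(text, correspondences, model_predict):
    # Check stored entity full terms against the text, longest first,
    # so that a longer full term is preferred over a shorter one.
    for full_term in sorted(correspondences, key=len, reverse=True):
        if full_term in text:
            # Match succeeded: return the bound target abbreviations
            # without invoking the entity recognition model at all.
            return list(correspondences[full_term])
    # Match failed: fall back to the trained entity recognition model.
    return model_predict(text)

# Toy correspondence table and a trivial stand-in for the model.
table = {"graphics processing unit": ["GPU"]}
result = recognize_abbreviations(
    "the graphics processing unit market", table, lambda t: [])  # ["GPU"]
```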
Based on the same concept as the model training method and entity recognition method provided in one or more embodiments of the present disclosure, the present application further provides a corresponding model training apparatus and entity recognition apparatus, as shown in fig. 7 and fig. 8.
Fig. 7 is a schematic diagram of a model training device provided in the present application, specifically including:
a first obtaining module 600, configured to obtain a specified text, and determine each entity abbreviation included in the specified text;
the reference index determining module 602 is configured to determine, according to a first frequency of the entity abbreviations in the specified text and a second frequency of the entity abbreviations in the universal corpus, reference indexes of the entity abbreviations respectively;
the target abbreviation determining module 604 is configured to determine a target abbreviation corresponding to the specified text from the entity abbreviations according to the reference index of the entity abbreviations;
the label determining module 606 is configured to take the specified text as a training sample and determine the label of the training sample according to the target abbreviation corresponding to the specified text;
and the training module 608 is used for training an entity recognition model according to the training sample and the label of the training sample.
Optionally, the first obtaining module 600 is specifically configured to obtain a target tree; the target tree is constructed based on a plurality of entity full-name words and each reference abbreviation word corresponding to each entity full-name word; and matching the appointed text with the target tree to obtain each entity abbreviation contained in the appointed text.
Optionally, the apparatus further comprises:
the target tree construction module 610 is specifically configured to determine a connection sequence of each node according to an arrangement sequence of each word included in the full term of each entity and an arrangement sequence of each word included in each reference abbreviation; each node is determined by each word contained in the entity full term and each word contained in each reference abbreviation; determining a target node from the nodes, and determining a father node of the target node and a child node of the target node according to the connection sequence of the nodes; and constructing a target tree according to the target node, the father node of the target node and the child node of the target node.
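The node-by-node construction described for the target tree construction module 610 can be sketched as a simple trie; the nested-dictionary representation and the `$abbr` end-of-path marker are our own illustrative assumptions, not part of the patent:

```python
# Illustrative sketch: the target tree as a character-level trie in which
# each entity full term and each of its reference abbreviations contributes
# one path, with parent/child links following the original character order.
def build_target_tree(full_terms_with_abbrs):
    root = {}

    def insert(sequence, abbreviation):
        node = root
        for ch in sequence:                 # connection order = character order
            node = node.setdefault(ch, {})  # child node for this character
        node["$abbr"] = abbreviation        # mark the end of a hit path

    for full_term, ref_abbrs in full_terms_with_abbrs.items():
        for abbr in ref_abbrs:
            insert(full_term, abbr)  # path for the entity full term
            insert(abbr, abbr)       # path for the reference abbreviation
    return root

tree = build_target_tree({"graphics processing unit": ["GPU"]})
```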
Optionally, the first obtaining module 600 is specifically configured to match each word included in the specified text with each node of the target tree in sequence, and determine a target path hit by the specified text in the target tree; and determining each entity abbreviation contained in the appointed text according to the words corresponding to each node on the target path.
Optionally, the first obtaining module 600 is specifically configured to match each word included in the specified text with each node of the target tree in sequence, and determine each path hit by the specified text in the target tree; determining node numbers contained in each path hit by the specified text in the target tree; and determining a target path hit by the specified text in the target tree according to the paths of which the node number is larger than a preset threshold value.
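The path matching with a node-count threshold described above can be sketched as follows, assuming the target tree is represented as a nested-dictionary trie; the names and the tie-breaking rule (keep the longest path exceeding the threshold) are illustrative assumptions:

```python
# Hypothetical sketch of target-path selection: walk the trie from every
# start position in the specified text, keep only hit paths whose node
# count exceeds the preset threshold, and return the longest survivor.
def match_target_path(text, tree, threshold):
    best = None
    for start in range(len(text)):
        node, path = tree, []
        for ch in text[start:]:
            if ch not in node:       # match words/characters in sequence
                break
            node = node[ch]
            path.append(ch)
        if len(path) > threshold and (best is None or len(path) > len(best)):
            best = path
    return best

# Toy trie with a single path G -> P -> U and a threshold of 2 nodes.
toy_tree = {"G": {"P": {"U": {}}}}
hit = match_target_path("new GPU drivers", toy_tree, 2)  # ["G", "P", "U"]
```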
Optionally, the reference index determining module 602 is specifically configured to determine a first weight and at least one second weight corresponding to each entity abbreviation respectively; the first weights are weights corresponding to the first frequencies of the entity abbreviations in the corresponding appointed texts, and each second weight is a weight corresponding to the second frequencies of the entity abbreviations in the corresponding general corpus; and according to the first weight and the second weight, carrying out weighted summation on the first frequency of the entity abbreviation in the appointed text and the second frequency of the entity abbreviation in the universal corpus, and obtaining the reference index of the entity abbreviation.
Fig. 8 is a schematic diagram of an entity identification device provided in the present application, which specifically includes:
A request acquisition module 700, configured to acquire an entity identification request; wherein the entity recognition request corresponds to text to be recognized;
the entity abbreviation determining module 702 is configured to obtain an entity abbreviation corresponding to the text to be recognized by using an entity recognition model; the entity recognition model is obtained based on any model training method.
Optionally, the apparatus further comprises:
the matching module 704 is specifically configured to match the text to be identified with the corresponding relationship in response to the entity identification request; the corresponding relation is used for representing the relation between the entity full title of the appointed text and the corresponding target abbreviation; responding to successful matching, and taking the matched target abbreviation as an entity abbreviation contained in the text to be identified;
optionally, the entity abbreviation determining module 702 is specifically configured to input, in response to a matching failure, the text to be recognized into the trained entity recognition model, and obtain the entity abbreviations included in the text to be recognized output by the entity recognition model.
The present application also provides a computer-readable storage medium storing a computer program operable to perform the model training method shown in fig. 1 and the entity recognition method shown in fig. 6 described above.
The present application also provides a schematic block diagram of the electronic device shown in fig. 9. At the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, as shown in fig. 9, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it, so as to implement the model training method shown in fig. 1 and the entity recognition method shown in fig. 6. Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present application; that is, the execution subject of the following processing flows is not limited to logic units, but may also be hardware or logic devices.
In the 1990s, an improvement to a technology could clearly be distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, improvements to many of today's method flows can be regarded as direct improvements to hardware circuit structures: designers almost always obtain a corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller in pure computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing the various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
All embodiments in the application are described in a progressive manner, and identical and similar parts of all embodiments are mutually referred, so that each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (12)

1. A method of model training, comprising:
acquiring a specified text and determining each entity abbreviation contained in the specified text;
according to the first frequency of each entity abbreviation in the appointed text and the second frequency of each entity abbreviation in the general corpus, respectively determining the reference index of each entity abbreviation;
determining a target abbreviation corresponding to the appointed text from the entity abbreviations according to the reference index of the entity abbreviations;
Taking the appointed text as a training sample, and determining the label of the training sample according to the target abbreviation corresponding to the appointed text;
and training an entity recognition model according to the training sample and the label of the training sample.
2. The method of claim 1, wherein determining each entity abbreviation contained in the specified text comprises:
obtaining a target tree; the target tree is constructed based on a plurality of entity full-name words and each reference abbreviation word corresponding to each entity full-name word;
and matching the appointed text with the target tree to obtain each entity abbreviation contained in the appointed text.
3. The method according to claim 2, wherein the process of constructing the target tree specifically comprises:
determining the connection sequence of each node according to the arrangement sequence of each word contained in the full term of each entity and the arrangement sequence of each word contained in each reference abbreviation; each node is determined by each word contained in the entity full term and each word contained in each reference abbreviation;
determining a target node from the nodes, and determining a father node of the target node and a child node of the target node according to the connection sequence of the nodes;
And constructing a target tree according to the target node, the father node of the target node and the child node of the target node.
4. The method of claim 2, wherein matching the specified text with the target tree to obtain each entity abbreviation contained in the specified text comprises:
sequentially matching each word contained in the specified text with each node of the target tree, and determining a target path hit by the specified text in the target tree;
and determining each entity abbreviation contained in the appointed text according to the words corresponding to each node on the target path.
5. The method of claim 4, wherein determining a target path for the specified text to hit in the target tree, comprises:
sequentially matching each word contained in the appointed text with each node of the target tree, and determining each path hit by the appointed text in the target tree;
determining node numbers contained in each path hit by the specified text in the target tree;
and determining a target path hit by the specified text in the target tree according to the paths of which the node number is larger than a preset threshold value.
6. The method of claim 1, wherein determining the reference index of each entity abbreviation according to the first frequency of each entity abbreviation in the specified text and the second frequency of each entity abbreviation in the generic corpus comprises:
determining a first weight and at least one second weight corresponding to each entity abbreviation respectively; the first weights are weights corresponding to the first frequencies of the entity abbreviations in the corresponding appointed texts, and each second weight is a weight corresponding to the second frequencies of the entity abbreviations in the corresponding general corpus;
and according to the first weight and the second weight, carrying out weighted summation on the first frequency of the entity abbreviation in the appointed text and the second frequency of the entity abbreviation in the universal corpus, and obtaining the reference index of the entity abbreviation.
7. A method of entity identification, comprising:
acquiring an entity identification request; wherein the entity recognition request corresponds to text to be recognized;
obtaining entity abbreviations corresponding to the text to be identified by using an entity identification model; wherein the entity recognition model is obtained based on the model training method according to any one of claims 1 to 6.
8. The method of claim 7, wherein the method further comprises:
responding to the entity identification request, and matching the text to be identified with the corresponding relation; the corresponding relation is used for representing the relation between the entity full title of the appointed text and the corresponding target abbreviation;
responding to successful matching, and taking the matched target abbreviation as an entity abbreviation contained in the text to be identified;
the obtaining the entity abbreviation corresponding to the text to be identified by using the entity identification model comprises the following steps:
and in response to a matching failure, inputting the text to be recognized into the trained entity recognition model to obtain the entity abbreviations contained in the text to be recognized output by the entity recognition model.
9. A model training device, comprising:
the first acquisition module is used for acquiring a specified text and determining each entity abbreviation contained in the specified text;
the reference index determining module is used for determining the reference index of each entity abbreviation according to the first frequency of each entity abbreviation in the appointed text and the second frequency of each entity abbreviation in the universal corpus;
The target abbreviation determining module is used for determining target abbreviations corresponding to the specified texts from the entity abbreviations according to the reference indexes of the entity abbreviations;
the annotation determining module is used for taking the appointed text as a training sample and determining the annotation of the training sample according to the target abbreviation corresponding to the appointed text;
and the training module is used for training the entity recognition model according to the training sample and the label of the training sample.
10. An entity identification device, comprising:
the request acquisition module is used for acquiring an entity identification request; wherein the entity recognition request corresponds to text to be recognized;
the entity abbreviation determining module is used for obtaining entity abbreviations corresponding to the text to be recognized by utilizing an entity recognition model; wherein the entity recognition model is obtained based on the model training method according to any one of claims 1 to 6.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-6 or 7-8.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 or 7-8 when executing the program.
CN202311207574.1A 2023-09-18 2023-09-18 Model training and entity recognition method Pending CN117313727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311207574.1A CN117313727A (en) 2023-09-18 2023-09-18 Model training and entity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311207574.1A CN117313727A (en) 2023-09-18 2023-09-18 Model training and entity recognition method

Publications (1)

Publication Number Publication Date
CN117313727A true CN117313727A (en) 2023-12-29

Family

ID=89259495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311207574.1A Pending CN117313727A (en) 2023-09-18 2023-09-18 Model training and entity recognition method

Country Status (1)

Country Link
CN (1) CN117313727A (en)

Similar Documents

Publication Publication Date Title
CN111488426B (en) Query intention determining method, device and processing equipment
JP6894058B2 (en) Hazardous address identification methods, computer-readable storage media, and electronic devices
CN111310456B (en) Entity name matching method, device and equipment
US8457950B1 (en) System and method for coreference resolution
JP5424001B2 (en) LEARNING DATA GENERATION DEVICE, REQUESTED EXTRACTION EXTRACTION SYSTEM, LEARNING DATA GENERATION METHOD, AND PROGRAM
CN112036162B (en) Text error correction adaptation method and device, electronic equipment and storage medium
CN117235226A (en) Question response method and device based on large language model
CN110363049B (en) Method and device for detecting, identifying and determining categories of graphic elements
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN112417093A (en) Model training method and device
CN116303989A (en) Patent retrieval method, device and equipment for multiple retrieval scenes
CN113887206B (en) Model training and keyword extraction method and device
CN111858860B (en) Search information processing method and system, server and computer readable medium
CN112948449A (en) Information recommendation method and device
CN114626378B (en) Named entity recognition method, named entity recognition device, electronic equipment and computer readable storage medium
CN117313727A (en) Model training and entity recognition method
CN113010573A (en) Incidence relation extraction method and device and electronic equipment
US11934794B1 (en) Systems and methods for algorithmically orchestrating conversational dialogue transitions within an automated conversational system
Shams et al. Lexical intent recognition in urdu queries using deep neural networks
CN115017899B (en) Abbreviation generation method, apparatus, device and storage medium
CN116340469B (en) Synonym mining method and device, storage medium and electronic equipment
CN117034942B (en) Named entity recognition method, device, equipment and readable storage medium
CN117573849B (en) Knowledge graph multi-hop question-answering method, device, equipment and storage medium
CN116070916B (en) Data processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination