CN117077679A - Named entity recognition method and device - Google Patents
- Publication number
- CN117077679A (application number CN202311332338.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The application relates to a named entity recognition method and device. The method comprises: acquiring an expertise database, wherein the expertise database comprises at least two granularity entities; determining a value score corresponding to each granularity entity, determining a target entity from the granularity entities based on the value scores, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity; and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and performing named entity recognition processing on the first text information to be identified to obtain a named entity recognition result. The method enables efficient and accurate recognition of professional named entities in a professional field.
Description
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a named entity recognition method and apparatus.
Background
Named entity recognition (Named Entity Recognition, NER for short) is a key task in natural language processing. Its main purpose is to recognize and classify named entities with specific meaning in text, and it underpins practical applications such as information extraction, question-answering systems, and knowledge graph construction. Both academia and industry therefore urgently need efficient and accurate NER technology.
Some scholars have tried using large language models (Large Language Models, LLMs for short) to assist the named entity recognition task. However, these models mainly target general named entities, such as names of people, places, and organizations, rather than non-general expert knowledge in specific fields (such as astronomy). As a result, existing named entity recognition technology cannot meet professionals' needs for expert knowledge in specific fields.
At present, no effective solution has been proposed for efficiently and accurately recognizing professional named entities in a professional field.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a named entity recognition method and device for solving the above technical problems.
In a first aspect, the present application provides a named entity recognition method. The method comprises the following steps:
acquiring an expertise database; wherein the expertise database comprises at least two granularity entities;
determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
In one embodiment, the expertise database includes at least one data type, each data type corresponding to at least two granularity entities; obtaining the value scores in one-to-one correspondence with the granularity entities comprises:
obtaining a fully trained scoring model; the scoring model is trained on initial examples corresponding to a data type in the expertise database and on initial scores corresponding to the initial examples, and scoring models and data types are in one-to-one correspondence;
and evaluating the granularity entities in the data types based on the scoring model to obtain value scores corresponding to the granularity entities one by one.
In one embodiment, obtaining the initial score includes:
a calculation step: obtaining an initial instruction template based on the initial example and a preset instruction template, and obtaining, based on a preset detection text, initial accuracy rates in one-to-one correspondence with the initial examples;
determining at least one set of current data combinations based on a current initial example of the initial examples, and determining a current accuracy rate corresponding to the current data combinations from the initial accuracy rates;
a scoring step: obtaining a current initial score for the current initial example according to the average value of all the current accuracy rates;
determining at least one group of next data combinations based on the next initial example among the initial examples, and repeating the calculation step and the scoring step until all the initial examples are traversed, to obtain initial scores in one-to-one correspondence with the initial examples.
In one embodiment, after obtaining at least one prompt instruction template, the method further comprises:
acquiring a preset detection text;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
based on the initial recognition result and the prompt instruction template, carrying out matching calculation to obtain a template accuracy result corresponding to the prompt instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
In one embodiment, after obtaining the named entity recognition result, the method further includes:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
In one embodiment, after obtaining the fuzzy entity identification result, the method further includes:
acquiring a potential verification template, judging the fuzzy entity recognition result based on the potential verification template to obtain a judging recognition result, and correcting the fuzzy entity recognition result based on the judging recognition result to obtain a target fuzzy recognition result;
and obtaining a final entity identification result based on the target fuzzy identification result and the named entity identification result.
In one embodiment, acquiring data to be identified includes:
acquiring initial data to be identified;
performing segmentation processing on the initial data to be identified to obtain initial text blocks to be identified, and obtaining the text block similarity between the initial text blocks to be identified based on the spatial distance between them;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain data to be identified.
In one embodiment, after obtaining the named entity recognition result, the method further includes:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to preset display equipment for display processing.
In one embodiment, after the knowledge graph is sent to a preset display device for display processing, the method further includes:
acquiring a database modification instruction aiming at a professional knowledge database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
In a second aspect, the application further provides an entity identification device oriented to the professional field. The device comprises:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
With the above named entity recognition method and device, on the one hand, the prompt instruction template obtained from the target entity in the expertise database and the preset instruction template allows named entity recognition results that more accurately target the professional field to be extracted from the text to be identified; on the other hand, because the target entity that forms the prompt instruction template is selected from the expertise database based on the value score, it serves as a better example for the preset instruction template, making recognition of the data to be identified more accurate and efficient.
Drawings
FIG. 1 is a diagram of an application environment for a named entity recognition method in one embodiment;
FIG. 2 is a flow diagram of a named entity recognition method in one embodiment;
FIG. 3 is a schematic diagram of a knowledge graph, taking the astronomical field as an example, in one embodiment;
FIG. 4 is an optimization diagram based on database modification instructions in another embodiment;
FIG. 5 is a flow chart of a named entity recognition method in accordance with a preferred embodiment;
FIG. 6 is a block diagram of a named entity recognition device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The named entity recognition method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. First, an expertise database comprising granularity entities is acquired. Second, a value score corresponding to each granularity entity is determined, a target entity is determined from the granularity entities based on the value scores, and a prompt instruction template is obtained according to the target entity and a preset instruction template. Finally, first text information to be identified is generated according to the prompt instruction template and the data to be identified, and named entity recognition processing is performed on it to obtain a named entity recognition result. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device, or portable wearable device. The internet of things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a named entity recognition method is provided. Taking application of the method to the server in FIG. 1 as an example, the method includes the following steps:
Step S202, acquiring an expertise database; wherein the expertise database comprises at least two granularity entities.
The expertise database is built by analyzing and labeling a small amount of literature data in the astronomical field with the help of expert knowledge. Specifically, granularity entities are labeled manually by experts to construct a multi-granularity entity labeling data set, and the structural relations among the entities are defined, forming an expertise database for the professional field rather than the general field. The granularity of a granularity entity reflects the level of detail of the entity in the database. For example, in an expertise database for astronomy, the granularity entities may include coarser granularities, such as celestial body names and telescope names, as well as finer granularities, such as the Sun and the Moon.
Step S204, determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity.
The value scores are in one-to-one correspondence with the granularity entities. The granularity entities may be scored manually or by a fully trained neural network. The value score indicates the degree of matching between a granularity entity and the preset instruction template. After the value scores of all granularity entities are determined, the granularity entities with high value scores are selected and combined with the preset instruction template to obtain a prompt instruction template, which is used to obtain the corresponding named entity recognition result for the data to be identified.
Step S206, based on the prompt instruction template and the acquired data to be identified, generating first text information to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
The data to be identified may be a passage of text or a document in the professional field. The data to be identified is nested in the prompt instruction template to obtain the first text information to be identified; it can be understood that different named entity information can be extracted from the data to be identified with different prompt instruction templates. Preferably, a fully trained large language model performs the named entity recognition processing on the first text information to be identified, where the processing corresponds to the granularity and type of the target entity in the prompt instruction template.
Through steps S202 to S206, and unlike the prior art, which generally improves the format of the obtained named entity recognition result to produce more standardized and professional output, the present application optimizes the target entity in the prompt instruction template to recognize the text information to be identified more accurately. Furthermore, because the expertise database contains entities of multiple granularities, relevant technicians can, in practical application, screen the target entities and their granularity to obtain multi-granularity named entity recognition results while maintaining the accuracy of those results.
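Steps S204 and S206 can be illustrated with a minimal sketch. The template wording, the function names `select_target_entities` and `build_first_text`, the `top_k` parameter, and the sample scores are all assumptions made for illustration; the application does not specify them. The sketch only shows target entities being chosen by value score and nested, together with the data to be identified, into a preset instruction template.

```python
from string import Template

# Hypothetical preset instruction template; the wording is illustrative only.
PRESET_TEMPLATE = Template(
    "Extract every $entity_type from the text.\n"
    "Examples of $entity_type: $examples\n"
    "Text: $text"
)


def select_target_entities(value_scores, top_k=2):
    """Return the top_k granularity entities, highest value score first."""
    ranked = sorted(value_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


def build_first_text(entity_type, value_scores, data_to_identify, top_k=2):
    """Fill the preset template with target entities and the data to be identified."""
    targets = select_target_entities(value_scores, top_k)
    return PRESET_TEMPLATE.substitute(
        entity_type=entity_type,
        examples=", ".join(targets),
        text=data_to_identify,
    )


# Made-up value scores for three granularity entities of one data type.
scores = {"Sun": 0.91, "Moon": 0.85, "Mars": 0.40}
first_text = build_first_text(
    "celestial body name", scores, "We observed the Moon and M31."
)
```

The resulting `first_text` would then be passed to a large language model for the named entity recognition processing, which is outside the scope of this sketch.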
In one embodiment, the expertise database includes at least one data type, each data type corresponding to at least two granularity entities; obtaining the value scores in one-to-one correspondence with the granularity entities comprises:
obtaining a fully trained scoring model; the scoring model is trained on initial examples corresponding to a data type in the expertise database and on initial scores corresponding to the initial examples, and scoring models and data types are in one-to-one correspondence;
and evaluating the granularity entities in the data types based on the scoring model to obtain value scores corresponding to the granularity entities one by one.
Specifically, the expertise database includes a plurality of data types, each corresponding to a large number of granularity entities. Taking the astronomical field as an example, the data types may include celestial body names, right ascension, declination, and the like. For one data type, a plurality of initial examples corresponding to the preset instruction template are acquired together with the initial scores corresponding to those initial examples; the initial scores may be set manually or obtained based on a preset detection text. The scoring model is trained on the initial examples and initial scores to obtain a fully trained scoring model, which then evaluates the remaining granularity entities in the corresponding data type to obtain their value scores. In this way, the value scores of all granularity entities in the data type corresponding to the instruction template can be obtained flexibly and quickly with high accuracy; only a small amount of manual labeling is needed, which saves labor cost and computing resources.
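The workflow of this embodiment (one scoring model per data type, fitted on a few scored initial examples and then applied to the remaining granularity entities) can be sketched as follows. The application does not state what kind of model is used, so the `NearestExampleScorer` class below is a deliberately simple stand-in that predicts a string-similarity-weighted mean of the initial scores; the class, the sample entities, and the scores are all hypothetical.

```python
from difflib import SequenceMatcher


class NearestExampleScorer:
    """Stand-in scoring model (an assumption, not the application's model).

    Predicts a similarity-weighted mean of the initial scores.
    """

    def fit(self, initial_examples, initial_scores):
        self.examples = list(initial_examples)
        self.scores = list(initial_scores)
        return self

    def predict(self, entity):
        weights = [
            SequenceMatcher(None, entity, example).ratio()
            for example in self.examples
        ]
        total = sum(weights)
        if total == 0:
            # No overlap with any initial example: fall back to the plain mean.
            return sum(self.scores) / len(self.scores)
        return sum(w * s for w, s in zip(weights, self.scores)) / total


# One scoring model per data type, as the embodiment describes.
models = {
    "celestial name": NearestExampleScorer().fit(["NGC 224", "NGC 598"], [0.9, 0.7]),
}

# Score the remaining granularity entities of that data type.
remaining = ["NGC 225", "Sun"]
value_scores = {e: models["celestial name"].predict(e) for e in remaining}
```

Any regression model trained on richer features could take the place of the stand-in; the point of the sketch is only the fit-on-few, score-the-rest structure.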
In one embodiment, obtaining the initial score includes:
a calculation step: obtaining an initial instruction template based on the initial example and a preset instruction template, and obtaining, based on a preset detection text, initial accuracy rates in one-to-one correspondence with the initial examples;
determining at least one set of current data combinations based on a current initial example of the initial examples, and determining a current accuracy rate corresponding to the current data combinations from the initial accuracy rates;
a scoring step: obtaining a current initial score for the current initial example according to the average value of all the current accuracy rates;
determining at least one group of next data combinations based on the next initial example among the initial examples, and repeating the calculation step and the scoring step until all the initial examples are traversed, to obtain initial scores in one-to-one correspondence with the initial examples.
Specifically, the expected marginal gain in data value from adding a single initial example to a data combination may serve as the initial score. First, the initial accuracy rate corresponding to each initial example is determined; it may be set manually, or obtained by nesting the detection text in the initial instruction template and evaluating the recognition result for the detection text. After the initial accuracy rates are obtained, data combinations are formed. For example, once the current initial example and its current initial accuracy rate are determined, another initial example is added, and the accuracy rate of the combination of the two examples is calculated from the accuracy rate of the other initial example and the current initial accuracy rate. In the same way, every combination that includes the current initial example is enumerated and calculated once, and the average of the resulting current accuracy rates is the current initial score of the current initial example. By analogy, the initial scores of all initial examples can be calculated. This yields more accurate initial scores: averaging over multiple data combinations lets the initial score better reflect the degree of matching between an initial example and its initial instruction template.
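The calculation and scoring steps can be sketched as below. The application does not say how the accuracy rate of a data combination is derived from the accuracy rates of its members, so this sketch assumes it is their arithmetic mean; the function name `initial_scores` and the sample accuracy rates are likewise hypothetical.

```python
from itertools import combinations
from statistics import mean


def initial_scores(initial_accuracy):
    """Average, for each initial example, the accuracy of every combination containing it.

    Assumption: the accuracy of a combination is the mean of its members'
    initial accuracy rates (the source leaves this calculation unspecified).
    """
    names = list(initial_accuracy)
    scores = {}
    for current in names:
        others = [n for n in names if n != current]
        combo_accuracies = []
        # Enumerate every subset of the other examples, each joined with `current`.
        for k in range(len(others) + 1):
            for rest in combinations(others, k):
                combo = (current, *rest)
                combo_accuracies.append(mean(initial_accuracy[n] for n in combo))
        scores[current] = mean(combo_accuracies)
    return scores


# Made-up initial accuracy rates for three initial examples.
accuracy = {"Sun": 0.9, "Moon": 0.6, "M31": 0.3}
scores = initial_scores(accuracy)
```

Enumerating all combinations grows exponentially with the number of initial examples, so with more than a handful of examples a sampled subset of combinations would likely be needed.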
In one embodiment, after obtaining at least one prompt instruction template, the method further comprises:
acquiring a preset detection text; the detection text comprises preset labeling information;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
performing matching calculation based on the initial recognition result and the labeling information to obtain a template accuracy result corresponding to the prompting instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
Specifically, after a plurality of prompt instruction templates are obtained, they are tested against a preset detection text that has been labeled in advance: an initial recognition result is obtained for the detection text and matched against the labeling information in the detection text to obtain a template accuracy result. The labeling information is generally added manually and indicates the prompt instruction template corresponding to each entity in the detection text. If the accuracy result of a template is smaller than a preset accuracy threshold, the template is judged not to meet the standard and is deleted; if it is greater than or equal to the threshold, the template is retained. The accuracy threshold can be set freely by the user and is generally set to 0.8. In this way, the plurality of prompt instruction templates are screened, and the preset instruction template and its corresponding target examples are tested together, yielding more accurate and efficient prompt instruction templates.
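The screening described above can be sketched as follows. The matching calculation is not defined in the application, so this sketch assumes the template accuracy is the fraction of labeled entities that appear in the initial recognition result; the function names and sample data are hypothetical, while the 0.8 default threshold comes from the text.

```python
def template_accuracy(predicted, labeled):
    """Fraction of labeled entities recovered (an assumed matching calculation)."""
    labeled = set(labeled)
    if not labeled:
        return 0.0
    return len(set(predicted) & labeled) / len(labeled)


def screen_templates(initial_results, labeled, threshold=0.8):
    """Keep templates whose accuracy is greater than or equal to the threshold."""
    kept = {}
    for template_id, predicted in initial_results.items():
        accuracy = template_accuracy(predicted, labeled)
        if accuracy >= threshold:
            kept[template_id] = accuracy
    return kept


# Labeling information of the detection text, and two made-up recognition results.
labeled = ["Sun", "Moon", "M31", "NGC 224", "Vega"]
initial_results = {
    "template_a": ["Sun", "Moon", "M31", "NGC 224"],
    "template_b": ["Sun", "telescope"],
}
target_templates = screen_templates(initial_results, labeled)
```

Here `template_a` recovers four of five labeled entities and is retained, while `template_b` recovers one and is deleted.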
In one embodiment, after obtaining the named entity recognition result, the method further includes:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
Specifically, the entity category information contained in the data to be identified is determined according to the named entity identification result, and the entity category information includes, but is not limited to, categories of celestial body names, declination corresponding to celestial bodies and the like, taking the astronomical field as an example. After determining the entity types contained in the data to be identified, correspondingly determining regular expressions, wherein the following table is an example of the regular expressions corresponding to different entity types:
Taking the astronomical field as an example, named entities mostly follow special formats, such as capital English letters plus numbers plus special characters, or numbers plus special characters plus numbers. Therefore, fuzzy retrieval is performed over the full text of the data to be identified based on the regular expression to supplement the named entity recognition result and mine potential named entities that share the same text pattern; the named entity recognition result and the fuzzy entity recognition result are then combined to obtain the final entity recognition result. This supplements the named entity recognition result and makes it more comprehensive. Furthermore, because the regular expression is determined from the named entity recognition result, wasted computation is avoided and retrieval is faster.
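A minimal sketch of this fuzzy retrieval follows. The regular expressions in `CATEGORY_PATTERNS` are hypothetical stand-ins written for illustration, not the expressions used in the application, and combining the two results as a set union is an assumption about how the final result is formed.

```python
import re

# Hypothetical patterns per entity category; not taken from the application.
CATEGORY_PATTERNS = {
    "celestial name": re.compile(r"\b(?:NGC|IC|M)\s?\d+\b"),
    "declination": re.compile(r"[+-]\d{2}:\d{2}:\d{2}(?:\.\d+)?"),
}


def fuzzy_retrieve(text, ner_result):
    """Search the text only with patterns for categories already found by NER."""
    categories = {category for _, category in ner_result}
    found = set()
    for category in categories:
        pattern = CATEGORY_PATTERNS.get(category)
        if pattern is not None:
            found.update((m.group(0), category) for m in pattern.finditer(text))
    return found


text = "NGC 224 and M33 were imaged; NGC 224 lies at declination +41:16:09."
ner_result = {("NGC 224", "celestial name")}

fuzzy_result = fuzzy_retrieve(text, ner_result)
final_result = ner_result | fuzzy_result  # assumed combination: set union
```

Because the named entity recognition result contains only the "celestial name" category, the declination pattern is never run, which is the computation saving the paragraph above describes.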
In one embodiment, after obtaining the fuzzy entity identification result, the method further includes:
acquiring a potential verification template, judging the fuzzy entity recognition result based on the potential verification template to obtain a judging recognition result, and correcting the fuzzy entity recognition result based on the judging recognition result to obtain a target fuzzy recognition result;
and obtaining a final entity identification result based on the target fuzzy identification result and the named entity identification result.
Specifically, the potential verification template is used to help a machine learning model evaluate and improve the extracted result. Taking the astronomical field as an example, the following table is an example of a potential verification template driving a large language model in one embodiment:
The fuzzy entity mined with the regular expression, as the object of the question, and the data to be identified are embedded in the corresponding potential verification template. The machine learning model judges the filled template and is required either to confirm the correct named entity or to change an incorrect named entity into the correct one, which serves as the target fuzzy recognition result. Although fuzzy retrieval covers a wider search range, its accuracy is slightly lower than that of the entity recognition result obtained with the prompt instruction template; the fuzzy entities obtained by fuzzy retrieval are therefore given this supplementary verification, which maintains high accuracy while widening the search range.
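The verification loop can be sketched as follows. The template wording in `VERIFY_TEMPLATE`, the yes/no/correction answer convention, and the `stub_judge` function are all assumptions; in practice `judge` would be a call to a large language model, which is replaced here by a stub so the sketch stays self-contained.

```python
# Hypothetical potential verification template; the wording is illustrative only.
VERIFY_TEMPLATE = (
    "Text: {text}\n"
    'Is "{candidate}" a {category} in this text? '
    "Answer yes, no, or give the corrected entity."
)


def verify_fuzzy(fuzzy_result, text, judge):
    """Keep confirmed entities, replace corrected ones, drop rejected ones."""
    target = set()
    for candidate, category in fuzzy_result:
        prompt = VERIFY_TEMPLATE.format(
            text=text, candidate=candidate, category=category
        )
        answer = judge(prompt).strip()
        if answer.lower() == "yes":
            target.add((candidate, category))
        elif answer and answer.lower() != "no":
            # Any other non-empty answer is treated as the corrected entity.
            target.add((answer, category))
    return target


def stub_judge(prompt):
    """Stand-in for a large language model call (assumption for this sketch)."""
    if '"M33"' in prompt:
        return "yes"
    if '"NGC224"' in prompt:
        return "NGC 224"
    return "no"


fuzzy_result = {
    ("M33", "celestial name"),
    ("NGC224", "celestial name"),
    ("IC 5", "celestial name"),
}
target_fuzzy = verify_fuzzy(fuzzy_result, "NGC224 and M33 were imaged.", stub_judge)
```

With the stub, `M33` is confirmed, `NGC224` is corrected to `NGC 224`, and `IC 5` is rejected.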
In one embodiment, acquiring data to be identified includes:
acquiring initial data to be identified;
performing segmentation processing on the initial data to be identified to obtain initial text blocks to be identified, and obtaining the text block similarity between the initial text blocks to be identified based on the spatial distance between them;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain data to be identified.
Specifically, the initial data to be identified is usually a long document, and in practical applications the template has a word limit. The initial data to be identified is therefore segmented into a plurality of initial text blocks to be identified for use by the named entity recognition task. A search is then performed in advance over these text blocks according to the specific named entity recognition requirement, text blocks with relatively high correlation and relatively close spatial distance are extracted, and their content is summarized or the blocks are spliced together to obtain the data to be identified. This makes practical operation simpler and more flexible, and because recognition is performed on several highly correlated text blocks together, it further improves the efficiency of named entity recognition.
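The segmentation and splicing described above might look like the following sketch. It assumes that "spatial distance" means distance in a vector space and substitutes a simple bag-of-words cosine similarity for whatever embedding the embodiment uses; the block size and threshold are arbitrary illustrative values.

```python
import math
import re
from collections import Counter


def split_into_blocks(text, max_words=50):
    """Cut the text into consecutive blocks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two blocks."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def splice_similar_blocks(blocks, threshold=0.3):
    """Join each block to the previous group when it is similar to that group's last block."""
    if not blocks:
        return []
    groups = [[blocks[0]]]
    for block in blocks[1:]:
        if cosine_similarity(groups[-1][-1], block) >= threshold:
            groups[-1].append(block)
        else:
            groups.append([block])
    return [" ".join(group) for group in groups]
```

Only neighbouring blocks are compared here, which keeps spliced text in document order; comparing all pairs would be an equally valid reading of the description.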
In one embodiment, after obtaining the named entity recognition result, the method further includes:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to preset display equipment for display processing.
Specifically, fig. 3 is a schematic diagram of a knowledge graph in one embodiment, taking the astronomical field as an example. The specific content of the knowledge graph may be set by the relevant technician; as shown in fig. 3, it may include data such as paper titles, celestial object names and keywords. Further, to give the user a better viewing experience in practice, when the data to be identified is literature in a professional field, the topic, authors and author institutions of the literature are extracted from the meta information provided by the literature management website as general academic knowledge entities, and these are combined with the named entity recognition result of the professional field to complete the knowledge graph. This makes the named entity recognition result easier to view, fits the user's actual application closely, and adapts to a wider range of application scenarios.
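One simple way to picture combining general academic entities with the recognized entities is a list of triples centred on the paper title. The field names (`title`, `authors`, `institutions`) and the relation labels are hypothetical; the embodiment does not specify a graph schema.

```python
def build_knowledge_graph(meta, entities):
    """Return (subject, relation, object) triples for one piece of literature.

    meta holds the general academic entities taken from the meta information;
    entities is the named entity recognition result of the professional field.
    """
    title = meta["title"]
    triples = []
    for author in meta.get("authors", []):
        triples.append((title, "has_author", author))
    for institution in meta.get("institutions", []):
        triples.append((title, "has_institution", institution))
    for entity in entities:
        triples.append((title, "mentions_" + entity["category"], entity["text"]))
    return triples
```

A triple list of this kind can be loaded into most graph stores or drawn directly by a display component.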
In one embodiment, after the knowledge graph is sent to a preset display device for display processing, the method further includes:
acquiring a database modification instruction aiming at a professional knowledge database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
Specifically, fig. 4 is a schematic diagram of optimization based on a database modification instruction in one embodiment. The user submits a request to the named entity recognition platform according to the user's own entity recognition needs; the request includes, but is not limited to, the data to be identified and a prompt instruction template. Using the method set forth above, the named entity recognition platform obtains the first text information to be identified based on the data to be identified and the prompt instruction template supplied by the user, and performs named entity recognition processing on it to obtain the named entity recognition result. After the user views the corresponding knowledge graph on the display device, the knowledge graph serves as feedback on the recognition result, and the expertise database is modified according to it, so that the knowledge database is further improved; this is the knowledge optimization shown in fig. 4. In this way, a personalized expertise database can be obtained for each user according to that user's changes, which further improves the efficiency and accuracy of entity recognition.
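A database modification instruction could be applied as in the sketch below. The instruction format (an `op` of `add`, `remove` or `rename`, a `data_type`, and an `entity`) and the dictionary layout of the database are assumptions introduced purely for illustration.

```python
import copy


def apply_modification(database, instruction):
    """Return an updated copy of the database; the input is left unchanged.

    database maps a data type to its list of granularity entities.
    instruction is a hypothetical dict with keys op, data_type, entity,
    plus new_entity for a rename.
    """
    updated = copy.deepcopy(database)
    data_type = instruction["data_type"]
    entities = updated.setdefault(data_type, [])
    op = instruction["op"]
    if op == "add":
        if instruction["entity"] not in entities:
            entities.append(instruction["entity"])
    elif op == "remove":
        if instruction["entity"] in entities:
            entities.remove(instruction["entity"])
    elif op == "rename":
        updated[data_type] = [
            instruction["new_entity"] if e == instruction["entity"] else e
            for e in entities
        ]
    else:
        raise ValueError("unknown operation: " + str(op))
    return updated
```

Returning a copy rather than editing in place makes it straightforward to keep one target database per user.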
The embodiment also provides a specific embodiment of a named entity recognition method, as shown in fig. 5, and fig. 5 is a flow chart of the named entity recognition method in a preferred embodiment.
Step S501, an expertise database is constructed. The expertise database is constructed by analyzing and labeling a small amount of literature data in the professional field with the help of expert knowledge. Specifically, under expert guidance a small number of granularity entities in the database are manually labeled, and the structure is sorted out by means of a professional platform, so that a multi-granularity entity labeling result is constructed, the structural relations among the granularity entities are defined, and the expertise database of the professional field is formed.
Step S502, a preset instruction template is obtained. Preferably, the preset instruction template may be a prompt instruction template constructed manually. Alternatively, starting from a small number of manually constructed prompt instruction templates, the template set is input into a large language model, and new prompt instructions are generated using the text generation capability of the large language model so as to optimize and supplement the template set, where the specific instruction given to the large language model may be: "According to the task target and the provided samples, please generate a plurality of templates for the task." The following table is a schematic representation of a preset instruction template structure driving a large language model in one embodiment:
Step S503, a named entity recognition result is obtained based on the large language model. After a plurality of preset instruction templates are acquired, several pieces of data are selected as the target examples by means of a value evaluation technique, and the prompt instruction templates are built from them together with the preset instruction templates. Specifically, for a single preset instruction template, N initial examples related to that template are randomly selected from the corresponding data type in the database, and their value scores are initialized to 0. All possible data combinations of the N initial examples are enumerated; each data combination is used as the examples of a prompt instruction, its data value is measured by the accuracy of the downstream named entity recognition task, and the marginal gain in data value obtained when a single piece of data is added to a combination is used to compute its initial score. For example, assume the N initial examples are A, B and C, and each example has its own initial accuracy V (V1, V2 and V3 respectively, with V12, V23 and V123 denoting the accuracy of the corresponding combinations). When A is selected alone, its value is V1. When A is added to B, the accuracy attributed to A under the combination {A, B} is V12 - V2, that is, its marginal gain; similarly, the accuracy attributed to A under the combination {A, B, C} is V123 - V23. All combinations involving A are enumerated and calculated in this way, and the average of the resulting values is the current initial score of the current example. The other initial examples are treated in the same way, which yields the initial scores described above. A regression model is then trained on the initial examples and their initial scores to predict the scores of the remaining data of the corresponding data type in the database, and the several pieces of data with the highest scores are selected as the target examples.
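The initial-score calculation in the A, B, C example can be sketched as follows. The `accuracy` callable is a hypothetical stand-in for running the downstream named entity recognition task with a given combination of examples and measuring its accuracy; the function name `initial_scores` is also an assumption.

```python
from itertools import combinations


def initial_scores(examples, accuracy):
    """Average marginal accuracy gain of each example over all combinations.

    examples is a list of hashable example identifiers.
    accuracy maps a frozenset of examples to the accuracy obtained when that
    combination is used as the prompt examples (0 for the empty set).
    """
    scores = {}
    for current in examples:
        others = [e for e in examples if e != current]
        gains = []
        # Enumerate every subset of the other examples, including the empty one.
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                base = frozenset(subset)
                gains.append(accuracy(base | {current}) - accuracy(base))
        scores[current] = sum(gains) / len(gains)
    return scores
```

For A this averages V1, V12 - V2, V13 - V3 and V123 - V23, matching the description. Because the number of combinations doubles with each added example, N has to stay small, which is consistent with scoring only N sampled examples and leaving the rest to the regression model.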
The plurality of prompt instruction templates are obtained based on the target examples and the preset instruction templates. For the existing literature in the professional field, a unified Langchain text object is constructed, and the data to be identified is sliced and stored in a vector database for use by the named entity recognition task. Then, according to the specific named entity recognition requirement put forward by the user, the relevant templates are selected from the plurality of prompt instruction templates, several text blocks with high relevance are extracted and their content is summarized, and the result is embedded into the prompt instruction template to form the text information to be identified, which can be put to the large language model as a question. The text information to be identified is fed into the large language model to obtain a list of named entities returned in JSON format, namely the named entity recognition result.
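The embedding of text into a prompt instruction template and the handling of the JSON answer can be sketched with the standard library alone. The placeholder names `{examples}` and `{text}` and the expected JSON shape (a list of objects with `text` and `category` keys) are assumptions; the call to the large language model itself is left out.

```python
import json


def build_prompt(template, examples, text):
    """Fill a prompt instruction template with target examples and the text block."""
    example_lines = "\n".join("- " + example for example in examples)
    return template.format(examples=example_lines, text=text)


def parse_entity_list(response):
    """Parse the model's answer; return [] when it is not a usable JSON list."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only well-formed items so later steps can rely on both keys.
    return [
        item for item in data
        if isinstance(item, dict) and "text" in item and "category" in item
    ]
```

Treating a malformed answer as an empty result is one possible policy; retrying the request would be another.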
On this basis, a corresponding regular expression is obtained according to the entity category information of the named entity recognition result, a fuzzy search is performed over the full text of the data to be identified according to the regular expression, and potential named entities sharing the same text pattern are mined. The fuzzy entities are then taken as question objects and embedded, together with the data to be identified, into the corresponding potential verification templates; the large language model judges each filled-in template and is required to feed back the correct, or corrected, named entity, which is added to the named entity recognition result. Further, an exact search is performed on the returned named entity recognition result: for named entities of each category, forward/reverse maximum matching word segmentation and multi-pattern exact string matching are performed using daratch to verify whether each named entity actually appears in the data to be identified. If it does, it is retained; if not, it is deleted, which forms the final entity recognition result. The following table is code for an exact search using darmatich:
Step S504, a knowledge graph is constructed. The meta information provided by the literature management website when the professional literature was collected is extracted as general knowledge entities; these are combined with the final entity recognition result obtained above to build an instance graph model, which is sent to the preset display device corresponding to the user for display.
Step S505, the expertise database is optimized based on user feedback. The user's satisfaction with and opinions on the knowledge graph are taken as feedback and returned to the definition and structure of the professional knowledge in the database, which is optimized accordingly. Further, those skilled in the art will understand that optimization of the database is not limited to optimization based on the knowledge graph, and may also be performed by a technician when needed.
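The code table of the embodiment is not reproduced in this text. As an independent illustration only, and without using the library named above, the following plain-Python sketch shows forward maximum matching used to check which recognized entities actually occur in the data to be identified; all function names are hypothetical.

```python
def forward_maximum_match(text, dictionary):
    """Scan left to right, taking the longest dictionary entry at each position."""
    max_len = max((len(word) for word in dictionary), default=0)
    hits = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in dictionary:
                hits.append(piece)
                i += length
                break
        else:
            # No dictionary entry starts here; move one character forward.
            i += 1
    return hits


def filter_by_exact_match(text, entities):
    """Keep only the entities whose text is found in the data to be identified."""
    dictionary = {entity["text"] for entity in entities}
    present = set(forward_maximum_match(text, dictionary))
    return [entity for entity in entities if entity["text"] in present]
```

Because the longest match consumes its characters, an entity that only occurs inside a longer matched entity is not reported by this scan; a multi-pattern matcher, as mentioned in the description, would be needed to catch such overlaps.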
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially either, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a named entity recognition device for realizing the named entity recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the device for identifying a named entity provided below may refer to the limitation of the method for identifying a named entity hereinabove, and will not be described herein.
In one embodiment, as shown in fig. 6, there is provided a named entity recognition apparatus, including an acquisition module, a computing module and a generation module, wherein:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
Specifically, the acquisition module acquires the expertise database, which contains a large number of granularity entities, and sends it to the computing module. The computing module obtains the value score of each granularity entity, takes the granularity entities with high value scores as target entities, and combines the target entities with a preset instruction template to obtain a prompt instruction template. The generation module embeds the data to be identified into the prompt instruction template to generate the first text information to be identified, and performs named entity recognition processing on the first text information to be identified by machine learning to obtain the named entity recognition result.
With this device, accurate named entity recognition results are generated in batches based on accurate target examples, with little or no human participation. Furthermore, the method is vertical, targeting a professional field rather than the general field: it takes the multi-granularity entity examples in the expertise database as its object and expert rules as guidance, and automatically learns and identifies the important relevant named entities in the existing literature, so that a large number of named entity recognition results are obtained quickly and efficiently.
The above named entity recognition device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
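How the three modules hand data to one another can be pictured with the sketch below. The class names follow the module names in the text, but the constructor parameters, the `{examples}` and `{text}` placeholders, the `score` callable and the `recognizer` callable are assumptions made for illustration.

```python
class AcquisitionModule:
    def __init__(self, database):
        self._database = database

    def acquire(self):
        """Return the expertise database (data type -> granularity entities)."""
        return self._database


class ComputingModule:
    def __init__(self, score, preset_template, top_k=2):
        self._score = score
        self._preset_template = preset_template
        self._top_k = top_k

    def build_prompt_template(self, database):
        """Pick the highest-scoring entities as targets and put them in the template."""
        entities = [e for group in database.values() for e in group]
        targets = sorted(entities, key=self._score, reverse=True)[: self._top_k]
        return self._preset_template.replace("{examples}", ", ".join(targets))


class GenerationModule:
    def __init__(self, recognizer):
        self._recognizer = recognizer

    def recognize(self, prompt_template, data):
        """Build the first text information to be identified and run recognition on it."""
        first_text = prompt_template.replace("{text}", data)
        return first_text, self._recognizer(first_text)
```

In use, the output of `acquire` feeds `build_prompt_template`, whose result feeds `recognize`; the recognizer would wrap the large language model call.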
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, a specialized knowledge database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The expertise database according to embodiments of the present application may include at least one of a relational expertise database and a non-relational expertise database. The non-relational expertise database may include, but is not limited to, a blockchain-based distributed expertise database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.
Claims (10)
1. A named entity recognition method, the method comprising:
acquiring a professional knowledge database; wherein the expertise database comprises at least two granularity entities;
determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
2. The method of claim 1, wherein said expertise database includes at least one data type, each of said data types corresponding to at least two of said granularity entities; the determining a value score corresponding to the granular entity includes:
obtaining a scoring model with complete training; the scoring model is obtained through training according to an initial example corresponding to the data type in the expertise database and an initial score corresponding to the initial example, and the scoring model and the data type are in one-to-one correspondence;
and evaluating the granularity entities in the data type based on the scoring model to obtain the value scores corresponding to the granularity entities one by one.
3. The method of claim 2, wherein obtaining the initial score comprises:
the calculation steps are as follows: obtaining an initial instruction template based on the initial example and the preset instruction template, and obtaining initial accuracy corresponding to the initial example one by one based on a preset detection text;
determining at least one set of current data combinations based on a current one of the initial examples, and determining a current accuracy rate corresponding to the current data combination from the initial accuracy rates;
scoring: obtaining a current initial score for the current initial example according to the average value of all the current accuracy rates;
determining at least one group of next data combinations based on the next initial examples in the initial examples, and repeating the calculating step and the grading step until all the initial examples are traversed, so as to obtain the initial grading which corresponds to the initial examples one by one.
4. The method of claim 1, wherein after the obtaining at least one hint instruction template, the method further comprises:
acquiring a preset detection text; the detection text comprises preset labeling information;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
performing matching calculation based on the initial recognition result and the labeling information to obtain a template accuracy result corresponding to the prompt instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
5. The method of claim 1, wherein after obtaining the named entity recognition result, the method further comprises:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
and searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
6. The method of claim 5, wherein after the ambiguous entity recognition result is obtained, the method further comprises:
acquiring a potential verification template, judging the fuzzy entity identification result based on the potential verification template to obtain a judging identification result, and correcting the fuzzy entity identification result based on the judging identification result to obtain a target fuzzy identification result;
and obtaining the final entity recognition result based on the target fuzzy recognition result and the named entity recognition result.
7. The method of claim 1, wherein the obtaining the data to be identified comprises:
acquiring initial data to be identified;
the initial to-be-identified data is subjected to segmentation processing to obtain initial to-be-identified text blocks, and the similarity of the text blocks among the initial to-be-identified text blocks is obtained based on the space distance among the initial to-be-identified text blocks;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain the data to be identified.
8. The method of claim 1, wherein after obtaining the named entity recognition result, the method further comprises:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to a preset display device for display processing.
9. The method of claim 8, wherein after the sending the knowledge-graph to a preset display device for display processing, the method further comprises:
acquiring a database modification instruction aiming at the expertise database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
10. An entity recognition device oriented to a professional field, the device comprising:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311332338.2A CN117077679B (en) | 2023-10-16 | 2023-10-16 | Named entity recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311332338.2A CN117077679B (en) | 2023-10-16 | 2023-10-16 | Named entity recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117077679A true CN117077679A (en) | 2023-11-17 |
CN117077679B CN117077679B (en) | 2024-03-12 |
Family
ID=88708380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311332338.2A Active CN117077679B (en) | 2023-10-16 | 2023-10-16 | Named entity recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117077679B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704633A (en) * | 2019-09-04 | 2020-01-17 | 平安科技(深圳)有限公司 | Named entity recognition method and device, computer equipment and storage medium |
CN113449113A (en) * | 2020-03-27 | 2021-09-28 | 京东数字科技控股有限公司 | Knowledge graph construction method and device, electronic equipment and storage medium |
WO2022048210A1 (en) * | 2020-09-03 | 2022-03-10 | 平安科技(深圳)有限公司 | Named entity recognition method and apparatus, and electronic device and readable storage medium |
CN114186013A (en) * | 2021-12-15 | 2022-03-15 | 广州华多网络科技有限公司 | Entity recognition model hot updating method and device, equipment, medium and product thereof |
CN115409111A (en) * | 2022-08-31 | 2022-11-29 | 中国工商银行股份有限公司 | Training method of named entity recognition model and named entity recognition method |
WO2022252378A1 (en) * | 2021-05-31 | 2022-12-08 | 平安科技(深圳)有限公司 | Method and apparatus for generating medical named entity recognition model, and computer device |
CN116484867A (en) * | 2023-04-19 | 2023-07-25 | 平安科技(深圳)有限公司 | Named entity recognition method and device, storage medium and computer equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117725192A (en) * | 2024-02-18 | 2024-03-19 | 张家港快工品科技有限公司 | Special industrial information interaction calling method based on langchain |
CN117725192B (en) * | 2024-02-18 | 2024-05-14 | 张家港快工品科技有限公司 | Langchain-based proprietary industrial information interaction calling method |
Also Published As
Publication number | Publication date |
---|---|
CN117077679B (en) | 2024-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||