CN117077679A - Named entity recognition method and device - Google Patents

Info

Publication number
CN117077679A
Authority
CN
China
Prior art keywords
initial
entity
identified
data
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311332338.2A
Other languages
Chinese (zh)
Other versions
CN117077679B (en)
Inventor
张睿
李清明
姬朋立
严笑然
胡耀华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311332338.2A
Publication of CN117077679A
Application granted
Publication of CN117077679B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a named entity recognition method and device. The method comprises the following steps: acquiring an expertise database, wherein the expertise database comprises at least two granularity entities; determining a value score corresponding to each granularity entity, determining a target entity from the granularity entities based on the value scores, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity; and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and performing named entity recognition processing on the first text information to be identified to obtain a named entity recognition result. With this method, professional named entities in a professional field can be recognized efficiently and accurately.

Description

Named entity recognition method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a named entity recognition method and apparatus.
Background
Named entity recognition (Named Entity Recognition, NER for short) is a key task in natural language processing. Its main purpose is to recognize and classify named entities with specific meanings in text, and it provides technical support for practical applications such as information extraction, question-answering systems, and knowledge graph construction. Both academia and industry therefore urgently need efficient and accurate NER technology.
Some scholars have already tried to use large language models (Large Language Models, abbreviated as LLMs) to assist with named entity recognition tasks. However, these models mainly target general named entities, such as the names of people, places, and organizations, rather than non-general expert knowledge in specific fields (such as astronomy). As a result, existing named entity recognition technology cannot meet professionals' need for expert knowledge in specific fields.
At present, no effective solution has been proposed for efficiently and accurately recognizing professional named entities in a professional field.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a named entity recognition method and device for solving the above technical problems.
In a first aspect, the present application provides a named entity recognition method. The method comprises the following steps:
acquiring an expertise database, wherein the expertise database comprises at least two granularity entities;
determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
In one embodiment, the expertise database includes at least one data type, and each data type corresponds to at least two granularity entities; determining the value scores in one-to-one correspondence with the granularity entities comprises:
obtaining a fully trained scoring model, wherein the scoring model is trained on initial examples corresponding to a data type in the expertise database and on initial scores corresponding to the initial examples, and scoring models correspond one-to-one with data types;
and evaluating the granularity entities in the data type based on the scoring model to obtain value scores in one-to-one correspondence with the granularity entities.
In one embodiment, obtaining the initial score includes:
a calculating step: obtaining an initial instruction template based on the initial example and a preset instruction template, and obtaining, based on a preset detection text, initial accuracy rates in one-to-one correspondence with the initial examples;
determining at least one group of current data combinations based on a current initial example among the initial examples, and determining, from the initial accuracy rates, the current accuracy rates corresponding to the current data combinations;
a scoring step: obtaining a current initial score for the current initial example from the average of all the current accuracy rates;
determining at least one group of next data combinations based on the next initial example among the initial examples, and repeating the calculating step and the scoring step until all the initial examples have been traversed, to obtain initial scores in one-to-one correspondence with the initial examples.
In one embodiment, after obtaining at least one prompt instruction template, the method further comprises:
acquiring a preset detection text;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
based on the initial recognition result and the prompt instruction template, carrying out matching calculation to obtain a template accuracy result corresponding to the prompt instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
In one embodiment, after obtaining the named entity recognition result, the method further includes:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
In one embodiment, after obtaining the fuzzy entity identification result, the method further includes:
acquiring a potential verification template, judging the fuzzy entity recognition result based on the potential verification template to obtain a judging recognition result, and correcting the fuzzy entity recognition result based on the judging recognition result to obtain a target fuzzy recognition result;
and obtaining a final entity identification result based on the target fuzzy identification result and the named entity identification result.
In one embodiment, acquiring data to be identified includes:
acquiring initial data to be identified;
segmenting the initial data to be identified to obtain initial text blocks to be identified, and obtaining the text block similarity between the initial text blocks to be identified based on the spatial distance between them;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain data to be identified.
In one embodiment, after obtaining the named entity recognition result, the method further includes:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to preset display equipment for display processing.
In one embodiment, after the knowledge graph is sent to a preset display device for display processing, the method further includes:
acquiring a database modification instruction aiming at a professional knowledge database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
In a second aspect, the application further provides an entity identification device oriented to the professional field. The device comprises:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
With the above named entity recognition method and device, on the one hand, the prompt instruction template built from the target entity in the expertise database and the preset instruction template allows named entity recognition results specific to the professional field to be extracted more accurately from the text to be identified; on the other hand, because the target entity in the prompt instruction template is selected from the expertise database based on its value score, it serves as a better example for the preset instruction template, so that the data to be identified is recognized more accurately and efficiently.
Drawings
FIG. 1 is a diagram of an application environment for a named entity recognition method in one embodiment;
FIG. 2 is a flow diagram of a named entity recognition method in one embodiment;
FIG. 3 is a schematic diagram of a knowledge graph for the astronomical field in one embodiment;
FIG. 4 is an optimization diagram based on database modification instructions in another embodiment;
FIG. 5 is a flow chart of a named entity recognition method in accordance with a preferred embodiment;
FIG. 6 is a block diagram of a named entity recognition device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The named entity recognition method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. First, an expertise database comprising granularity entities is acquired. Second, a value score corresponding to each granularity entity is determined, a target entity is determined from the granularity entities based on the value scores, and a prompt instruction template is obtained according to the target entity and a preset instruction template. Finally, first text information to be identified is generated from the prompt instruction template and the data to be identified, and named entity recognition processing is performed on it to obtain a named entity recognition result. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, or portable wearable device. The Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device, or the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a cluster of multiple servers.
In one embodiment, as shown in fig. 2, a named entity recognition method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
Step S202, acquiring an expertise database, wherein the expertise database comprises at least two granularity entities.
The expertise database is built by analyzing and labeling a small amount of literature data in the astronomical field with the help of expert knowledge. Specifically, granularity entities are labeled manually by experts to construct a multi-granularity entity annotation data set, and the structural relationships among the entities are defined, forming an expertise database for the professional field rather than the general field. The granularity of a granularity entity reflects the different levels of detail at which entities appear in the database. Taking an expertise database for astronomy as an example, granularity entities may include coarser-grained entities, such as celestial body names and telescope names, as well as finer-grained entities, such as the Sun and the Moon.
Step S204, determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity.
Value scores correspond one-to-one with granularity entities. The value score of a granularity entity may be assigned manually or produced by a fully trained neural network, and it indicates how well the granularity entity matches the preset instruction template. After the value scores of all granularity entities have been determined, the granularity entities with high value scores are selected and combined with the preset instruction template to obtain a prompt instruction template, which is used to obtain the corresponding named entity recognition result for the data to be identified.
Step S206, based on the prompt instruction template and the acquired data to be identified, generating first text information to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
The data to be identified may be a passage of text or a document from the professional field. Nesting the data to be identified in a prompt instruction template yields the first text information to be identified; different prompt instruction templates can therefore extract different named entity information from the same data to be identified. Preferably, a fully trained large language model performs the named entity recognition processing on the first text information to be identified, and the recognition corresponds to the granularity and type of the target entity in the prompt instruction template.
Through steps S202 to S206, and unlike the prior art, which generally refines the format of the obtained named entity recognition result to produce more standardized and specialized output, the application optimizes the target entity in the prompt instruction template so that the text information to be identified is recognized more accurately. Furthermore, because the expertise database contains entities of multiple granularities, practitioners can, in actual applications, obtain multi-granularity named entity recognition results by selecting the target entity and its granularity, while the accuracy of the recognition result is maintained.
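Steps S202 to S206 can be illustrated with a minimal sketch. The template wording, the placeholder names, and the functions `select_target_entities`, `build_prompt_template`, and `build_text_to_identify` are assumptions made for illustration only; the application does not specify them, and the call to a large language model is left out.

```python
from typing import Dict, List

# Hypothetical preset instruction template; wording and placeholders are assumptions.
PRESET_TEMPLATE = (
    "Extract every {entity_type} from the text below.\n"
    "Examples of {entity_type}: {examples}.\n"
    "Text: {text}"
)


def select_target_entities(value_scores: Dict[str, float], top_k: int) -> List[str]:
    """Return the top_k granularity entities with the highest value scores."""
    ranked = sorted(value_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


def build_prompt_template(entity_type: str, targets: List[str]) -> str:
    """Insert the target entities as examples, leaving the {text} slot open."""
    return PRESET_TEMPLATE.format(
        entity_type=entity_type,
        examples=", ".join(targets),
        text="{text}",
    )


def build_text_to_identify(prompt_template: str, data: str) -> str:
    """Nest the data to be identified in the prompt instruction template."""
    return prompt_template.replace("{text}", data)


scores = {"Sun": 0.90, "Moon": 0.70, "M31": 0.95}
targets = select_target_entities(scores, top_k=2)
prompt = build_prompt_template("celestial body name", targets)
first_text = build_text_to_identify(prompt, "We observed NGC 224.")
```

The resulting `first_text` would then be passed to whichever language model performs the recognition.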
In one embodiment, the expertise database includes at least one data type, and each data type corresponds to at least two granularity entities; determining the value scores in one-to-one correspondence with the granularity entities comprises:
obtaining a fully trained scoring model, wherein the scoring model is trained on initial examples corresponding to a data type in the expertise database and on initial scores corresponding to the initial examples, and scoring models correspond one-to-one with data types;
and evaluating the granularity entities in the data type based on the scoring model to obtain value scores in one-to-one correspondence with the granularity entities.
Specifically, the expertise database includes a plurality of data types, and each data type corresponds to a large number of granularity entities. Taking the astronomical field as an example, the data types may include celestial body name, right ascension, declination, and the like. A plurality of initial examples corresponding to the preset instruction template are acquired under one data type, together with the initial score of each initial example; the initial scores may be set manually or obtained based on a preset detection text. The scoring model is trained on the initial examples and their initial scores to obtain a fully trained scoring model, which then evaluates the remaining granularity entities of the corresponding data type to obtain their value scores. In this way, the value scores of all granularity entities in the data type corresponding to an instruction template can be obtained flexibly and quickly with high accuracy. Only a small amount of manual labeling is required, which saves labor cost and computing resources.
In one embodiment, obtaining the initial score includes:
a calculating step: obtaining an initial instruction template based on the initial example and a preset instruction template, and obtaining, based on a preset detection text, initial accuracy rates in one-to-one correspondence with the initial examples;
determining at least one group of current data combinations based on a current initial example among the initial examples, and determining, from the initial accuracy rates, the current accuracy rates corresponding to the current data combinations;
a scoring step: obtaining a current initial score for the current initial example from the average of all the current accuracy rates;
determining at least one group of next data combinations based on the next initial example among the initial examples, and repeating the calculating step and the scoring step until all the initial examples have been traversed, to obtain initial scores in one-to-one correspondence with the initial examples.
Specifically, the expected marginal gain in data value from adding a single initial example to a data combination may serve as the initial score. First, the initial accuracy corresponding to each initial example is determined; it may be set manually, or it may be obtained from the recognition result on the detection text after the detection text is nested in the initial instruction template. After the initial accuracies are obtained, data combinations are formed. For example, once the current initial example and its current initial accuracy are determined, another initial example is added, and an accuracy for the pair is calculated from the accuracy of the other initial example and the current initial accuracy. In the same way, every combination involving the current initial example is enumerated and evaluated once, and the average of the resulting current accuracies is the current initial score of the current initial example. The initial scores of all the initial examples are calculated in the same manner. Averaging over multiple data combinations yields more accurate initial scores, which better reflect how well each initial example matches its initial instruction template.
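The averaging over data combinations can be sketched as follows. The application does not state how the accuracy of a combination is computed, so the sketch takes a caller-supplied `combo_accuracy` function, and the demonstration uses the mean of the individual accuracies as a stand-in. The function names, the `max_extra` limit on combination size, and the sample accuracies are all assumptions.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, FrozenSet, Sequence


def initial_scores(
    examples: Sequence[str],
    combo_accuracy: Callable[[FrozenSet[str]], float],
    max_extra: int = 1,
) -> Dict[str, float]:
    """Score each initial example by the mean accuracy of the combinations containing it."""
    scores: Dict[str, float] = {}
    for current in examples:
        others = [e for e in examples if e != current]
        accuracies = []
        # Enumerate combinations made of the current example plus 0..max_extra others.
        for size in range(max_extra + 1):
            for extra in combinations(others, size):
                accuracies.append(combo_accuracy(frozenset((current, *extra))))
        scores[current] = mean(accuracies)
    return scores


# Assumed per-example accuracies, e.g. measured on a preset detection text.
single_accuracy = {"Sun": 0.9, "Moon": 0.6, "M31": 0.3}


def mean_of_members(combo: FrozenSet[str]) -> float:
    """Stand-in combination accuracy: the mean of the members' accuracies."""
    return mean(single_accuracy[e] for e in combo)


example_scores = initial_scores(list(single_accuracy), mean_of_members)
```

With these assumed inputs, examples with higher individual accuracy receive higher initial scores, while still being tempered by the examples they are combined with.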
In one embodiment, after obtaining at least one prompt instruction template, the method further comprises:
acquiring a preset detection text; the detection text comprises preset labeling information;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
performing matching calculation based on the initial recognition result and the labeling information to obtain a template accuracy result corresponding to the prompting instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
Specifically, after a plurality of prompt instruction templates are obtained, they are tested against a preset, pre-labeled detection text. An initial recognition result is obtained for the detection text with each template and matched against the labeling information in the detection text to obtain a template accuracy result. The labeling information is generally annotated manually and indicates the prompt instruction templates corresponding to the entities in the detection text. If the accuracy of a template is smaller than a preset accuracy threshold, the template is judged not to meet the standard and is deleted; if the accuracy is greater than or equal to the threshold, the template is retained. The accuracy threshold can be set by the user and is generally set to 0.8. In this way, the plurality of prompt instruction templates are screened, and each preset instruction template is tested together with its target examples, yielding more accurate and efficient prompt instruction templates.
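The screening of prompt instruction templates can be sketched as below. The 0.8 threshold comes from the text; everything else is an assumption: accuracy is taken here as the fraction of labeled entities that a template recovers, and `recognize` is a hypothetical stand-in for the model call that applies a template to the detection text.

```python
from typing import Callable, List, Set


def template_accuracy(predicted: Set[str], labeled: Set[str]) -> float:
    """Assumed metric: fraction of labeled entities found by the template."""
    if not labeled:
        return 0.0
    return len(predicted & labeled) / len(labeled)


def filter_templates(
    templates: List[str],
    recognize: Callable[[str, str], Set[str]],
    detection_text: str,
    labeled: Set[str],
    threshold: float = 0.8,
) -> List[str]:
    """Keep only the templates whose accuracy reaches the threshold."""
    kept = []
    for template in templates:
        predicted = recognize(template, detection_text)
        if template_accuracy(predicted, labeled) >= threshold:
            kept.append(template)
    return kept


# Hypothetical recognition results standing in for a language model.
fake_results = {"template_a": {"M31", "NGC 224"}, "template_b": {"M31"}}


def fake_recognize(template: str, text: str) -> Set[str]:
    return fake_results[template]


target_templates = filter_templates(
    ["template_a", "template_b"],
    fake_recognize,
    "M31, also known as NGC 224, is a spiral galaxy.",
    {"M31", "NGC 224"},
)
```

Here `template_b` recovers only half of the labeled entities and is dropped, while `template_a` is retained as a target prompt instruction template.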
In one embodiment, after obtaining the named entity recognition result, the method further includes:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
Specifically, the entity category information contained in the data to be identified is determined from the named entity recognition result. Taking the astronomical field as an example, the entity category information includes, but is not limited to, categories such as celestial body names and the declinations of celestial bodies. After the entity categories contained in the data to be identified are determined, the corresponding regular expressions are determined. The following table is an example of the regular expressions corresponding to different entity types:
Taking the astronomical field as an example, most named entities in the field follow special formats, such as capital English letters plus numbers plus special characters, or numbers plus special characters plus numbers. Fuzzy retrieval is therefore performed over the full text of the data to be identified based on the regular expressions, which supplements the named entity recognition result by mining potential named entities that share the same text pattern. The named entity recognition result and the fuzzy entity recognition result are then combined to obtain the final entity recognition result. This supplements the named entity recognition result and makes the recognition more comprehensive. Moreover, because the regular expressions are determined from the named entity recognition result, unnecessary computation is avoided and retrieval is faster.
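The regular-expression retrieval can be sketched as follows. The patterns below are illustrative assumptions and are not the expressions from the application's table. Combining the two results by set union is also an assumption, since the text only says that the final result is based on both.

```python
import re
from typing import Iterable, Set, Tuple

Entity = Tuple[str, str]  # (entity category, surface text)

# Hypothetical patterns keyed by entity category.
CATEGORY_PATTERNS = {
    "celestial_name": re.compile(r"\b(?:NGC|IC|M)\s?\d+\b"),
    "declination": re.compile(r"[+-]\d{2}:\d{2}:\d{2}(?:\.\d+)?"),
}


def fuzzy_search(text: str, categories: Iterable[str]) -> Set[Entity]:
    """Search the full text with the patterns of the given categories only."""
    found: Set[Entity] = set()
    for category in categories:
        pattern = CATEGORY_PATTERNS.get(category)
        if pattern is None:
            continue
        for match in pattern.finditer(text):
            found.add((category, match.group(0)))
    return found


def final_entities(ner_result: Set[Entity], text: str) -> Set[Entity]:
    """Derive the categories from the NER result, then merge in the fuzzy matches."""
    categories = {category for category, _ in ner_result}
    return ner_result | fuzzy_search(text, categories)


sample_text = "M31 (NGC 224) lies at declination +41:16:09, close to M32."
ner_result = {("celestial_name", "M31")}
final = final_entities(ner_result, sample_text)
```

Because only the categories present in the recognition result are searched, the declination pattern is never run in this example, which mirrors the cost saving described above.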
In one embodiment, after obtaining the fuzzy entity identification result, the method further includes:
acquiring a potential verification template, judging the fuzzy entity recognition result based on the potential verification template to obtain a judging recognition result, and correcting the fuzzy entity recognition result based on the judging recognition result to obtain a target fuzzy recognition result;
and obtaining a final entity identification result based on the target fuzzy identification result and the named entity identification result.
Specifically, the potential verification template is used to help a machine learning model evaluate and improve the extracted result. Taking the astronomical field as an example, the following table is an example of a potential verification template that drives a large language model in one embodiment:
The fuzzy entity mined with the regular expression, as the object of the question, and the data to be identified are embedded into the corresponding potential verification template. The machine learning model then judges the filled-in template and must either confirm that the named entity is correct or replace an incorrect named entity with the correct one; the outcome is the target fuzzy recognition result. Although fuzzy retrieval covers a wider search range, its accuracy is somewhat lower than that of the entity recognition result obtained with the prompt instruction template. Supplementary verification of the fuzzy entities therefore preserves high accuracy while the search range is widened.
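The verification pass can be sketched as below. The template wording is an assumption and is not the template from the application's table, and `judge` is a hypothetical callable standing in for the large language model. The sketch also assumes the model answers "yes", "no", or a corrected entity.

```python
from typing import Callable, List

# Hypothetical verification template; the wording is an assumption.
VERIFY_TEMPLATE = (
    "Text: {text}\n"
    "Question: is '{candidate}' a {category} in the text above? "
    "Answer 'yes', 'no', or give the corrected entity."
)


def verify_fuzzy_entities(
    candidates: List[str],
    category: str,
    text: str,
    judge: Callable[[str], str],
) -> List[str]:
    """Keep confirmed candidates, drop rejected ones, and apply corrections."""
    verified = []
    for candidate in candidates:
        prompt = VERIFY_TEMPLATE.format(text=text, candidate=candidate, category=category)
        answer = judge(prompt).strip()
        if answer.lower() == "yes":
            verified.append(candidate)
        elif answer.lower() != "no":
            # Any other answer is treated as the corrected entity.
            verified.append(answer)
    return verified


def fake_judge(prompt: str) -> str:
    """Stand-in for a language model, with canned answers."""
    if "is 'M32' a" in prompt:
        return "yes"
    if "is 'M1O' a" in prompt:
        return "M10"
    return "no"


target_fuzzy = verify_fuzzy_entities(
    ["M32", "M1O", "M99"], "celestial body name", "M32 and M10 were observed.", fake_judge
)
```

The list `target_fuzzy` plays the role of the target fuzzy recognition result, which is then merged with the named entity recognition result.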
In one embodiment, acquiring data to be identified includes:
acquiring initial data to be identified;
segmenting the initial data to be identified to obtain initial text blocks to be identified, and obtaining the text block similarity between the initial text blocks to be identified based on the spatial distance between them;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain data to be identified.
Specifically, the initial data to be identified is usually a long document. Considering the word limit of the template in practical applications, the initial data to be identified is segmented into a plurality of initial text blocks to be identified for use in the named entity recognition task. The initial text blocks are then searched in advance according to the specific requirements of the recognition task, text blocks that are highly correlated and spatially close are extracted, and their content is summarized or the blocks are spliced together to obtain the data to be identified. This makes the operation simpler and more flexible in practice, and recognizing highly correlated text blocks together further improves the efficiency of named entity recognition.
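The segmentation and splicing can be sketched as follows. The application does not define the spatial distance or the similarity measure, so this sketch assumes a bag-of-words cosine similarity and splices only neighbouring blocks. The function names, block size, and threshold are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def split_blocks(text: str, max_words: int) -> List[str]:
    """Split the text into consecutive blocks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def cosine_similarity(a: str, b: str) -> float:
    """Assumed similarity: cosine between bag-of-words count vectors."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def splice_similar(blocks: List[str], threshold: float) -> List[str]:
    """Splice each block onto the previous one when they are similar enough."""
    merged: List[str] = []
    for block in blocks:
        if merged and cosine_similarity(merged[-1], block) >= threshold:
            merged[-1] = merged[-1] + " " + block
        else:
            merged.append(block)
    return merged


blocks = [
    "the galaxy M31 is bright",
    "the galaxy M31 is large",
    "telescopes need careful calibration",
]
spliced = splice_similar(blocks, threshold=0.5)
```

In this toy input the first two blocks share most of their words and are spliced, while the third stays separate.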
In one embodiment, after obtaining the named entity recognition result, the method further includes:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to preset display equipment for display processing.
Specifically, fig. 3 is a schematic diagram of a knowledge graph in one embodiment, taking the astronomical field as an example. The specific content of the knowledge graph may be set by a relevant technician and, as shown in fig. 3, may include data such as paper titles, celestial body names and keywords. Further, to give the user a better viewing experience in practical application, when the data to be identified is a document in the professional field, the title, authors and author affiliations of the document are extracted from the meta information provided by a literature management website as general academic knowledge entities in the knowledge graph, and these are combined with the named entity recognition result of the professional field to complete the knowledge graph. This improves the readability of the named entity recognition result, fits the user's actual application closely, and adapts to a wider range of application scenarios.
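The combination of general academic entities with recognized named entities can be pictured with the small sketch below. The node and edge representation, the relation names, and the function name are assumptions made for illustration; the application does not specify a graph data model.

```python
def build_knowledge_graph(doc_id, meta, entities):
    """Return (nodes, edges) linking a document to its metadata and named entities.

    meta maps a relation name (e.g. "title") to a list of values;
    entities is a list of dicts with "name" and "category" keys.
    """
    nodes = {doc_id: "document"}
    edges = []
    # General academic knowledge entities taken from the meta information.
    for relation, values in meta.items():
        for value in values:
            nodes[value] = relation
            edges.append((doc_id, "has_" + relation, value))
    # Domain entities taken from the named entity recognition result.
    for entity in entities:
        nodes[entity["name"]] = entity["category"]
        edges.append((doc_id, "mentions", entity["name"]))
    return nodes, edges
```

The resulting node and edge lists could be handed to any graph store or visualization tool for display.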
In one embodiment, after the knowledge graph is sent to a preset display device for display processing, the method further includes:
acquiring a database modification instruction aiming at a professional knowledge database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
Specifically, fig. 4 is a schematic diagram of optimization based on a database modification instruction in one embodiment. The user submits a request to the named entity recognition platform according to the user's own entity recognition needs; the request includes, but is not limited to, the data to be identified and a prompt instruction template. Using the method set forth above, the named entity recognition platform obtains the first text information to be identified from the data to be identified and the prompt instruction template supplied by the user, and performs named entity recognition processing on it to obtain a named entity recognition result. After the user views the corresponding knowledge graph on the display device, the knowledge graph serves as feedback on the recognition result, and the expertise database is modified accordingly so that it is further improved, which realizes the knowledge optimization shown in fig. 4. In this way, a personalized expertise database for each user can be obtained from that user's modifications, which further improves the efficiency and accuracy of entity recognition.
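How a database modification instruction might be applied is sketched below. The instruction format (an `op` field with `add`, `remove` or `rename`) and the dictionary layout of the database are hypothetical; the application does not define them.

```python
import copy


def apply_modification(database, instruction):
    """Return an updated copy of the database after one modification instruction.

    database maps a data type to a list of granularity entities;
    instruction is a dict with "op", "data_type", "entity" and, for a
    rename, "new_entity". This layout is an illustrative assumption.
    """
    updated = copy.deepcopy(database)  # leave the original database untouched
    entities = updated.setdefault(instruction["data_type"], [])
    op = instruction["op"]
    entity = instruction["entity"]
    if op == "add" and entity not in entities:
        entities.append(entity)
    elif op == "remove" and entity in entities:
        entities.remove(entity)
    elif op == "rename" and entity in entities:
        entities[entities.index(entity)] = instruction["new_entity"]
    return updated
```

Returning a copy keeps the original database available, so a per-user target database can be derived without affecting other users.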
The embodiment also provides a specific embodiment of a named entity recognition method, as shown in fig. 5, and fig. 5 is a flow chart of the named entity recognition method in a preferred embodiment.
Step S501, an expertise database is constructed. The expertise database is constructed by analyzing and labeling a small amount of literature data in the professional field with the help of expert knowledge in that field. Specifically, under expert guidance, a small amount of manual labeling is performed on the granularity entities in the database, and the professional platform is used to comb out only the structure, so that a multi-granularity entity labeling result is constructed, the structural relations among the granularity entities are defined, and the expertise database of the professional field is formed.
Step S502, a preset instruction template is obtained. Preferably, the preset instruction template may be a prompt instruction template constructed manually. Alternatively, starting from a small number of manually constructed prompt instruction templates, the template set is input into a large language model, and new prompt instructions are generated using the text generation capability of the large language model so as to optimize and supplement the template set. A specific instruction to the large language model may be: "Please generate a plurality of templates for the task according to the task target and the provided samples." The following table is a schematic representation of a preset instruction template structure driving a large language model in one embodiment:
Step S503, a named entity recognition result is obtained based on the large language model. After a plurality of preset instruction templates are acquired, a plurality of data items are selected as the target examples by means of a value evaluation technique, so that the prompt instruction templates are built from the target examples together with the preset instruction templates. Specifically, for a single preset instruction template, N initial examples related to that template are randomly selected from the corresponding data type in the database, and their value scores are initialized to 0. All possible data combinations of the N initial examples are enumerated; each data combination is used as the examples of a prompt instruction, and its data value is measured by the accuracy of the downstream named entity recognition task. The marginal gain in data value obtained by adding a single data item to a data combination contributes to that item's initial score. For example, assume that the N initial examples are A, B and C, and that each data item and each combination has a corresponding accuracy V. If A alone is selected as the example, its value is V1. When A is combined with B, the marginal gain of A under the {A, B} combination is V12-V2; similarly, the marginal gain of A under the {A, B, C} combination is V123-V23. All combinations involving A are enumerated and calculated in this way, and the average of the resulting values is the current initial score of the current example. The remaining initial examples are handled in the same way, yielding the initial scores described above. A regression model is then trained on the initial examples and their initial scores to predict scores for the remaining data of the corresponding data type in the database, and the several data items with the highest scores are selected as the target examples.
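The marginal-gain averaging described above can be sketched as follows. The `value` callable stands in for the accuracy of the downstream named entity recognition task and is an assumption, as are the function and variable names. The sketch takes the plain average over all combinations, as the text describes; the classical Shapley value would instead weight combinations by their size.

```python
from itertools import combinations


def initial_scores(examples, value):
    """Average marginal gain of each example over all combinations of the others.

    value(frozenset_of_examples) must return the accuracy obtained when that
    combination is used as the prompt examples; value(frozenset()) is the
    accuracy with no examples.
    """
    scores = {}
    for ex in examples:
        others = [e for e in examples if e != ex]
        gains = []
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                without_ex = frozenset(subset)
                with_ex = without_ex | {ex}
                # e.g. V12 - V2 for A under {A, B}, V123 - V23 under {A, B, C}
                gains.append(value(with_ex) - value(without_ex))
        scores[ex] = sum(gains) / len(gains)
    return scores
```

With N initial examples, each score averages 2^(N-1) marginal gains, so N has to stay small; this is consistent with the text's use of a regression model to extend the scores to the rest of the data.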
The plurality of prompt instruction templates are obtained based on the target examples and the preset instruction templates. For the existing literature in the professional field, a unified Langchain text object is constructed, and the data to be identified is sliced and stored in a vector database for use by the named entity recognition task. Then, according to the specific named entity recognition requirements set forth by the user, the relevant templates are extracted from the plurality of prompt instruction templates, several text blocks with higher relevance are extracted and their content is summarized, and the result is embedded into the prompt instruction templates to form text information to be identified that can be used to query a large language model. The text information to be identified is fed into the large language model to obtain a named entity list returned in JSON format, which is the named entity recognition result.
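A minimal sketch of filling a prompt instruction template and parsing the JSON reply is shown below. The template wording, the JSON keys `name` and `category`, and the `ask_model` callable (a stand-in for whatever large language model client is used) are assumptions for illustration, not details from the application.

```python
import json

# Hypothetical prompt instruction template with slots for examples and text.
PROMPT_TEMPLATE = (
    "Task: extract {category} entities.\n"
    "Examples: {examples}\n"
    "Text: {text}\n"
    "Return a JSON list of objects with keys 'name' and 'category'."
)


def build_prompt(category, examples, text):
    """Embed the target examples and the text block into the template."""
    return PROMPT_TEMPLATE.format(
        category=category, examples="; ".join(examples), text=text
    )


def parse_entities(reply):
    """Parse a JSON reply into a list of entity dicts; return [] if malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [d for d in data if isinstance(d, dict) and "name" in d and "category" in d]


def recognize(category, examples, text, ask_model):
    """Query the model through ask_model (prompt string in, reply string out)."""
    return parse_entities(ask_model(build_prompt(category, examples, text)))
```

Parsing defensively matters here because a language model may return text that is not valid JSON, or valid JSON in an unexpected shape.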
On this basis, a corresponding regular expression is obtained according to the entity category information of the named entity recognition result, a fuzzy search is performed over the full text of the data to be identified according to the regular expression, and potential named entities sharing the same text pattern are mined. Each fuzzy entity is then taken as the question object and embedded, together with the data to be identified, into the corresponding potential verification template; the potential verification template is judged by the large language model, which is required to feed back the correct, or corrected, named entity as the named entity recognition result. Further, an exact search is performed on the fed-back named entity recognition result: for named entities under different categories, forward/reverse maximum matching word segmentation and multi-pattern exact string matching are performed using daratch to verify whether each named entity recognition result appears in the data to be identified. If it does, the result is retained; if not, it is deleted, thereby forming the final entity recognition result. The following table is code for an exact search using darmatich:
Step S504, a knowledge graph is constructed. When professional literature is collected, the meta information provided by a literature management website is extracted as general knowledge entities; these are combined with the final entity recognition result obtained above to build an example graph model, which is sent to the preset display device corresponding to the user for display.
Step S505, the expertise database is optimized based on user feedback. The user's satisfaction with and opinions on the knowledge graph are taken as feedback and returned to the definition and structure of the expertise in the database for optimization. Further, those skilled in the art will understand that optimization of the database is not limited to optimization based on the knowledge graph, and the database may be optimized by a skilled person whenever needed.
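The original code listing is not reproduced in this text. The sketch below is therefore not the application's code: it is a hypothetical, dependency-free illustration of verifying entities by forward maximum matching against a dictionary built from the candidate names, and all function names are assumptions.

```python
def forward_max_match(text, dictionary):
    """Scan left to right, taking the longest dictionary entry at each position."""
    max_len = max((len(w) for w in dictionary), default=0)
    found, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                found.append(text[i:i + size])
                i += size  # continue after the matched entity
                break
        else:
            i += 1  # no entry starts here; move one character forward
    return found


def verify_entities(text, candidates):
    """Keep only the candidates whose names are matched in the text."""
    matched = set(forward_max_match(text, {c["name"] for c in candidates}))
    return [c for c in candidates if c["name"] in matched]
```

Because the longest entry wins at each position, a short candidate that only ever occurs inside a longer matched candidate is dropped; a multi-pattern matcher that reports all occurrences would behave differently on such cases.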
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a named entity recognition device for realizing the named entity recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the device for identifying a named entity provided below may refer to the limitation of the method for identifying a named entity hereinabove, and will not be described herein.
In one embodiment, as shown in fig. 6, there is provided a named entity recognition apparatus, including an acquisition module, a computing module and a generation module, wherein:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
Specifically, the acquisition module acquires an expertise database that contains a large number of granularity entities and sends it to the computing module. The computing module obtains the value score corresponding to each granularity entity, takes the granularity entities with high value scores as target entities, and combines the target entities with a preset instruction template to obtain a prompt instruction template. The generation module nests the data to be identified into the prompt instruction template to generate the first text information to be identified, and performs named entity recognition processing on the first text information to be identified by machine learning to obtain a named entity recognition result.
With this device, accurate named entity recognition results are generated in batches from accurately selected target examples, so that little or no human participation is needed. Furthermore, the device is directed at vertical professional fields rather than the general field: it works on the multi-granularity entity examples in the expertise database, takes expert rules as guidance, and automatically learns and recognizes the important relevant named entities in the existing literature, so that a large number of named entity recognition results are obtained quickly and efficiently.
The above named entity recognition means may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, a specialized knowledge database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The expertise database according to embodiments of the present application may include at least one of a relational expertise database and a non-relational expertise database. The non-relational expertise database may include, but is not limited to, a blockchain-based distributed expertise database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A named entity recognition method, the method comprising:
acquiring a professional knowledge database; wherein the expertise database comprises at least two granularity entities;
determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
and generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
2. The method of claim 1, wherein said expertise database includes at least one data type, each of said data types corresponding to at least two of said granularity entities; the determining a value score corresponding to the granular entity includes:
obtaining a scoring model with complete training; the scoring model is obtained through training according to an initial example corresponding to the data type in the expertise database and an initial score corresponding to the initial example, and the scoring model and the data type are in one-to-one correspondence;
and evaluating the granularity entities in the data type based on the scoring model to obtain the value scores corresponding to the granularity entities one by one.
3. The method of claim 2, wherein obtaining the initial score comprises:
a calculating step: obtaining an initial instruction template based on the initial example and the preset instruction template, and obtaining initial accuracy rates corresponding one-to-one to the initial examples based on a preset detection text;
determining at least one set of current data combinations based on a current initial example among the initial examples, and determining a current accuracy rate corresponding to each current data combination from the initial accuracy rates;
a scoring step: obtaining a current initial score for the current initial example according to the average value of all the current accuracy rates;
determining at least one set of next data combinations based on a next initial example among the initial examples, and repeating the calculating step and the scoring step until all the initial examples are traversed, so as to obtain the initial scores corresponding one-to-one to the initial examples.
4. The method of claim 1, wherein after obtaining at least one prompt instruction template, the method further comprises:
acquiring a preset detection text; the detection text comprises preset labeling information;
identifying the detection text based on the prompt instruction template to obtain an initial identification result corresponding to the prompt instruction template;
performing matching calculation based on the initial recognition result and the labeling information to obtain a template accuracy result corresponding to the prompt instruction template;
under the condition that the accuracy rate to be deleted in the template accuracy rate result is detected to be smaller than a preset accuracy rate threshold value, deleting the to-be-deleted instruction template corresponding to the accuracy rate to be deleted, and obtaining a target prompt instruction template based on the rest instruction templates in the prompt instruction templates;
and generating second text information to be identified based on the target prompt instruction template and the acquired data to be identified.
5. The method of claim 1, wherein after obtaining the named entity recognition result, the method further comprises:
determining entity category information corresponding to the data to be identified based on the named entity identification result, and determining a regular expression according to the entity category information;
and searching the data to be identified based on the regular expression to obtain a fuzzy entity identification result, and obtaining a final entity identification result based on the fuzzy entity identification result and the named entity identification result.
6. The method of claim 5, wherein after the fuzzy entity recognition result is obtained, the method further comprises:
acquiring a potential verification template, judging the fuzzy entity identification result based on the potential verification template to obtain a judging identification result, and correcting the fuzzy entity identification result based on the judging identification result to obtain a target fuzzy identification result;
and obtaining the final entity recognition result based on the target fuzzy recognition result and the named entity recognition result.
7. The method of claim 1, wherein the obtaining the data to be identified comprises:
acquiring initial data to be identified;
performing segmentation processing on the initial data to be identified to obtain initial text blocks to be identified, and obtaining text block similarity between the initial text blocks to be identified based on the spatial distance between them;
and performing splicing processing on similar text blocks in the initial text blocks to be identified based on the text block similarity to obtain the data to be identified.
8. The method of claim 1, wherein after obtaining the named entity recognition result, the method further comprises:
and constructing a knowledge graph aiming at the data to be identified based on the named entity identification result, and sending the knowledge graph to a preset display device for display processing.
9. The method of claim 8, wherein after the sending the knowledge-graph to a preset display device for display processing, the method further comprises:
acquiring a database modification instruction aiming at the expertise database based on the knowledge graph;
and updating the expertise database based on the database modification instruction to obtain a target expertise database aiming at the expertise field.
10. An entity recognition device oriented to a professional field, the device comprising:
the acquisition module is used for acquiring the expertise database; wherein the expertise database comprises at least two granularity entities;
the computing module is used for determining a value score corresponding to the granularity entity, determining a target entity from the granularity entity based on the value score, and obtaining a prompt instruction template according to the target entity and a preset instruction template corresponding to the target entity;
the generation module is used for generating first text information to be identified based on the prompt instruction template and the acquired data to be identified, and carrying out named entity identification processing on the first text information to be identified to obtain a named entity identification result.
CN202311332338.2A 2023-10-16 2023-10-16 Named entity recognition method and device Active CN117077679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311332338.2A CN117077679B (en) 2023-10-16 2023-10-16 Named entity recognition method and device

Publications (2)

Publication Number Publication Date
CN117077679A true CN117077679A (en) 2023-11-17
CN117077679B CN117077679B (en) 2024-03-12

Family

ID=88708380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311332338.2A Active CN117077679B (en) 2023-10-16 2023-10-16 Named entity recognition method and device

Country Status (1)

Country Link
CN (1) CN117077679B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725192A (en) * 2024-02-18 2024-03-19 张家港快工品科技有限公司 Special industrial information interaction calling method based on langchain

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704633A (en) * 2019-09-04 2020-01-17 平安科技(深圳)有限公司 Named entity recognition method and device, computer equipment and storage medium
CN113449113A (en) * 2020-03-27 2021-09-28 京东数字科技控股有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
WO2022048210A1 (en) * 2020-09-03 2022-03-10 平安科技(深圳)有限公司 Named entity recognition method and apparatus, and electronic device and readable storage medium
CN114186013A (en) * 2021-12-15 2022-03-15 广州华多网络科技有限公司 Entity recognition model hot updating method and device, equipment, medium and product thereof
CN115409111A (en) * 2022-08-31 2022-11-29 中国工商银行股份有限公司 Training method of named entity recognition model and named entity recognition method
WO2022252378A1 (en) * 2021-05-31 2022-12-08 平安科技(深圳)有限公司 Method and apparatus for generating medical named entity recognition model, and computer device
CN116484867A (en) * 2023-04-19 2023-07-25 平安科技(深圳)有限公司 Named entity recognition method and device, storage medium and computer equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725192A (en) * 2024-02-18 2024-03-19 张家港快工品科技有限公司 Special industrial information interaction calling method based on langchain
CN117725192B (en) * 2024-02-18 2024-05-14 张家港快工品科技有限公司 Langchain-based proprietary industrial information interaction calling method

Also Published As

Publication number Publication date
CN117077679B (en) 2024-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant