CN110738055A - Text entity identification method, text entity identification equipment and storage medium - Google Patents

Info

Publication number
CN110738055A
Authority
CN
China
Prior art keywords
entity
target mechanism
target
text
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911013316.3A
Other languages
Chinese (zh)
Inventor
邸凡祎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201911013316.3A priority Critical patent/CN110738055A/en
Publication of CN110738055A publication Critical patent/CN110738055A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiments of the disclosure provide a text entity identification method, device and storage medium. The method comprises: obtaining a text to be processed; identifying institution entity full names in the text to be processed, and identifying target institution entity full names among them according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities; identifying target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities; performing entity identification on the text to be processed according to a pre-trained identification model, to obtain third-category target institution entities; and merging the target institution entities of all categories and outputting the merged result as the target institution entities contained in the text to be processed.

Description

Text entity identification method, text entity identification equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computer and network communication, in particular to an entity identification method, device and storage medium for texts.
Background
Named entity recognition is a basic task in natural language processing and an important preprocessing step for tasks such as syntactic analysis, machine translation and information extraction. Generally speaking, the task of named entity recognition is to recognize standard named entities appearing in a text to be processed, such as person names, organization names, place names, times, dates, currencies and percentages. Among these, organization names are a type of named entity that is difficult to recognize.
In the prior art, rule-based methods, statistical methods, deep learning models and the like are usually adopted to identify organization name entities in text. Because organization names are of many types, have no fixed length, and often appear as abbreviations, they are difficult to identify; when certain specific kinds of organization names must be identified, accuracy and recall are especially poor.
Disclosure of Invention
The embodiments of the disclosure provide a text entity identification method, device and storage medium, so as to improve the accuracy and recall of target institution entity identification in a text to be processed.
In a first aspect, an embodiment of the present disclosure provides a text entity recognition method, including:
acquiring a text to be processed;
recognizing institution entity full names in the text to be processed, and recognizing target institution entity full names among the institution entity full names according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities;
recognizing target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities;
performing entity recognition on the text to be processed according to a pre-trained recognition model, to obtain third-category target institution entities;
and merging the target institution entities of all categories as the target institution entities contained in the text to be processed, and outputting them.
In a second aspect, an embodiment of the present disclosure provides a text entity recognition apparatus, including:
an input module, configured to acquire a text to be processed;
a first recognition module, configured to recognize institution entity full names in the text to be processed, and recognize target institution entity full names among the institution entity full names according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities;
a second recognition module, configured to recognize target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities;
a third recognition module, configured to perform entity recognition on the text to be processed according to a pre-trained recognition model, to obtain third-category target institution entities;
and an output module, configured to merge the target institution entities of all categories as the target institution entities contained in the text to be processed, and output them.
In a third aspect, an embodiment of the present disclosure provides an electronic device including at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the text entity recognition method described in the first aspect and its various possible designs.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having computer-executable instructions stored therein that, when executed by a processor, implement the text entity recognition method described in the first aspect and its various possible designs.
According to the text entity identification method, device and storage medium provided by the embodiments of the disclosure, a text to be processed is obtained; institution entity full names in the text are identified, and target institution entity full names among them are identified according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities; target institution entity abbreviations in the text are identified according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities; entity identification is performed on the text according to a pre-trained identification model, to obtain third-category target institution entities; and the target institution entities of all categories are merged and output as the target institution entities contained in the text to be processed. Both the full names and the abbreviations of target institution entities are thereby covered.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of an entity recognition method for text according to an embodiment of the present disclosure;
fig. 2 is an application scenario diagram of an entity identification method for a text according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of a text entity identification method according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a text entity identification method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a text entity identification method according to a fourth embodiment of the present disclosure;
fig. 6 is a block diagram of a text entity recognition apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a second structure of an entity recognition apparatus for text according to an embodiment of the present disclosure;
fig. 8 is a schematic hardware structure diagram of an entity identification device for text according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are some, but not all, of the embodiments of the present disclosure.
The embodiments of the present disclosure are directed to identifying target institution entities contained in a text to be processed. An entity in the embodiments of the present disclosure refers to a named entity, that is, a person name, an institution name, a place name, or any other entity identified by a name. The embodiments mainly concern the identification of entities identified by institution names in text, and in particular certain specific institutions, including but not limited to companies, schools, hospitals, television stations, stock exchanges, international organizations, and the like. In addition, the text to be processed in the embodiments of the present disclosure may be any text.
In one embodiment of the present disclosure, the text to be processed may be a resume text, and the target institution entities to be identified in the resume text may be company entities and/or school entities; that is, information on the companies and/or schools associated with the person is identified from the resume text. After the company entities and/or school entities are identified, the resume text may further be labeled, for example for classifying the resume text.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text entity identification method provided by an embodiment of the present disclosure, where the method of the embodiment of the present disclosure may be applied in a terminal device or a server, and the text entity identification method includes:
S101: acquiring a text to be processed.
The text to be processed may be input through an input device into the terminal device or server that executes the method. Alternatively, for example, as shown in fig. 2, the terminal device 202 sends the text to be processed to the server 201, the server 201 performs the subsequent entity identification processing, and the target institution entities contained in the text are returned to the terminal device 202. The terminal device 202 includes, but is not limited to, a notebook computer, a smart phone, a tablet computer, a personal digital assistant, and the like.
S102: identifying institution entity full names in the text to be processed, and identifying target institution entity full names among the institution entity full names according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities.
In the embodiment of the disclosure, the target institution entity full names in the text to be processed may be identified first. Specifically, all institution entity full names in the text are identified, which include both target and non-target institution entity full names. Target institution entity full names generally share the same or similar suffixes: for a company entity the suffix is generally "company" or "group", and for a school entity it is generally "university", "college", and the like. A preset target institution entity suffix dictionary (for example, a company entity suffix dictionary, a school entity suffix dictionary, and the like) may therefore be obtained in advance. Further, after the institution entity full names in the text are identified, the target institution entity full names may be selected from them according to the preset suffix dictionary: if an institution entity full name ends with a suffix in the target institution entity suffix dictionary, it may be considered a target institution entity full name.
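The suffix-based selection described for S102 can be illustrated with a minimal sketch. The function name `filter_by_suffix`, the sample names and the sample suffixes are hypothetical and are not taken from the disclosure; the list of recognized full names is assumed to come from an upstream recognizer.

```python
def filter_by_suffix(full_names, suffix_dict):
    """Keep only the full names that end with a suffix in the suffix dictionary."""
    suffixes = tuple(suffix_dict)  # str.endswith accepts a tuple of candidates
    return [name for name in full_names if name.endswith(suffixes)]


# Hypothetical institution full names returned by an upstream recognizer.
recognized_full_names = [
    "Beijing ByteDance Technology Co., Ltd.",
    "Tsinghua University",
    "Haidian District Court",
]
# Hypothetical company/school suffix dictionary.
target_suffixes = {"Co., Ltd.", "Group", "University", "College"}

first_category = filter_by_suffix(recognized_full_names, target_suffixes)
```

In this sketch the court is dropped because it does not end with any suffix in the dictionary, while the company and the school are kept.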
In one embodiment of the present disclosure, the institution entity full names in the text to be processed are identified by the Language Technology Platform (LTP).
Specifically, LTP may perform processes such as sentence splitting, word segmentation, part-of-speech (POS) tagging, dependency parsing, and named entity recognition (NER) on the text to be processed, thereby identifying the entities in the text. The specific process is not described herein again.
The dictionary referred to in the above embodiment may be acquired by the following procedure:
performing statistical processing on the target institution entities contained in a first database, to obtain a target institution entity dictionary and the target institution entity suffix dictionary.
In the embodiment of the present disclosure, the target institution entities contained in the first database may be counted to construct a target institution entity dictionary, and suffixes may be extracted from those entities to construct a target institution entity suffix dictionary. The first database may be a database storing existing texts and is used to gather statistics on common target institution entity names. For example, in a scenario of identifying company and/or school entities in resume texts, the first database may be a company's talent database (in which resume texts of the company's personnel may be stored).
Further, all target institution entities in the first database are counted, the target institution entities whose frequency is higher than a predetermined threshold are obtained, and the target institution entity dictionary and the target institution entity suffix dictionary are constructed from those entities.
In the embodiment of the present disclosure, because target institution entities follow a long-tailed distribution in the first database, the statistical processing may acquire the high-frequency target institution entities to construct the target institution entity dictionary, and then acquire the high-frequency target institution entity suffixes to construct the target institution entity suffix dictionary. This satisfies the identification requirement for high-frequency target institution entities.
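A minimal sketch of the frequency-threshold construction follows. The disclosure does not specify how suffixes are extracted, so this sketch assumes, purely for illustration, that the suffix of a name is its last whitespace-separated token; the function name, thresholds and sample mentions are likewise hypothetical.

```python
from collections import Counter


def build_dictionaries(mentions, entity_threshold, suffix_threshold):
    """Build an entity dictionary and a suffix dictionary from entity mentions.

    Assumption: the suffix of a name is taken to be its last
    whitespace-separated token, which is only an illustrative choice.
    """
    # Keep entities whose frequency is higher than the threshold.
    entity_counts = Counter(mentions)
    entity_dict = {n for n, c in entity_counts.items() if c > entity_threshold}
    # Count suffixes over the high-frequency entities and keep the frequent ones.
    suffix_counts = Counter(name.split()[-1] for name in entity_dict)
    suffix_dict = {s for s, c in suffix_counts.items() if c > suffix_threshold}
    return entity_dict, suffix_dict


# Hypothetical mentions gathered from a database of existing texts.
mentions = (
    ["Tsinghua University"] * 5
    + ["Peking University"] * 4
    + ["ByteDance Group"] * 3
    + ["Acme Workshop"]
)
entity_dict, suffix_dict = build_dictionaries(
    mentions, entity_threshold=2, suffix_threshold=1
)
```

With these sample thresholds the single-mention entity falls in the long tail and is excluded, and only the suffix shared by more than one high-frequency entity is kept.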
Of course, in other embodiments of the present disclosure, other methods may be adopted to obtain the target institution entity suffix dictionary, which is not described herein again.
S103: identifying target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities.
In the embodiment of the present disclosure, target institution entity abbreviations in the text to be processed can be identified through a preset target institution entity abbreviation dictionary; that is, the text is checked for occurrences of the abbreviations contained in the dictionary.
In one embodiment of the present disclosure, an AC automaton (Aho-Corasick automaton), a well-known multi-pattern matching algorithm, may be used as the predetermined matching algorithm. According to the target institution entity abbreviation dictionary, the AC automaton performs pattern prefix tree (trie) matching on the text to be processed, so as to identify the target institution entity abbreviations contained in the text.
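For illustration, the following is a compact, self-contained sketch of Aho-Corasick multi-pattern matching over an abbreviation dictionary. It is a textbook-style construction written for this example, not the implementation used in the disclosure; the function names and sample data are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit the patterns that end at the failure state.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every dictionary hit in the text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


# Hypothetical abbreviation dictionary and text.
abbreviation_dict = ["ByteDance", "SenseTime", "Tsinghua"]
automaton = build_automaton(abbreviation_dict)
second_category = find_matches("ByteDance and SenseTime hire Tsinghua graduates", automaton)
```

The text is scanned once regardless of how many abbreviations the dictionary holds, which is the usual reason for preferring this algorithm over matching each abbreviation separately.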
The abbreviation dictionary referred to in the above embodiment may be acquired by the following procedure:
according to the target institution entity suffix dictionary and/or a preset geographic prefix dictionary, extracting the corresponding target institution entity abbreviation from each target institution entity full name in the target institution entity dictionary, to obtain the target institution entity abbreviation dictionary.
In the embodiment of the disclosure, for any target institution entity full name in the target institution entity dictionary, the target institution entity suffix and the geographic prefix can be removed to form the target institution entity abbreviation. For example, for "Beijing ByteDance Technology Co., Ltd.", "Beijing" can be determined to be a geographic prefix according to the preset geographic prefix dictionary, and "Technology Co., Ltd." can be determined to be a target institution entity suffix according to the target institution entity suffix dictionary; removing the prefix and the suffix yields "ByteDance", which is used as the abbreviation of "Beijing ByteDance Technology Co., Ltd.".
Of course, in other embodiments of the present disclosure, other methods may be used to obtain the target institution entity abbreviation dictionary, which are not described herein again.
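The prefix and suffix removal can be sketched as follows. The function name and the sample dictionaries are hypothetical, and trying the longest prefix or suffix first is a design choice of this sketch rather than something the disclosure specifies.

```python
def derive_abbreviation(full_name, geo_prefixes, suffixes):
    """Strip one geographic prefix and one institution suffix from a full name."""
    name = full_name
    # Try longer prefixes first so the most specific one wins.
    for prefix in sorted(geo_prefixes, key=len, reverse=True):
        if prefix and name.startswith(prefix):
            name = name[len(prefix):]
            break
    # Try longer suffixes first, e.g. "Technology Co., Ltd." before "Co., Ltd.".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    return name.strip()


# Hypothetical geographic prefix dictionary and suffix dictionary.
geo_prefixes = {"Beijing", "Shanghai"}
suffixes = {"Co., Ltd.", "Technology Co., Ltd.", "University"}

abbreviation = derive_abbreviation(
    "Beijing ByteDance Technology Co., Ltd.", geo_prefixes, suffixes
)
```

Applying the function to every full name in the target institution entity dictionary and collecting the results would give an abbreviation dictionary in the sense described above.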
S104: performing entity recognition on the text to be processed according to a pre-trained recognition model, to obtain third-category target institution entities.
In the embodiment of the disclosure, the foregoing steps can recognize most target institution entities in the text to be processed. To further improve accuracy and recall, the text to be processed is input into a pre-trained recognition model, which performs further entity recognition so as to recognize target institution entities in the text that are uncommon and contain no suffix word.
S105: merging the target institution entities of all categories as the target institution entities contained in the text to be processed, and outputting them.
In the embodiment of the present disclosure, the target institution entities of the various categories obtained in the above steps are merged and output as the target institution entities contained in the text to be processed.
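The merging in S105 can be sketched as an order-preserving union of the three categories. The disclosure does not state how duplicates are handled, so de-duplication here is an assumption of this sketch; the function name and sample values are hypothetical.

```python
def merge_entities(*categories):
    """Merge several lists of entities, dropping duplicates but keeping order."""
    merged, seen = [], set()
    for category in categories:
        for entity in category:
            if entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged


# Hypothetical outputs of the three recognition steps.
first_category = ["Beijing ByteDance Technology Co., Ltd.", "Tsinghua University"]
second_category = ["ByteDance", "SenseTime"]
third_category = ["SenseTime", "iHandy"]

target_entities = merge_entities(first_category, second_category, third_category)
```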
For example, in a scenario of identifying a company and/or school entity in the resume text, the resume text may be labeled according to the output company entity and/or school entity, so as to classify the resume text.
In the text entity identification method provided by the embodiment of the disclosure, a text to be processed is acquired; institution entity full names in the text are identified, and target institution entity full names among them are identified according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities; target institution entity abbreviations in the text are identified according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities; entity identification is performed on the text according to a pre-trained identification model, to obtain third-category target institution entities; and the target institution entities of all categories are merged and output as the target institution entities contained in the text to be processed. The method can thus identify both the full names and the abbreviations of target institution entities in the text to be processed.
Referring to fig. 3, fig. 3 is a schematic flowchart of a text entity identification method according to an embodiment of the present disclosure. On the basis of the above embodiments, the entity identification method of the text of the embodiments of the present disclosure includes:
S301: acquiring a text to be processed;
S302: identifying institution entity full names in the text to be processed, and identifying target institution entity full names among the institution entity full names according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities;
S303: identifying target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities;
S304: performing entity recognition on the text to be processed according to a pre-trained recognition model, to obtain third-category target institution entities;
S305: verifying each target institution entity among the target institution entities of all categories against a preset institution entity database, and if the target institution entity is contained in the institution entity database, determining it to be a target institution entity contained in the text to be processed;
S306: outputting the target institution entities contained in the text to be processed.
In the embodiment of the present disclosure, for S301 to S304 reference may be made to the above embodiment, which is not described herein again. After S301 to S304, the target institution entities of all categories may be verified. The verification is based on a preset institution entity database, in which a large number of institution entities (including full names and abbreviations) may be maintained. If a target institution entity is contained in the institution entity database, the verification is deemed successful and the entity is determined to be a target institution entity contained in the text to be processed. Through the verification, words that are not target institution entities, for example ordinary words that merely end with a target institution entity suffix, can be further screened out, so as to further improve accuracy and recall.
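The verification in S305 amounts to a membership check against the institution entity database. A minimal sketch, assuming the database can be represented as an in-memory set of names (the function name and sample data are hypothetical):

```python
def verify_entities(candidates, institution_db):
    """Keep only the candidates that appear in the institution entity database."""
    return [entity for entity in candidates if entity in institution_db]


# Hypothetical merged candidates; the last one merely ends with a frequent suffix.
candidates = ["ByteDance", "Tsinghua University", "unicorn enterprise"]
# Hypothetical institution entity database holding full names and abbreviations.
institution_db = {"ByteDance", "SenseTime", "Tsinghua University"}

verified = verify_entities(candidates, institution_db)
```

Using a set keeps each lookup at roughly constant cost; a real deployment might query an external store instead, which this sketch does not attempt to model.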
On the basis of any embodiment described above, the embodiment of the present disclosure is illustrated with a paragraph of actual text to be processed, which reads as follows:
"Beijing ByteDance Technology Co., Ltd., established in March 2012, is located in Haidian District, Beijing. Like SenseTime, 4Paradigm and iHandy, it is a unicorn enterprise. ByteDance began its global expansion in 2015, and taking its technology overseas is the core strategy of ByteDance's global development. ByteDance has recruited many high-tech talents from institutions such as Tsinghua University, Sian University, the Chinese Academy of Sciences, and so on."
Through the entity recognition processing of the text of the above embodiment, the results shown in the following table can be obtained:
[Table: entity recognition results for the sample text. The table is provided as an image in the original publication and is not reproduced here.]
It should be noted that, in the process of identifying company full names, "unicorn enterprise" ends with a high-frequency suffix and is mistakenly identified as a company full name. The verification process determines that "unicorn enterprise" does not belong to the institution entity database, so it can be removed, further improving the accuracy of the identification result.
In the embodiment of the disclosure, using LTP plus the suffix dictionary, the recognition accuracy and recall can be raised to more than 90% for schools and to about 60% for companies. Further using the AC automaton plus the abbreviation dictionary raises the accuracy and recall for companies to more than 90%, reaching 99% for high-frequency companies. After further recognition with the deep learning model, the accuracy of the result is 85.95% and the recall is 91%. Finally, checking by the verification module raises the accuracy to 93% while leaving the recall unchanged. The final overall recognition result thus has an accuracy of 93% and a recall of 91%, the recall for high-frequency school and company entities is 99%, and the time efficiency is 0.02 s.
In one embodiment of the present disclosure, a training method for the recognition model is further provided. As shown in fig. 4, the specific process is as follows:
S401: acquiring training text data to be labeled;
S402: identifying institution entity full names in the training text data to be labeled, and identifying target institution entity full names among them according to the target institution entity suffix dictionary;
S403: identifying target institution entity abbreviations in the training text data to be labeled according to the target institution entity abbreviation dictionary;
S404: labeling the training text data to be labeled according to the target institution entity full names and target institution entity abbreviations identified in it;
S405: training the recognition model on the labeled training text data.
In the embodiment of the present disclosure, training text data to be labeled may be obtained from open-domain text (such as encyclopedias) and other available texts (such as existing resume texts). Entity recognition is then performed on the training text data through S402 and S403, whose processes are similar to those of S102 and S103 in the above embodiment: the institution entity full names in the training text data may be recognized by the Language Technology Platform (LTP) and the target institution entity full names then screened out according to the suffix dictionary, and the target institution entity abbreviations in the training text data may be recognized according to the target institution entity abbreviation dictionary and a predetermined matching algorithm. After the target institution entity full names and abbreviations in the training text data are identified, the training text data is labeled, that is, the target institution entities it contains are marked. The specific labeling manner may be, but is not limited to, BIO labeling (B-begin, I-inside, O-outside). This labeling method avoids manual labeling and saves labor and time costs. After the labeling of the training text data is completed, the recognition model is trained; any existing training process may be adopted, which is not described herein again.
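The automatic BIO labeling can be sketched as follows. This example labels a pre-tokenized sentence from entity spans that are assumed to be non-overlapping and to come from the dictionary-based recognition steps; the function name, tokens and spans are hypothetical.

```python
def bio_label(tokens, entity_spans):
    """Assign B/I/O tags to tokens given (start, end) spans, end exclusive.

    Spans are assumed to be non-overlapping and within range.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens inside the entity
    return tags


# Hypothetical tokenized training sentence and recognized entity spans.
tokens = ["ByteDance", "hires", "Tsinghua", "University", "graduates"]
entity_spans = [(0, 1), (2, 4)]

labels = bio_label(tokens, entity_spans)
```

For Chinese text the same scheme is commonly applied per character instead of per whitespace token; the pairs of tokens and labels would then serve as training data for a sequence labeling model.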
In one embodiment of the present disclosure, as shown in fig. 5, the text entity recognition method further includes:
S501: performing statistical processing on the target institution entities contained in a first database, to obtain a target institution entity dictionary and the target institution entity suffix dictionary;
S502: according to the target institution entity suffix dictionary and/or the preset geographic prefix dictionary, extracting the corresponding target institution entity abbreviation from each target institution entity full name in the target institution entity dictionary, to obtain the target institution entity abbreviation dictionary.
In the embodiment of the present disclosure, the dictionaries involved in the above embodiments can be acquired through S501 and S502; for details, see the above embodiments. It should be noted that if only the target institution entity dictionary and/or the target institution entity suffix dictionary need to be acquired, only S501 may be executed; if the target institution entity dictionary and/or the target institution entity suffix dictionary have already been acquired by other methods, only some of the steps in S501-S502 may be executed.
Fig. 6 is a block diagram of a text entity recognition apparatus provided in an embodiment of the present disclosure, which corresponds to the text entity recognition method in the above embodiments. For convenience of explanation, only the parts related to the embodiment of the present disclosure are shown. Referring to fig. 6, the text entity recognition apparatus 600 includes an input module 601, a first recognition module 602, a second recognition module 603, a third recognition module 604, and an output module 605.
The input module 601 is configured to acquire a text to be processed;
the first recognition module 602 is configured to recognize institution entity full names in the text to be processed, and recognize target institution entity full names among the institution entity full names according to a preset target institution entity suffix dictionary, to obtain first-category target institution entities;
the second recognition module 603 is configured to recognize target institution entity abbreviations in the text to be processed according to a preset target institution entity abbreviation dictionary, to obtain second-category target institution entities;
the third recognition module 604 is configured to perform entity recognition on the text to be processed according to a pre-trained recognition model, to obtain third-category target institution entities;
and the output module 605 is configured to merge the target institution entities of all categories as the target institution entities contained in the text to be processed, and output them.
In embodiments of the present disclosure, as shown in fig. 7, the apparatus 600 further comprises a dictionary acquisition module 607 for:
performing statistical processing according to a target mechanism entity contained in an th database to obtain a target mechanism entity dictionary and the target mechanism entity suffix dictionary;
and according to the target mechanism entity suffix dictionary and/or a preset geographic prefix dictionary, extracting a corresponding target mechanism entity abbreviation from the target mechanism entity full name in the target mechanism entity dictionary to obtain the target mechanism entity abbreviation dictionary.
In embodiments of the disclosure, the dictionary acquisition module 607, when performing statistical processing according to the target institution entity contained in the first database to obtain the target institution entity dictionary and the target institution entity suffix dictionary, is configured to:
counting all target mechanism entities contained in the first database, and acquiring the target mechanism entities with the frequency higher than a preset threshold value in all the target mechanism entities;
constructing the target institution entity dictionary and the target institution entity suffix dictionary based on the target institution entities having the frequency above a predetermined threshold.
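The frequency-based construction described above could be sketched as follows. The disclosure does not say how suffixes are derived from the frequent entities, so this sketch assumes a hypothetical list of candidate suffixes and keeps those that actually end at least one frequent entity; `build_dictionaries` and all sample data are likewise illustrative only.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def build_dictionaries(records: Iterable[str], suffix_candidates: Set[str],
                       threshold: int) -> Tuple[Set[str], Set[str]]:
    """Return (entity dictionary, suffix dictionary) from names seen in the database."""
    counts = Counter(records)
    # Keep entities whose frequency is strictly higher than the threshold.
    entity_dict = {name for name, count in counts.items() if count > threshold}
    # Assumption: a candidate suffix is kept if some frequent entity ends with it.
    suffix_dict = {s for s in suffix_candidates
                   if s and any(name.endswith(s) for name in entity_dict)}
    return entity_dict, suffix_dict


records = ["Acme Co., Ltd."] * 3 + ["Rare Institute"] + ["Foo University"] * 2
entity_dict, suffix_dict = build_dictionaries(
    records, {"Co., Ltd.", "University", "Institute"}, threshold=1
)
```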
In embodiments of the disclosure, the first recognition module 602, when recognizing the organization entity full names in the text to be processed, is configured to:
and identifying the organization entity full name in the text to be processed through a Language Technology Platform (LTP).
In embodiments of the present disclosure, when recognizing the target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary, the second recognition module 603 is configured to:
and identifying the target mechanism entity abbreviation in the text to be processed through a preset matching algorithm according to the target mechanism entity abbreviation dictionary.
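The disclosure leaves the "preset matching algorithm" open. As one assumed choice, the Python sketch below uses forward maximum matching: the text is scanned from left to right and, at each position, the longest dictionary entry that matches is taken. The function name and the sample data are hypothetical.

```python
from typing import Collection, List, Tuple


def match_abbreviations(text: str, abbr_dict: Collection[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, abbreviation) for each longest match, scanning left to right."""
    if not abbr_dict:
        return []
    max_len = max(len(a) for a in abbr_dict)
    hits = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking until a dictionary entry matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in abbr_dict:
                hits.append((i, i + length, candidate))
                i += length
                break
        else:
            i += 1  # no entry starts at this position
    return hits


hits = match_abbreviations("Acme and AcmeSoft hired Bob", {"Acme", "AcmeSoft"})
```

For large dictionaries, a multi-pattern automaton such as Aho-Corasick would be a natural alternative; the sketch favors brevity over speed.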
In embodiments of the present disclosure, as shown in fig. 7, the apparatus 600 further comprises a training module 608 for:
acquiring training text data to be labeled;
identifying a mechanism entity full name in the training text data to be labeled, and identifying a target mechanism entity full name in the mechanism entity full names in the training text data to be labeled according to the target mechanism entity suffix dictionary;
identifying the target mechanism entity abbreviation in the training text data to be labeled according to the target mechanism entity abbreviation dictionary;
marking the training text data to be marked according to the full name of the target mechanism entity and the short name of the target mechanism entity in the training text data to be marked;
and training the recognition model according to the marked training text data.
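The automatic marking of training text can be pictured with a character-level tagging sketch. The BIO scheme, the `ORG` tag name, and the rule that longer entities are labeled first are assumptions of this example; the disclosure states only that the training text is marked according to the recognized full names and abbreviations.

```python
from typing import Iterable, List


def bio_label(text: str, entities: Iterable[str], tag: str = "ORG") -> List[str]:
    """Assign one BIO tag per character; already-labeled characters are never overwritten."""
    labels = ["O"] * len(text)
    for entity in sorted(set(entities), key=len, reverse=True):
        if not entity:
            continue
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B-" + tag
                for k in range(start + 1, end):
                    labels[k] = "I-" + tag
            start = text.find(entity, start + 1)
    return labels


labels = bio_label("ab XY cd", ["XY"])
```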
In embodiments of the disclosure, the recognition model is a deep learning model consisting of at least a long short-term memory network (LSTM), a recurrent neural network (RNN), and a conditional random field (CRF).
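The disclosure does not describe how the CRF component produces its tag sequence. A common choice for a linear-chain CRF is Viterbi decoding over per-token tag scores, sketched below in plain Python. The scores are assumed to come from an upstream network such as an LSTM; the function name, tag set, and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(emissions: Sequence[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: Sequence[str]) -> List[str]:
    """Return the highest-scoring tag path; missing transitions score 0.0."""
    if not emissions:
        return []
    best = [{t: emissions[0][t] for t in tags}]
    backpointers = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this step.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


TAGS = ["O", "B", "I"]
# A strong penalty discourages an "I" tag directly after "O".
TRANSITIONS = {("O", "I"): -10.0}
```

The transition penalty shows why a CRF layer is often placed on top of a recurrent network: it lets the decoder reject tag sequences that are locally plausible but structurally invalid.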
In embodiments of the present disclosure, as shown in fig. 7, the apparatus 600 further comprises a verification module 606 for:
before outputting the target mechanism entities contained in the text to be processed, checking any one of the various target mechanism entities against a preset mechanism entity database;
and if the target mechanism entity is contained in the mechanism entity database, determining that the target mechanism entity is the target mechanism entity contained in the text to be processed.
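The verification step might be sketched as a simple membership test. The disclosure says what happens when a candidate is found in the database but not what happens otherwise; this sketch assumes unmatched candidates are discarded, and `verify_entities` and the sample data are hypothetical.

```python
from typing import Collection, Iterable, List


def verify_entities(candidates: Iterable[str], known_institutions: Collection[str]) -> List[str]:
    """Keep candidates found in the institution database, preserving their order."""
    return [entity for entity in candidates if entity in known_institutions]


# Hypothetical database holding both full names and abbreviations.
KNOWN = {"Acme", "Acme Network Technology Co., Ltd.", "Foo University"}
verified = verify_entities(["Acme", "Tomorrow Co., Ltd.", "Foo University"], KNOWN)
```

Using a set for the database keeps each lookup at constant expected time, which matters when many candidates are checked per text.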
In embodiments of the present disclosure, the target institution entity includes a corporate entity and/or a school entity.
In embodiments of the present disclosure, the apparatus further comprises a tagging module to:
and marking the text to be processed according to a target mechanism entity contained in the output text to be processed.
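How the text is marked is not specified in the disclosure. One simple possibility, shown below, is to wrap each recognized entity in inline tags. The tag strings, the function name, and the rule that longer entities win when matches overlap are assumptions of this sketch.

```python
from typing import Iterable


def mark_text(text: str, entities: Iterable[str],
              open_tag: str = "[ORG]", close_tag: str = "[/ORG]") -> str:
    """Wrap every non-overlapping occurrence of each entity in inline tags."""
    spans = []
    for entity in sorted(set(entities), key=len, reverse=True):
        if not entity:
            continue
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            # Accept the span only if it does not overlap one already accepted.
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
            start = text.find(entity, start + 1)
    pieces, pos = [], 0
    for s, e in sorted(spans):
        pieces.append(text[pos:s])
        pieces.append(open_tag + text[s:e] + close_tag)
        pos = e
    pieces.append(text[pos:])
    return "".join(pieces)


marked = mark_text("Acme hired staff from Foo University.", ["Acme", "Foo University"])
```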
The apparatus provided in the embodiment of the present disclosure may be used to implement the technical solution of the above method embodiments; its implementation principle and technical effect are similar and are not described again here.
Referring to Fig. 8, a schematic diagram of an electronic device 800 suitable for implementing the embodiments of the present disclosure is shown. The electronic device 800 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (PAD), a portable multimedia player (PMP), and a vehicle terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in Fig. 8 is only one example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 8, the electronic device 800 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the electronic device 800. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
Embodiments of the present disclosure include, for example, a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the method illustrated by the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 809, installed from the storage means 808, or installed from the ROM 802.
More specific examples of a computer readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method of the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should further be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, the first acquiring unit may also be described as a "unit acquiring at least two internet protocol addresses".
For example, without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a text entity recognition method, including:
acquiring a text to be processed;
recognizing the mechanism entity full name in the text to be processed, and recognizing the target mechanism entity full name in the mechanism entity full name according to a preset target mechanism entity suffix dictionary to obtain a first type target mechanism entity;
identifying a target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary to obtain a second type target mechanism entity;
performing entity recognition on the text to be processed according to a pre-trained recognition model to obtain a third type target mechanism entity;
and merging various target mechanism entities to serve as the target mechanism entities contained in the text to be processed, and outputting the target mechanism entities.
In accordance with one or more embodiments of the present disclosure, the method further includes:
performing statistical processing according to a target mechanism entity contained in a first database to obtain a target mechanism entity dictionary and the target mechanism entity suffix dictionary;
and according to the target mechanism entity suffix dictionary and/or a preset geographic prefix dictionary, extracting a corresponding target mechanism entity abbreviation from the target mechanism entity full name in the target mechanism entity dictionary to obtain the target mechanism entity abbreviation dictionary.
According to one or more embodiments of the present disclosure, the performing statistical processing according to the target institution entity contained in the first database to obtain the target institution entity dictionary and the target institution entity suffix dictionary includes:
counting all target mechanism entities contained in the first database, and acquiring the target mechanism entities with the frequency higher than a preset threshold value in all the target mechanism entities;
constructing the target institution entity dictionary and the target institution entity suffix dictionary based on the target institution entities having the frequency above a predetermined threshold.
According to one or more embodiments of the present disclosure, the identifying an organization entity full name in the text to be processed includes:
and identifying the organization entity full name in the text to be processed through a Language Technology Platform (LTP).
According to one or more embodiments of the present disclosure, the recognizing a target institution entity abbreviation in the text to be processed according to a preset target institution entity abbreviation dictionary includes:
and identifying the target mechanism entity abbreviation in the text to be processed through a preset matching algorithm according to the target mechanism entity abbreviation dictionary.
In accordance with one or more embodiments of the present disclosure, the method further includes:
acquiring training text data to be labeled;
identifying a mechanism entity full name in the training text data to be labeled, and identifying a target mechanism entity full name in the mechanism entity full names in the training text data to be labeled according to the target mechanism entity suffix dictionary;
identifying the target mechanism entity abbreviation in the training text data to be labeled according to the target mechanism entity abbreviation dictionary;
marking the training text data to be marked according to the full name of the target mechanism entity and the short name of the target mechanism entity in the training text data to be marked;
and training the recognition model according to the marked training text data.
According to one or more embodiments of the disclosure, the recognition model is a deep learning model that is made up of at least a long short-term memory network (LSTM), a recurrent neural network (RNN), and a conditional random field (CRF).
According to one or more embodiments of the present disclosure, before outputting the target institution entity contained in the text to be processed, the method further includes:
checking any one of the various target mechanism entities against a preset mechanism entity database;
and if the target mechanism entity is contained in the mechanism entity database, determining that the target mechanism entity is the target mechanism entity contained in the text to be processed.
According to one or more embodiments of the present disclosure, the target institution entity includes a corporate entity and/or a school entity.
In accordance with one or more embodiments of the present disclosure, the method further includes:
and marking the text to be processed according to a target mechanism entity contained in the output text to be processed.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a text entity recognition apparatus, including:
the input module is used for acquiring a text to be processed;
a first recognition module, configured to recognize a mechanism entity full name in the text to be processed, and recognize a target mechanism entity full name in the mechanism entity full name according to a preset target mechanism entity suffix dictionary to obtain a first type target mechanism entity;
the second recognition module is used for recognizing the target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary to obtain a second type target mechanism entity;
the third identification module is used for carrying out entity identification on the text to be processed according to a pre-trained identification model to obtain a third type target mechanism entity;
and the output module is used for combining various target mechanism entities, serving as the target mechanism entities contained in the text to be processed and outputting the target mechanism entities.
According to one or more embodiments of the present disclosure, the apparatus further includes a dictionary acquisition module to:
performing statistical processing according to a target mechanism entity contained in a first database to obtain a target mechanism entity dictionary and the target mechanism entity suffix dictionary;
and according to the target mechanism entity suffix dictionary and/or a preset geographic prefix dictionary, extracting a corresponding target mechanism entity abbreviation from the target mechanism entity full name in the target mechanism entity dictionary to obtain the target mechanism entity abbreviation dictionary.
According to one or more embodiments of the disclosure, the dictionary acquisition module, when performing statistical processing according to the target institution entity contained in the first database to obtain the target institution entity dictionary and the target institution entity suffix dictionary, is configured to:
counting all target mechanism entities contained in the first database, and acquiring the target mechanism entities with the frequency higher than a preset threshold value in all the target mechanism entities;
constructing the target institution entity dictionary and the target institution entity suffix dictionary based on the target institution entities having the frequency above a predetermined threshold.
According to one or more embodiments of the disclosure, the first recognition module, when recognizing the organization entity full names in the text to be processed, is configured to:
and identifying the organization entity full name in the text to be processed through a Language Technology Platform (LTP).
According to one or more embodiments of the present disclosure, when recognizing the target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary, the second recognition module is configured to:
and identifying the target mechanism entity abbreviation in the text to be processed through a preset matching algorithm according to the target mechanism entity abbreviation dictionary.
In accordance with one or more embodiments of the present disclosure, the apparatus further includes a training module to:
acquiring training text data to be labeled;
identifying a mechanism entity full name in the training text data to be labeled, and identifying a target mechanism entity full name in the mechanism entity full names in the training text data to be labeled according to the target mechanism entity suffix dictionary;
identifying the target mechanism entity abbreviation in the training text data to be labeled according to the target mechanism entity abbreviation dictionary;
marking the training text data to be marked according to the full name of the target mechanism entity and the short name of the target mechanism entity in the training text data to be marked;
and training the recognition model according to the marked training text data.
According to one or more embodiments of the disclosure, the recognition model is a deep learning model that is made up of at least a long short-term memory network (LSTM), a recurrent neural network (RNN), and a conditional random field (CRF).
According to one or more embodiments of the present disclosure, the apparatus further includes a verification module to:
before outputting the target mechanism entities contained in the text to be processed, checking any one of the various target mechanism entities against a preset mechanism entity database;
and if the target mechanism entity is contained in the mechanism entity database, determining that the target mechanism entity is the target mechanism entity contained in the text to be processed.
According to one or more embodiments of the present disclosure, the target institution entity includes a corporate entity and/or a school entity.
According to one or more embodiments of the present disclosure, the apparatus further includes a tagging module to:
and marking the text to be processed according to a target mechanism entity contained in the output text to be processed.
In a third aspect, in accordance with one or more embodiments of the present disclosure, there is provided an electronic device, including at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the text entity recognition method as described in the first aspect and the various possible designs of the first aspect above.
In a fourth aspect, in accordance with one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, implement the text entity recognition method as described in the first aspect and the various possible designs of the first aspect above.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. One example is a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the disclosure.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (13)

1. A text entity recognition method, comprising:
acquiring a text to be processed;
recognizing the mechanism entity full name in the text to be processed, and recognizing the target mechanism entity full name in the mechanism entity full name according to a preset target mechanism entity suffix dictionary to obtain a first type target mechanism entity;
identifying a target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary to obtain a second type target mechanism entity;
performing entity recognition on the text to be processed according to a pre-trained recognition model to obtain a third type target mechanism entity;
and merging various target mechanism entities to serve as the target mechanism entities contained in the text to be processed, and outputting the target mechanism entities.
2. The method of claim 1, further comprising:
performing statistical processing according to a target mechanism entity contained in a first database to obtain a target mechanism entity dictionary and the target mechanism entity suffix dictionary;
and according to the target mechanism entity suffix dictionary and/or a preset geographic prefix dictionary, extracting a corresponding target mechanism entity abbreviation from the target mechanism entity full name in the target mechanism entity dictionary to obtain the target mechanism entity abbreviation dictionary.
3. The method of claim 2, wherein the performing statistical processing according to the target institution entity contained in the first database to obtain the target institution entity dictionary and the target institution entity suffix dictionary comprises:
counting all target mechanism entities contained in the first database, and acquiring the target mechanism entities with the frequency higher than a preset threshold value in all the target mechanism entities;
constructing the target institution entity dictionary and the target institution entity suffix dictionary based on the target institution entities having the frequency above a predetermined threshold.
4. The method of claim 1, wherein the identifying the organization entity full names in the text to be processed comprises:
and identifying the organization entity full name in the text to be processed through a Language Technology Platform (LTP).
5. The method according to claim 1, wherein the identifying the target institution entity abbreviation in the text to be processed according to a preset target institution entity abbreviation dictionary comprises:
and identifying the target mechanism entity abbreviation in the text to be processed through a preset matching algorithm according to the target mechanism entity abbreviation dictionary.
6. The method of claim 1, further comprising:
acquiring training text data to be labeled;
identifying a mechanism entity full name in the training text data to be labeled, and identifying a target mechanism entity full name in the mechanism entity full names in the training text data to be labeled according to the target mechanism entity suffix dictionary;
identifying the target mechanism entity abbreviation in the training text data to be labeled according to the target mechanism entity abbreviation dictionary;
marking the training text data to be marked according to the full name of the target mechanism entity and the short name of the target mechanism entity in the training text data to be marked;
and training the recognition model according to the marked training text data.
7. The method according to claim 6, wherein the recognition model is a deep learning model consisting of at least a long short term memory network (LSTM), a Recurrent Neural Network (RNN) and a Conditional Random Field (CRF).
8. The method according to claim 1, before outputting the target institution entity contained in the text to be processed, further comprising:
checking any one of the various target mechanism entities against a preset mechanism entity database;
and if the target mechanism entity is contained in the mechanism entity database, determining that the target mechanism entity is the target mechanism entity contained in the text to be processed.
9. The method of , wherein the target institution entity comprises a corporate entity and/or a school entity.
10. The method of any of , further comprising:
and marking the text to be processed according to a target mechanism entity contained in the output text to be processed.
11. A text entity recognition apparatus, comprising:
the input module is used for acquiring a text to be processed;
a first recognition module, configured to recognize a mechanism entity full name in the text to be processed, and recognize a target mechanism entity full name in the mechanism entity full name according to a preset target mechanism entity suffix dictionary to obtain a first type target mechanism entity;
the second recognition module is used for recognizing the target mechanism entity abbreviation in the text to be processed according to a preset target mechanism entity abbreviation dictionary to obtain a second type target mechanism entity;
the third identification module is used for carrying out entity identification on the text to be processed according to a pre-trained identification model to obtain a third type target mechanism entity;
and the output module is used for combining various target mechanism entities, serving as the target mechanism entities contained in the text to be processed and outputting the target mechanism entities.
12. An electronic device, comprising at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the text entity recognition method of any one of claims 1-10.
13. A computer-readable storage medium, wherein the computer-readable storage medium has stored therein computer-executable instructions which, when executed by a processor, implement the text entity recognition method as claimed in any one of claims 1-10.
CN201911013316.3A 2019-10-23 2019-10-23 Text entity identification method, text entity identification equipment and storage medium Pending CN110738055A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911013316.3A CN110738055A (en) 2019-10-23 2019-10-23 Text entity identification method, text entity identification equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110738055A true CN110738055A (en) 2020-01-31

Family

ID=69271037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911013316.3A Pending CN110738055A (en) 2019-10-23 2019-10-23 Text entity identification method, text entity identification equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110738055A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357779A (en) * 2017-06-27 2017-11-17 北京神州泰岳软件股份有限公司 A kind of method and device for obtaining organization names
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN109299458A (en) * 2018-09-12 2019-02-01 广州多益网络股份有限公司 Entity recognition method, device, equipment and storage medium
US20190197176A1 (en) * 2017-12-21 2019-06-27 Microsoft Technology Licensing, Llc Identifying relationships between entities using machine learning

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111651990A (en) * 2020-04-14 2020-09-11 车智互联(北京)科技有限公司 Entity identification method, computing equipment and readable storage medium
CN111651990B (en) * 2020-04-14 2024-03-15 车智互联(北京)科技有限公司 Entity identification method, computing device and readable storage medium
CN111881669A (en) * 2020-06-24 2020-11-03 百度在线网络技术(北京)有限公司 Synonymy text acquisition method and device, electronic equipment and storage medium
US11675978B2 (en) 2021-01-06 2023-06-13 International Business Machines Corporation Entity recognition based on multi-task learning and self-consistent verification
CN113177412A (en) * 2021-04-05 2021-07-27 北京智慧星光信息技术有限公司 Named entity identification method and system based on bert, electronic equipment and storage medium
CN113657100A (en) * 2021-07-20 2021-11-16 北京百度网讯科技有限公司 Entity identification method and device, electronic equipment and storage medium
CN113657100B (en) * 2021-07-20 2023-12-15 北京百度网讯科技有限公司 Entity identification method, entity identification device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination