CN115293158B - Label-assisted disambiguation method and device - Google Patents

Info

Publication number
CN115293158B
Authority
CN
China
Prior art keywords
word
word segmentation
words
entity
disambiguated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210758371.0A
Other languages
Chinese (zh)
Other versions
CN115293158A (en
Inventor
夏煜
龙非池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocking Digital Chongqing Technology Co ltd
Original Assignee
Rocking Digital Chongqing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocking Digital Chongqing Technology Co ltd
Priority to CN202210758371.0A
Publication of CN115293158A
Application granted
Publication of CN115293158B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing and provides a label-assisted disambiguation method and device. The method comprises the following steps: acquiring a document to be processed, and extracting the entity word to be disambiguated and a word segmentation set from the document to be processed by using a word segmentation technology; determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library; calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining the target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. Compared with the prior art, the label-assisted disambiguation method and device provided by the invention yield an accurate disambiguation result, so that the semantics of the entity obtained by the user are unambiguous and the accuracy of the entity information is improved.

Description

Label-assisted disambiguation method and device
Technical Field
The invention relates to the technical field of natural language processing, in particular to a disambiguation method and device based on label assistance.
Background
Natural language processing frequently encounters polysemy, the phenomenon in which one word has several meanings. Polysemy affects applications that require discourse-level understanding, such as machine translation, automatic summarization, question-answering systems, public opinion analysis, machine writing, information retrieval, and text classification. For these applications to be more accurate, or to produce results closer to what people expect, words with multiple senses must be disambiguated.
An entity refers to something that exists objectively and can be distinguished from other things, including specific people and things as well as abstract concepts or relationships; a knowledge base contains entities of various categories. Entity disambiguation (also known as semantic disambiguation) is a technique dedicated to resolving the ambiguity caused by entities that share a name. In real language use, a single entity name often corresponds to several named entity objects.
As a result, the semantics of the entity obtained by the user are ambiguous, and the accuracy of the entity information the user obtains is low.
Disclosure of Invention
The invention aims to provide a disambiguation method and device based on label assistance, which are used for solving the problem that in the prior art, the accuracy of entity information acquired by a user is low due to the fact that the semantics of the acquired entity are ambiguous.
In order to achieve the above object, the technical scheme adopted by the embodiment of the invention is as follows:
In a first aspect, an embodiment of the present invention provides a tag-assisted disambiguation method, which includes: acquiring a document to be processed, and extracting the entity word to be disambiguated and a word segmentation set from the document to be processed by using a word segmentation technology; determining a plurality of vocabulary labels corresponding to the entity word to be disambiguated from a preset entity word label library; calculating the similarity between each of the vocabulary labels and the word segmentation set, and determining the target similarity; and taking the vocabulary label corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated.
In a second aspect, an embodiment of the present invention provides a tag-based auxiliary disambiguation device, including: the word segmentation extraction module is used for obtaining a document to be processed and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by using a word segmentation technology; the vocabulary tag determining module is used for determining a plurality of vocabulary tags corresponding to the entity words to be disambiguated from a preset entity word tag library; the similarity calculation module is used for calculating the similarity between the plurality of vocabulary labels and the word segmentation set respectively and determining target similarity; and the entity word disambiguation module is used for taking the vocabulary label corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
According to the tag-assisted disambiguation method and device provided by the embodiment of the invention, a word segmentation technology is used to extract the entity word to be disambiguated and the word segmentation set from the document to be processed; a plurality of vocabulary tags corresponding to the entity word to be disambiguated are determined from the preset entity word tag library; the similarity between each vocabulary tag and the word segmentation set is calculated and the target similarity is determined; and finally the vocabulary tag corresponding to the target similarity is taken as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, the semantics of the entity obtained by the user are unambiguous, and the accuracy of the entity information is improved.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered as limiting the scope, and that other related drawings can be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 shows a block schematic diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 shows a flow chart of a tag-based assisted disambiguation method provided by an embodiment of the present invention;
FIG. 3 is a sub-step flow chart of step S2 shown in FIG. 2;
FIG. 4 is a sub-step flow chart of step S3 shown in FIG. 2;
FIG. 5 is a sub-step flow chart of step S4 shown in FIG. 2;
Fig. 6 shows a schematic structural diagram of a tag-based disambiguation device according to an embodiment of the present invention;
Reference numerals: 100-electronic device; 101-processor; 102-memory; 103-bus; 104-communication interface; 105-display screen; 200-tag-assisted disambiguation device; 201-interference item processing module; 202-word segmentation extraction module; 203-vocabulary tag determination module; 204-similarity calculation module; 205-entity word disambiguation module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
The disambiguation method based on label assistance provided by the embodiment of the invention is applied to electronic equipment, wherein the electronic equipment can be, but is not limited to, a smart phone, a tablet personal computer, a vehicle-mounted computer, a personal digital assistant (personal digital assistant, PDA) and the like. Referring to fig. 1, fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention, and an electronic device 100 includes a processor 101, a memory 102, a bus 103, a communication interface 104, and a display screen 105. The processor 101, the memory 102, the communication interface 104 and the display screen 105 are connected via a bus 103, the processor 101 being adapted to execute executable modules, such as computer programs, stored in the memory 102.
The processor 101 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the tag-assisted disambiguation method may be performed by integrated hardware logic circuitry in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 102 may comprise high-speed random access memory (Random Access Memory, RAM) and may also include non-volatile memory, such as at least one disk memory. The memory 102 may be, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), etc.
The bus 103 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Only one double-headed arrow is shown in fig. 1, but this does not mean that there is only one bus 103 or only one type of bus 103.
The electronic device 100 establishes a communication connection with an external device through at least one communication interface 104 (which may be wired or wireless). The memory 102 is used to store programs, such as the tag-assisted disambiguation device 200. The tag-assisted disambiguation device 200 includes at least one software function module that may be stored in the memory 102 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 100. Upon receiving an execution instruction, the processor 101 executes the program to implement the tag-assisted disambiguation method.
The display screen 105 is used for display, and the displayed content may be some processing result of the processor 101. The display screen 105 may be a touch display screen, a display screen without interactive functionality, or the like. The display screen 105 may display the engineering information segment, the document to be processed, and the disambiguation result.
It should be understood that the architecture shown in fig. 1 is merely a schematic illustration of an architecture application of the electronic device 100, and that the electronic device 100 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
First embodiment
Referring to fig. 2, fig. 2 shows a flowchart of a tag-assisted disambiguation method according to an embodiment of the present invention. The label-assisted disambiguation method comprises the following steps:
S1, acquiring an engineering information fragment, and removing interference items from the engineering information fragment to obtain a document to be processed.
In the embodiment of the invention, the engineering information fragment may be a web fragment containing web addresses, letters, characters, symbols, numbers, pictures, spaces, and the like. The document to be processed may be the text content of the engineering information fragment. This step can be understood as follows: interference items such as web addresses, letters, symbols, numbers, pictures, and spaces are filtered out of the engineering information fragment, leaving a document to be processed that contains only text characters. The engineering information fragment may be stored in the memory 102 of the electronic device 100, or may be sent by another electronic device and received via the communication interface 104.
The specific code is as follows:
import re
# Read one item from the list and convert it to str
csv_text = str(csv_data_list[i])
# Match runs of digits (\d is equivalent to [0-9]; + matches one or more) and delete them; lower() converts letters to lowercase
csv_text = re.sub(r'\d+', '', csv_text).lower()
# Match miscellaneous characters (letters, digits, punctuation, line breaks) and delete them
csv_text = re.sub(r"[A-Za-z0-9!%,。.+_#?【】'<>=:/&\"\-\r\n]", "", csv_text)
# Match square brackets and replace them with commas
csv_text = re.sub(r'[\[\]]', ',', csv_text)
# Match backslashes and delete them
csv_text = re.sub(r'\\', '', csv_text)
For example:
The acquired engineering information fragment given as input is:
"<!--jrj_final_title_start--><p>may I ask, does apple have related projects in new energy?</p>"
The output document to be processed is:
"may I ask, does apple have related projects in new energy".
Because interference items are removed from the engineering information fragment, the resulting document to be processed contains only text, which reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency.
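The cleaning step can be wrapped in a single function for experimentation. The following is a minimal sketch, not the patented implementation: the function name clean_fragment and the Chinese sample sentence are assumptions introduced for illustration, and the character class is the one from the snippet above. That class does not include the space character, so spaces would need separate handling.

```python
import re

# Character class taken from the snippet above: ASCII letters, digits,
# common punctuation, and line breaks. It does not include the space character.
_NOISE = re.compile(r"[A-Za-z0-9!%,。.+_#?【】'<>=:/&\"\-\r\n]")


def clean_fragment(fragment):
    """Hypothetical helper: strip markup, digits, and punctuation from a fragment."""
    text = re.sub(r"\d+", "", str(fragment)).lower()  # delete digit runs, lowercase
    text = _NOISE.sub("", text)                       # delete letters and symbols
    text = re.sub(r"[\[\]]", ",", text)               # square brackets become commas
    text = re.sub(r"\\", "", text)                    # delete backslashes
    return text


# Illustrative input (assumed sample text, not quoted from the patent).
raw = "<!--jrj_final_title_start--><p>请问苹果有新能源的相关项目吗?</p>"
print(clean_fragment(raw))
```

Because ASCII letters are deleted along with the markup, this cleaning only suits documents whose meaningful text is written in non-Latin characters.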
S2, acquiring a document to be processed, and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by using a word segmentation technology.
In the embodiment of the invention, the entity words to be disambiguated may be entity nouns in the document to be processed that share a name but have different meanings, such as "apple", "millet", "metaverse", "bean paste", "Himalaya", and the like. "Apple" may refer to the apple company as well as the apple fruit; "millet" may refer to the millet company as well as the millet grain; "metaverse" may refer to the metaverse company as well as the virtual digital living space; "bean paste" may refer to the bean paste company, the bean paste seasoning, or the bean paste website; "Himalaya" may refer to the Himalaya company, the Himalaya mountains, or the Himalaya platform. The word segmentation set may be all the word segmentation vocabularies in the document to be processed other than the entity word to be disambiguated. For example, when the document to be processed is "may I ask, does apple have related projects in new energy", the word segmentation vocabularies are "may I ask", "apple", "have", "new energy", "related", "project", and the question particle "ma"; "apple" is determined to be the entity word to be disambiguated, and the word segmentation set is "may I ask", "have", "new energy", "related", "project", "ma".
In the embodiment of the invention, the step of acquiring a document to be processed and extracting the entity word to be disambiguated and the word segmentation set by using a word segmentation technology can be understood as follows. An entity word library is stored in advance and contains a plurality of entity words that share a name but have different meanings. The acquired document to be processed is segmented to obtain a plurality of word segmentation vocabularies, which are compared with the entity words in the pre-stored entity word library. A word segmentation vocabulary that matches a pre-stored entity word is taken as the entity word to be disambiguated, and the remaining word segmentation vocabularies are taken as the word segmentation set.
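The lexicon comparison described above can be sketched as follows. This is an illustrative sketch only: the function name split_by_lexicon, the contents of the lexicon, and the token list are assumptions, and the tokens are taken to be already produced by a word segmenter.

```python
def split_by_lexicon(tokens, homonym_lexicon):
    """Split tokens into entity words to disambiguate and the remaining context.

    tokens: word segmentation vocabularies, assumed already segmented.
    homonym_lexicon: hypothetical pre-stored set of ambiguous entity words.
    """
    entities, context = [], []
    for token in tokens:
        if token in homonym_lexicon:
            entities.append(token)   # matches a pre-stored ambiguous entity word
        else:
            context.append(token)    # becomes part of the word segmentation set
    return entities, context


lexicon = {"apple", "millet"}  # assumed sample entries
tokens = ["may I ask", "apple", "have", "new energy", "related", "project"]
entities, context = split_by_lexicon(tokens, lexicon)
print(entities)
print(context)
```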
Referring to fig. 3, step S2 may further include the following sub-steps:
S21, acquiring a document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by using a HanLP word segmentation algorithm to obtain a plurality of word segmentation vocabularies and the part of speech corresponding to each word segmentation vocabulary.
In the embodiment of the invention, the HanLP word segmentation algorithms include standard word segmentation, NLP word segmentation, index word segmentation, N-shortest-path word segmentation, CRF word segmentation, high-speed dictionary word segmentation, and the like. The part of speech of a word segmentation vocabulary may be an adjective, an adverbial adjective, an adjectival morpheme, an adjectival idiom, a nominal adjective, a distinguishing word, an interjection, a conjunction, a coordinating conjunction, and the like. A part-of-speech correspondence table is pre-stored in the HanLP word segmentation model; Table 1 shows part of the HanLP part-of-speech correspondence table.
TABLE 1
Symbol    Description
a         Adjective
ad        Adverbial adjective
ag        Adjectival morpheme
al        Adjectival idiom
an        Nominal adjective
b         Distinguishing word
begin     Used only for the start marker (始##始)
bg        Distinguishing morpheme
bl        Distinguishing-word idiom
c         Conjunction
cc        Coordinating conjunction
d         Adverb
dg        Adverbs such as 辄, 俱, 复 and the like
dl        Linking phrase (连语)
e         Interjection
end       Used only for the end marker (终##终)
The step of acquiring a document to be processed and performing word segmentation and part-of-speech tagging on it with the HanLP word segmentation algorithm can be understood as follows: the document to be processed is segmented by the word segmentation algorithm to obtain a plurality of word segmentation vocabularies, and the part of speech of each word segmentation vocabulary is then obtained from the HanLP part-of-speech correspondence table.
S22, determining a target part of speech from the plurality of parts of speech, taking the word segmentation vocabulary corresponding to the target part of speech as the entity word to be disambiguated, and taking the remaining word segmentation vocabularies as the word segmentation set.
In the embodiment of the invention, the target part of speech is the one, among the parts of speech of the word segmentation vocabularies, that matches a preset part of speech; the preset part of speech may be a special part of speech that marks entity words sharing a name but having different meanings. The word segmentation vocabulary corresponding to the target part of speech is taken as the entity word to be disambiguated, and the remaining word segmentation vocabularies are taken as the word segmentation set.
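The part-of-speech selection in S22 can be sketched on pre-tagged input. The sketch below makes several assumptions: the tag name "homonym" is hypothetical (the text does not name the special part of speech), the other tags are illustrative, and the (word, tag) pairs stand in for the output of a segmenter with part-of-speech tagging rather than calling any real library.

```python
# Hypothetical tag for ambiguous entity words; the source does not name it.
TARGET_POS = "homonym"


def split_by_pos(tagged_tokens, target_pos=TARGET_POS):
    """Split (word, pos) pairs into entity words and the word segmentation set."""
    entities = [word for word, pos in tagged_tokens if pos == target_pos]
    context = [word for word, pos in tagged_tokens if pos != target_pos]
    return entities, context


# Assumed sample of segmenter output as (word, part-of-speech) pairs.
tagged = [
    ("may I ask", "v"),
    ("apple", "homonym"),
    ("have", "v"),
    ("new energy", "n"),
    ("related", "a"),
    ("project", "n"),
]
entities, context = split_by_pos(tagged)
print(entities)
print(context)
```

In practice the special tag would typically come from a custom dictionary loaded into the segmenter, which is why the lexicon-based and part-of-speech-based variants yield the same split.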
S3, determining a plurality of vocabulary labels corresponding to the entity words to be disambiguated from a preset entity word label library.
In the embodiment of the invention, the preset entity word tag library may store a plurality of vocabulary tags for each of a plurality of entity words that share a name but have different meanings. A vocabulary tag can represent related information, industry information, and the like for such an entity word. The step of determining a plurality of vocabulary tags corresponding to the entity word to be disambiguated from the preset entity word tag library can be understood as comparing the entity word to be disambiguated with the entity words stored in the preset entity word tag library, and obtaining the vocabulary tags of the stored entity word that matches the entity word to be disambiguated.
The preset entity word tag library can further comprise a preset word related library and a preset word classification library, wherein the preset word related library comprises a plurality of first words and a plurality of related words corresponding to each first word, and the preset word classification library comprises a plurality of second words and at least one industry category corresponding to each second word. Referring to fig. 4, step S3 may further include the following sub-steps:
S31, comparing a plurality of first vocabularies in the preset vocabulary related library with the entity word to be disambiguated to obtain a target first vocabulary consistent with the entity word to be disambiguated.
In the embodiment of the invention, a first vocabulary is an entity word in the preset vocabulary related library that shares a name but has different meanings, and the target first vocabulary is the first vocabulary in the preset vocabulary related library that is consistent with the entity word to be disambiguated.
S32, comparing a plurality of second words in the preset word classification library with the entity words to be disambiguated to obtain target second words consistent with the entity words to be disambiguated.
In the embodiment of the invention, the second vocabulary represents the entity words with the same name and different meaning in the preset vocabulary classification library, and the target second vocabulary is the second vocabulary consistent with the entity words to be disambiguated in the preset vocabulary classification library.
S33, taking a plurality of related vocabularies corresponding to the target first vocabularies and at least one industry category corresponding to the target second vocabularies as a plurality of vocabulary labels corresponding to the entity words to be disambiguated.
In the embodiment of the present invention, the related vocabularies represent information related to the first vocabulary. For example, when the first vocabulary is "apple", the related vocabularies may be, but are not limited to, glory, hua, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs ("Qiao Busi"), and the like. An industry category characterizes the industry classification of the second vocabulary; for example, when the second vocabulary is "apple", the industry categories may be the technology industry, the food industry, and so on. It should be noted that the target first vocabulary and the target second vocabulary refer to the same ambiguous entity word. The step of taking the related vocabularies corresponding to the target first vocabulary and the at least one industry category corresponding to the target second vocabulary as the vocabulary tags of the entity word to be disambiguated can be understood as combining the two to obtain the plurality of vocabulary tags corresponding to the entity word to be disambiguated. Preferably, the resulting vocabulary tags are deduplicated: repeated vocabulary tags are deleted so that only one copy of each remains, which reduces repeated data processing later on and improves disambiguation efficiency.
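The tag lookup and deduplication in S31 to S33 can be sketched with two dictionaries. This is an illustrative sketch under stated assumptions: the names related_library, category_library, and collect_tags are hypothetical, and the sample entries are a small subset chosen for demonstration.

```python
def collect_tags(entity, related_library, category_library):
    """Merge related vocabularies and industry categories into one tag list.

    Duplicates are removed while the first-seen order is preserved.
    """
    merged = related_library.get(entity, []) + category_library.get(entity, [])
    return list(dict.fromkeys(merged))  # dict keys keep insertion order


# Hypothetical library contents for illustration.
related_library = {"apple": ["company", "mobile phone", "banana", "company"]}
category_library = {"apple": ["technology industry", "food industry"]}

print(collect_tags("apple", related_library, category_library))
```

Looking the entity up by key in both libraries plays the role of the comparisons in S31 and S32; an entity missing from both libraries yields an empty tag list.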
S4, calculating the similarity between the vocabulary labels and the word segmentation set respectively, and determining the target similarity.
In the embodiment of the invention, the target similarity is the largest of the similarities between the individual vocabulary tags and the word segmentation set.
Referring to fig. 5, step S4 may include the following sub-steps:
S41, calculating the similarity between each vocabulary tag and the word segmentation set to obtain the similarity of each vocabulary tag.
The string comparison function compare is a weighted sum of 0.4 times the cosine similarity, 0.3 times the edit-distance similarity, and 0.3 times the sequence-matching ratio. The specific code is as follows:
import difflib
# cos_sim and edit_similar are helper functions whose definitions are not given in this text
def compare(str1, str2):
    # str1 and str2 are the two strings to compare: str1 corresponds to the word segmentation set, str2 to the vocabulary tags
    if str1 == str2:
        return 1.0
    diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()  # sequence-matching ratio
    cos_result = cos_sim(str1, str2)  # cosine similarity
    edit_result = edit_similar(str1, str2)  # edit-distance similarity
    return cos_result * 0.4 + edit_result * 0.3 + 0.3 * diff_result
The similarity of each vocabulary tag is obtained from the returned result.
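Since the helper functions are not defined in the text, the following self-contained sketch fills them in with common definitions so that the weighted score can be tried out. Both helpers are assumptions: cos_sim is taken to be cosine similarity over character counts, and edit_similar is taken to be one minus the Levenshtein distance divided by the longer length. The patent's actual helpers may differ, so the scores produced here need not match the figures quoted in the example.

```python
import difflib
import math
from collections import Counter


def cos_sim(a, b):
    """Assumed definition: cosine similarity of character-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def edit_similar(a, b):
    """Assumed definition: 1 - Levenshtein distance / length of longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ch_a != ch_b))) # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def compare(str1, str2):
    """Weighted score: 0.4 cosine + 0.3 edit similarity + 0.3 sequence ratio."""
    if str1 == str2:
        return 1.0
    diff_result = difflib.SequenceMatcher(None, str1, str2).ratio()
    return 0.4 * cos_sim(str1, str2) + 0.3 * edit_similar(str1, str2) + 0.3 * diff_result


print(compare("abc", "abd"))
```

All three components lie in the range 0 to 1 and the weights sum to 1, so the combined score also stays in that range.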
S42, comparing all the similarities, and taking the maximum similarity as the target similarity.
For example, suppose the entity word to be disambiguated is "apple", the vocabulary tags are "glory, hua, company, mobile phone, watch, banana, pear, grape, fruit tree, research and development, Jobs ("Qiao Busi"), technology industry, food industry", and the word segmentation set is "may I ask, have, new energy, related, project, ma". The similarity between each vocabulary tag and the word segmentation set is then: "glory" 0.444434, "hua" 0.476431, "company" 0.730766, "mobile phone" 0.286301, "watch" 0.283275, "banana" 0.186331, "pear" 0.156289, "grape" 0.169347, "fruit tree" 0.489634, "research and development" 0.605594, "Jobs" 0.487695, "technology industry" 0.444434, and "food industry" 0.320620. The maximum similarity, i.e., the target similarity, is 0.730766.
S5, taking the vocabulary labels corresponding to the target similarity as a disambiguation result of the entity words to be disambiguated.
In the above example, the vocabulary label corresponding to the target similarity 0.730766 is "company", and then the disambiguation result of the entity word "apple" to be disambiguated is "company", i.e. apple company.
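Steps S42 and S5 amount to taking the tag with the highest score. A minimal sketch follows, using a subset of the similarity values from the example above; the function name pick_label is an assumption.

```python
def pick_label(scores):
    """Return (label, similarity) for the highest-scoring vocabulary tag."""
    best = max(scores, key=scores.get)
    return best, scores[best]


# Subset of the similarity values quoted in the example.
scores = {
    "glory": 0.444434,
    "company": 0.730766,
    "mobile phone": 0.286301,
    "banana": 0.186331,
    "research and development": 0.605594,
    "food industry": 0.320620,
}

label, target_similarity = pick_label(scores)
print(label, target_similarity)
```

If two tags tie for the maximum, max returns the one that appears first in the dictionary; the source does not say how ties should be resolved.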
Compared with the prior art, the embodiment of the invention has the following advantages:
First, interference items are removed from the engineering information fragment, so that the resulting document to be processed contains only text; this reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency.
Second, the entity word to be disambiguated and the word segmentation set are extracted from the document to be processed by using a word segmentation technology; a plurality of vocabulary tags corresponding to the entity word to be disambiguated are determined from a preset entity word tag library; the similarity between each vocabulary tag and the word segmentation set is calculated and the target similarity is determined; and the vocabulary tag corresponding to the target similarity is taken as the disambiguation result of the entity word to be disambiguated. An accurate disambiguation result is thereby obtained, the user is given unambiguous semantics for the extracted entity word, and the accuracy of the entity information is improved.
Second embodiment
Referring to fig. 6, fig. 6 is a block schematic diagram of a tag-based assisted disambiguation apparatus according to an embodiment of the present invention. The tag-based assisted disambiguation apparatus 200 includes an interference item processing module 201, a word segmentation extraction module 202, a vocabulary tag determination module 203, a similarity calculation module 204, and an entity word disambiguation module 205.
The interference item processing module 201 is configured to obtain an engineering information segment, and perform interference item removal processing on the engineering information segment to obtain a document to be processed.
It is understood that the interference item processing module 201 may perform the above step S1.
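As an illustration of what the interference item removal performed by module 201 might look like, the sketch below strips markup-like tags and URLs from a fragment and keeps the text. This passage does not enumerate the interference items, so the choice of tags and URLs, the regular expressions, and the function name `remove_interference` are all assumptions made for the sake of the example.

```python
import re


def remove_interference(fragment: str) -> str:
    """Hypothetical step S1 helper: drop markup-like tags and URLs, keep the text."""
    text = re.sub(r"<[^>]+>", " ", fragment)    # assumed interference: HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)   # assumed interference: bare URLs
    return re.sub(r"\s+", " ", text).strip()    # collapse the leftover whitespace


print(remove_interference(
    "<p>Does apple have new energy projects?</p> https://example.com/x"
))
```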
The word segmentation extraction module 202 is configured to obtain a document to be processed, and to extract the entity word to be disambiguated and the word segmentation set from the document to be processed by using a word segmentation technology.
It is understood that the word segmentation extraction module 202 may perform the above step S2.
In the embodiment of the present invention, the word segmentation extraction module 202 is specifically configured to: acquire a document to be processed, and perform word segmentation and part-of-speech tagging on the document to be processed by using a HanLP word segmentation algorithm to obtain a plurality of segmented words and the part of speech corresponding to each segmented word; determine a target part of speech from the plurality of parts of speech, take the segmented word corresponding to the target part of speech as the entity word to be disambiguated, and take the remaining segmented words as the word segmentation set.
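The split performed by module 202 after tagging can be sketched as follows. The sketch assumes the segmentation and part-of-speech tagging have already produced (word, part of speech) pairs; it does not call HanLP. The tag names, including the target tag "ENTITY", are invented placeholders and are not HanLP's tag set, and `split_by_pos` is a hypothetical helper.

```python
from typing import List, Tuple


def split_by_pos(
    tagged: List[Tuple[str, str]], target_pos: str
) -> Tuple[List[str], List[str]]:
    """Separate words carrying the target part of speech from all other words."""
    entity_words = [word for word, pos in tagged if pos == target_pos]
    segmentation_set = [word for word, pos in tagged if pos != target_pos]
    return entity_words, segmentation_set


# Pre-tagged input for the "apple" example; the tags are placeholders.
tagged = [
    ("please ask", "v"),
    ("apple", "ENTITY"),
    ("have", "v"),
    ("new energy", "n"),
    ("related", "a"),
    ("project", "n"),
    ("mock", "y"),
]

entity_words, segmentation_set = split_by_pos(tagged, target_pos="ENTITY")
print(entity_words)
print(segmentation_set)
```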
The vocabulary tag determining module 203 is configured to determine a plurality of vocabulary tags corresponding to the entity word to be disambiguated from a preset entity word tag library.
It is understood that the vocabulary tag determination module 203 may perform the step S3 described above.
In the embodiment of the present invention, the preset entity word tag library comprises a preset word related library and a preset word classification library, wherein the preset word related library comprises a plurality of first words and a plurality of related words corresponding to each first word, and the preset word classification library comprises a plurality of second words and at least one industry category corresponding to each second word. The vocabulary tag determining module 203 is specifically configured to: compare the plurality of first words in the preset word related library with the entity word to be disambiguated to obtain a target first word consistent with the entity word to be disambiguated; compare the plurality of second words in the preset word classification library with the entity word to be disambiguated to obtain a target second word consistent with the entity word to be disambiguated; and take the plurality of related words corresponding to the target first word, together with the at least one industry category corresponding to the target second word, as the plurality of vocabulary tags corresponding to the entity word to be disambiguated.
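The lookup performed by module 203 can be pictured as two dictionary lookups whose results are concatenated. In the sketch below the two libraries are plain Python dictionaries, which is an assumption about storage only; the division of the "apple" labels into related words and industry categories is likewise an assumption made for illustration, and `vocabulary_tags` is a hypothetical name.

```python
from typing import Dict, List

# Preset word related library: first word -> related words (assumed contents).
related_library: Dict[str, List[str]] = {
    "apple": [
        "glory", "hua", "company", "mobile phone", "watch", "banana", "pear",
        "grape", "fruit tree", "research and development", "Qiao Busi",
    ],
}

# Preset word classification library: second word -> industry categories (assumed).
classification_library: Dict[str, List[str]] = {
    "apple": ["technical industry", "food industry"],
}


def vocabulary_tags(entity_word: str) -> List[str]:
    """Collect related words and industry categories whose key equals the entity word."""
    related = related_library.get(entity_word, [])
    categories = classification_library.get(entity_word, [])
    return list(related) + list(categories)


tags = vocabulary_tags("apple")
print(len(tags), tags[-2:])
```

An entity word found in neither library yields an empty list, in which case no similarity can be computed for it.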
The similarity calculation module 204 is configured to calculate the similarity between the plurality of vocabulary tags and the word segmentation set, and determine the target similarity.
It is understood that the similarity calculation module 204 may perform the step S4 described above.
In the embodiment of the present invention, the similarity calculation module 204 is specifically configured to: calculate the similarity between each vocabulary tag and the word segmentation set to obtain a similarity for each vocabulary tag; and compare all the similarities and take the maximum similarity as the target similarity.
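One possible way to score a vocabulary tag against the word segmentation set is sketched below. The similarity measure is not fixed in this passage, so the choice made here, the mean cosine similarity between the tag's word vector and the vectors of the segmented words, is an assumption, as are the two-dimensional toy vectors and the function names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_set_similarity(tag: str, words: List[str], vectors: Dict[str, Vector]) -> float:
    """Assumed measure: mean cosine similarity between the tag and each known word."""
    scores = [cosine(vectors[tag], vectors[w]) for w in words if w in vectors]
    return sum(scores) / len(scores) if scores else 0.0


def target_similarity(
    tags: List[str], words: List[str], vectors: Dict[str, Vector]
) -> Tuple[str, float]:
    """Score every tag that has a vector and return the tag with the maximum score."""
    sims = {t: tag_set_similarity(t, words, vectors) for t in tags if t in vectors}
    if not sims:
        raise ValueError("no vocabulary tag has a vector")
    best = max(sims, key=sims.get)
    return best, sims[best]


# Toy vectors, invented for the example.
vectors = {
    "company": (1.0, 0.0),
    "banana": (0.0, 1.0),
    "project": (0.9, 0.1),
    "new energy": (0.8, 0.2),
}
best_tag, best_score = target_similarity(
    ["company", "banana"], ["new energy", "project"], vectors
)
print(best_tag)
```

Any other measure that maps a tag and a word set to a number could be substituted for `tag_set_similarity` without changing the maximum-selection step.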
The entity word disambiguation module 205 is configured to take the vocabulary tag corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.
It is understood that the entity word disambiguation module 205 may perform step S5 described above.
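The cooperation of the five modules can be sketched as a small class that chains steps S1 to S5. This is an illustrative arrangement only: the class name, the field names, and the stand-in functions supplied in the usage lines are all hypothetical, and each module is represented simply as a callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DisambiguationApparatus:
    remove_interference: Callable[[str], str]          # module 201, step S1
    extract: Callable[[str], Tuple[str, List[str]]]    # module 202, step S2
    determine_tags: Callable[[str], List[str]]         # module 203, step S3
    similarity: Callable[[str, List[str]], float]      # used by module 204, step S4

    def disambiguate(self, fragment: str) -> Tuple[str, Optional[str]]:
        document = self.remove_interference(fragment)
        entity_word, words = self.extract(document)
        tags = self.determine_tags(entity_word)
        if not tags:
            return entity_word, None                    # nothing to compare against
        scores = {tag: self.similarity(tag, words) for tag in tags}
        best = max(scores, key=scores.get)              # module 205, step S5
        return entity_word, best


# Stand-in callables, invented for the example.
apparatus = DisambiguationApparatus(
    remove_interference=str.strip,
    extract=lambda doc: ("apple", [w for w in doc.split() if w != "apple"]),
    determine_tags=lambda word: ["company", "banana"],
    similarity=lambda tag, words: 0.7 if tag == "company" else 0.2,
)
result = apparatus.disambiguate("  apple new energy project  ")
print(result)
```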
In summary, an embodiment of the present invention provides a tag-assisted disambiguation method and apparatus, where the method includes: acquiring a document to be processed, and extracting the entity word to be disambiguated and the word segmentation set from the document to be processed by using a word segmentation technology; determining a plurality of vocabulary tags corresponding to the entity word to be disambiguated from a preset entity word tag library; calculating the similarity between each of the vocabulary tags and the word segmentation set, and determining the target similarity; and taking the vocabulary tag corresponding to the target similarity as the disambiguation result of the entity word to be disambiguated. Compared with the prior art, the tag-assisted disambiguation method provided by the embodiment of the present invention has the following advantages. First, interference items are removed from the engineering information fragment, so that the resulting document to be processed contains only text, which reduces the workload of the subsequent disambiguation and effectively improves disambiguation efficiency. Second, an accurate disambiguation result is obtained, so that the user is given unambiguous semantics for the extracted entity word and the accuracy of the entity information is improved.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

Claims (8)

1. A tag-assisted disambiguation method, the method comprising:
acquiring a document to be processed, and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by using a word segmentation technology;
the preset entity word tag library comprises a preset word related library and a preset word classification library, wherein the preset word related library comprises a plurality of first words and a plurality of related words corresponding to each first word, and the preset word classification library comprises a plurality of second words and at least one industry category corresponding to each second word; comparing a plurality of first vocabularies in the preset vocabulary related library with the entity words to be disambiguated to obtain target first vocabularies consistent with the entity words to be disambiguated;
comparing a plurality of second words in the preset word classification library with the entity words to be disambiguated to obtain target second words consistent with the entity words to be disambiguated;
taking a plurality of related vocabularies corresponding to the target first vocabularies and at least one industry category corresponding to the target second vocabularies as a plurality of vocabulary labels corresponding to the entity words to be disambiguated;
calculating the similarity between the vocabulary labels and the word segmentation set respectively, and determining the target similarity;
and taking the vocabulary label corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.
2. The method of claim 1, wherein the steps of obtaining a document to be processed and extracting the entity word to be disambiguated and the word segmentation set in the document to be processed using word segmentation techniques comprise:
acquiring a document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by utilizing a HanLP word segmentation algorithm to obtain a plurality of segmented words and a part of speech corresponding to each segmented word;
determining a target part of speech from the plurality of parts of speech, taking the segmented word corresponding to the target part of speech as the entity word to be disambiguated, and taking the remaining segmented words as the word segmentation set.
3. The method of claim 1, wherein the step of calculating the similarity of the plurality of vocabulary tags to the segmented set, respectively, and determining a target similarity comprises:
calculating the similarity of each tag word and the word segmentation set to obtain the similarity of each tag word;
all the similarities are compared, and the maximum similarity is taken as the target similarity.
4. A method according to any one of claims 1-3, wherein prior to the step of obtaining a document to be processed and extracting the entity words to be disambiguated and the set of words in the document to be processed using a word segmentation technique, the method further comprises:
and obtaining engineering information fragments, and carrying out interference item removal processing on the engineering information fragments to obtain a document to be processed.
5. A tag-based assisted disambiguation apparatus, the tag-based assisted disambiguation apparatus comprising:
the word segmentation extraction module is used for obtaining a document to be processed and extracting entity words to be disambiguated and word segmentation sets in the document to be processed by using a word segmentation technology;
the word label determining module is used for dividing a preset entity word label library into a preset word related library and a preset word classifying library, wherein the preset word related library comprises a plurality of first words and a plurality of related words corresponding to each first word, and the preset word classifying library comprises a plurality of second words and at least one industry class corresponding to each second word;
comparing a plurality of first vocabularies in the preset vocabulary related library with the entity words to be disambiguated to obtain target first vocabularies consistent with the entity words to be disambiguated;
comparing a plurality of second words in the preset word classification library with the entity words to be disambiguated to obtain target second words consistent with the entity words to be disambiguated;
taking a plurality of related vocabularies corresponding to the target first vocabularies and at least one industry category corresponding to the target second vocabularies as a plurality of vocabulary labels corresponding to the entity words to be disambiguated;
the similarity calculation module is used for calculating the similarity between the plurality of vocabulary labels and the word segmentation set respectively and determining target similarity;
and the entity word disambiguation module is used for taking the vocabulary label corresponding to the target similarity as a disambiguation result of the entity word to be disambiguated.
6. The apparatus of claim 5, wherein the word segmentation extraction module is specifically configured to:
acquiring a document to be processed, and performing word segmentation and part-of-speech tagging on the document to be processed by utilizing a HanLP word segmentation algorithm to obtain a plurality of segmented words and a part of speech corresponding to each segmented word;
determining a target part of speech from the plurality of parts of speech, taking the segmented word corresponding to the target part of speech as the entity word to be disambiguated, and taking the remaining segmented words as the word segmentation set.
7. The apparatus of claim 5, wherein the similarity calculation module is specifically configured to:
calculating the similarity of each tag word and the word segmentation set to obtain the similarity of each tag word;
all the similarities are compared, and the maximum similarity is taken as the target similarity.
8. The apparatus according to any one of claims 5 to 7, further comprising an interference item processing module, wherein the interference item processing module is configured to obtain an engineering information fragment, and perform interference item removal processing on the engineering information fragment to obtain a document to be processed.
CN202210758371.0A 2022-06-30 2022-06-30 Label-assisted disambiguation method and device Active CN115293158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210758371.0A CN115293158B (en) 2022-06-30 2022-06-30 Label-assisted disambiguation method and device

Publications (2)

Publication Number Publication Date
CN115293158A CN115293158A (en) 2022-11-04
CN115293158B true CN115293158B (en) 2024-02-02

Family

ID=83823162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210758371.0A Active CN115293158B (en) 2022-06-30 2022-06-30 Label-assisted disambiguation method and device

Country Status (1)

Country Link
CN (1) CN115293158B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN108491382A (en) * 2018-03-14 2018-09-04 四川大学 A kind of semi-supervised biomedical text semantic disambiguation method
CN109376309A (en) * 2018-12-28 2019-02-22 北京百度网讯科技有限公司 Document recommendation method and device based on semantic label
CN109635297A (en) * 2018-12-11 2019-04-16 湖南星汉数智科技有限公司 A kind of entity disambiguation method, device, computer installation and computer storage medium
CN111738009A (en) * 2019-03-19 2020-10-02 百度在线网络技术(北京)有限公司 Method and device for generating entity word label, computer equipment and readable storage medium
CN112395421A (en) * 2021-01-21 2021-02-23 平安科技(深圳)有限公司 Course label generation method and device, computer equipment and medium
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112966054A (en) * 2021-02-07 2021-06-15 撼地数智(重庆)科技有限公司 Enterprise graph node relation-based ethnic group division method and computer equipment
CN114547338A (en) * 2022-02-22 2022-05-27 撼地数智(重庆)科技有限公司 Method for identifying uniqueness of industrial and commercial main body

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031909B2 (en) * 2002-03-12 2006-04-18 Verity, Inc. Method and system for naming a cluster of words and phrases

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
多策略中文微博实体词消歧及实体链接;向宇;郭云龙;徐潇;曾维刚;李莉;;计算机应用与软件(第08期);18-23 *

Also Published As

Publication number Publication date
CN115293158A (en) 2022-11-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant