CN113743107B - Entity word extraction method and device and electronic equipment - Google Patents

Entity word extraction method and device and electronic equipment

Info

Publication number
CN113743107B
Authority
CN
China
Prior art keywords
word
candidate
entity word
entity
candidate entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111007981.9A
Other languages
Chinese (zh)
Other versions
CN113743107A (en)
Inventor
井玉欣
董伟
沈雨奇
刘江伟
王枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202111007981.9A priority Critical patent/CN113743107B/en
Publication of CN113743107A publication Critical patent/CN113743107A/en
Application granted granted Critical
Publication of CN113743107B publication Critical patent/CN113743107B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a method, a device and electronic equipment for extracting entity words. One embodiment of the method comprises the following steps: acquiring a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set; extracting word characteristics of each candidate entity word in the candidate entity word set; and selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word. The embodiment improves the accuracy of entity word extraction.

Description

Entity word extraction method and device and electronic equipment
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a method and a device for extracting entity words and electronic equipment.
Background
Text-based information carriers, such as instant messaging (IM) software, document editing applications, and mail applications, generally contain various abbreviations, product nouns, project nouns, business-specific words, terms, and the like, which may be referred to as entity words. Because entity words generally belong to a specific discipline or field, they may make a text difficult for a user to understand. Therefore, mining these entity words and providing corresponding word interpretations may facilitate the user's understanding of the text.
Disclosure of Invention
This Summary is provided to introduce concepts in a simplified form that are further described below in the detailed description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiment of the disclosure provides a method, a device and electronic equipment for extracting entity words, which utilize word characteristics to screen candidate entity words and improve the accuracy of entity word extraction.
In a first aspect, an embodiment of the present disclosure provides a method for extracting an entity word, including: acquiring a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set; extracting word characteristics of each candidate entity word in the candidate entity word set; and selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word.
In a second aspect, an embodiment of the present disclosure provides an entity word extracting apparatus, including: the obtaining unit is used for obtaining a text to be processed, and carrying out preset processing on the text to be processed to obtain a candidate entity word set; the extraction unit is used for extracting the word characteristics of each candidate entity word in the candidate entity word set; and the selecting unit is used for selecting the target entity word from the candidate entity word set based on the word characteristics and outputting the target entity word.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the entity word extraction method as in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the steps of the entity word extraction method as in the first aspect.
According to the entity word extraction method, the entity word extraction device and the electronic equipment, the text to be processed is firstly obtained, and the text to be processed is subjected to preset processing to obtain a candidate entity word set; then, extracting the word characteristics of each candidate entity word in the candidate entity word set; and finally, selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word. The candidate entity words are screened by utilizing the word characteristics in the mode, so that the accuracy of entity word extraction is improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is an exemplary system architecture diagram in which various embodiments of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method of entity word extraction according to the present disclosure;
FIG. 3 is a flow chart of yet another embodiment of a method of entity word extraction according to the present disclosure;
FIG. 4 is a flow chart of one embodiment of updating a set of candidate entity-words in an entity-word extraction method according to the present disclosure;
FIG. 5 is a schematic diagram of one embodiment of a method of entity word extraction in accordance with the present disclosure;
FIG. 6 is a flow chart of yet another embodiment of a method of entity word extraction according to the present disclosure;
FIG. 7 is a schematic diagram of yet another embodiment of an entity word extraction method according to the present disclosure;
FIG. 8 is a schematic diagram of an embodiment of an entity word extraction device according to the present disclosure;
fig. 9 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality of" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 illustrates an exemplary system architecture 100 in which embodiments of the entity word extraction method of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 1011, 1012, 1013, a network 102, and a server 103. The network 102 serves as a medium for providing communication links between the terminal devices 1011, 1012, 1013 and the server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the terminal devices 1011, 1012, 1013 to send or receive messages or the like, e.g. the user may send text to be processed to the server 103 using the terminal devices 1011, 1012, 1013. The terminal devices 1011, 1012, 1013 may have various communication client applications installed thereon, such as an image processing type application, instant messaging software, and the like.
The terminal devices 1011, 1012, 1013 may be hardware or software. When the terminal devices 1011, 1012, 1013 are hardware, they may be various electronic devices having a display screen and supporting information interaction, including but not limited to smart phones, tablet computers, laptop portable computers, and the like. When the terminal devices 1011, 1012, 1013 are software, they can be installed in the above-listed electronic devices. Which may be implemented as multiple software or software modules (e.g., multiple software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server 103 may be a server providing various services. For example, the server 103 may obtain a text to be processed from the terminal devices 1011, 1012, 1013, and perform a preset process on the text to be processed to obtain a candidate entity word set; then, extracting the word characteristics of each candidate entity word in the candidate entity word set; finally, a target entity word may be selected from the candidate entity word set based on the word characteristics, and the target entity word may be output, for example, the target entity word may be output to the terminal devices 1011, 1012, 1013, or the target entity word may be output locally.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or software module. The present invention is not particularly limited herein.
It should be further noted that, the entity word extraction method provided in the embodiments of the present disclosure may be executed by the server 103, and in this case, the entity word extraction device is generally disposed in the server 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method of entity word extraction according to the present disclosure is shown. The entity word extraction method comprises the following steps:
Step 201, obtaining a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set.
In this embodiment, the execution subject of the entity word extraction method (e.g., the server shown in fig. 1) may acquire the text to be processed. The text to be processed may be a text extracted from the target text, or may be a search sentence (query) input by the user. Here, the target text may be text input by a user.
Then, the execution body may perform preset processing on the text to be processed to obtain a candidate entity word set. Specifically, the execution body may perform language recognition on the text to be processed to obtain a language recognition result. The language recognition result may include English text and Chinese text. It should be noted that mixed Chinese-English text may be treated as Chinese text.
If the text to be processed is Chinese text, the execution body may perform word segmentation on the text to be processed, for example, using a Chinese word segmentation method. Then, stop words may be deleted from the word segmentation results. Next, the part of speech of each remaining word is tagged, and the words are screened by part of speech to obtain a candidate entity word set. As an example, words may be retained whose parts of speech include, but are not limited to, at least one of the following: English, IT technology related vocabulary, academic vocabulary, mathematics related vocabulary, institution related vocabulary, education related vocabulary, government institution vocabulary, factory name, company name, bank, place name, hotel, and name abbreviation.
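As a non-limiting illustration (not part of the original disclosure), the Chinese preprocessing described above could be sketched with the open-source jieba tokenizer; the stop-word list and the set of retained part-of-speech tags below are assumptions chosen only for illustration.

```python
# Illustrative sketch of the Chinese preprocessing step: word segmentation,
# stop-word filtering, POS tagging and POS screening. jieba is one possible
# tokenizer; STOP_WORDS and RETAINED_POS are assumed values.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "和", "是", "在"}          # assumed stop-word list
RETAINED_POS = {"eng", "nz", "nt", "ns", "nr", "j"}  # assumed POS tags to keep

def build_candidate_set(text: str) -> set[str]:
    candidates = set()
    for token in pseg.cut(text):              # word segmentation + POS tagging
        word, flag = token.word, token.flag
        if word in STOP_WORDS:                # stop-word filtering
            continue
        if flag in RETAINED_POS:              # part-of-speech screening
            candidates.add(word)
    return candidates
```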
If the text to be processed is English text, the execution body may normalize punctuation and abbreviations in the text to be processed, for example, converting "'s" into "is". Then, the normalized text to be processed is parsed (for example, with benepar, a constituency parser based on deep learning), and noun blocks conforming to preset rules are extracted according to part of speech to obtain a candidate entity word set, where the rules may, for example, specify that nouns and abbreviations are retained.
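The English branch can be sketched in a similarly hedged way; the snippet below uses spaCy's built-in noun_chunks as a simplified stand-in for the benepar-based constituency parsing mentioned above, and the normalization rule shown is only a crude example.

```python
# Rough sketch of English candidate extraction: normalization followed by
# noun-block extraction. spaCy's noun_chunks is a simplified stand-in for the
# benepar-based parsing described in the text.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_english_candidates(text: str) -> set[str]:
    text = text.replace("'s ", " is ")   # crude abbreviation normalization (illustrative only)
    doc = nlp(text)
    candidates = set()
    for chunk in doc.noun_chunks:        # noun blocks
        # keep chunks made up only of nouns / proper nouns
        if all(tok.pos_ in {"NOUN", "PROPN"} for tok in chunk):
            candidates.add(chunk.text)
    return candidates
```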
Step 202, extracting word characteristics of each candidate entity word in the candidate entity word set.
In this embodiment, the execution body may extract word features of each candidate entity word in the candidate entity word set. Here, for each candidate entity word in the candidate entity word set, the execution subject may input the candidate entity word into a feature extraction model trained in advance, to obtain a word feature of the candidate entity word. By way of example, the term features described above may include, but are not limited to, at least one of: word length of a word, part of speech of a word, word vector of a word, and whether a word is a commonly used word.
Step 203, selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word.
In this embodiment, the execution subject may select the target entity word from the candidate entity word set based on the word characteristics. Specifically, the execution body may determine a score of each candidate entity word in the candidate entity word set based on the word characteristics. For example, for each candidate entity word in the candidate entity word set, the execution subject may input the word feature of the candidate entity word into a pre-trained first score prediction model to obtain the score of the candidate entity word. Then, a preset first number (for example, 20) of candidate entity words may be selected from the candidate entity word set in descending order of score and output as target entity words.
Here, the execution subject may output the target entity word to a terminal device of a target user (for example, a reviewer), and if the target entity word passes the target user's review, the target entity word may be added to an existing entity word set.
According to the method provided by the embodiment of the disclosure, the text to be processed is obtained, and preset processing is carried out on the text to be processed, so that a candidate entity word set is obtained; then, extracting the word characteristics of each candidate entity word in the candidate entity word set; and finally, selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word. The candidate entity words are screened by utilizing the word characteristics in the mode, so that the accuracy of entity word extraction is improved.
In some optional implementations, the executing body may perform preset processing on the text to be processed to obtain a candidate entity word set by: the execution body may perform language identification on the text to be processed. Here, the execution subject may input the text to be processed into a pre-trained language recognition model to obtain the language of the text to be processed. The language may include, but is not limited to, at least one of the following: Chinese and English. If the text to be processed is Chinese text or Chinese-English mixed text, the execution body may identify the entity words in the text to be processed by using a Named Entity Recognition (NER) technique. Named entity recognition, also referred to as "proper name recognition", generally refers to the recognition of entities in text that have a particular meaning, mainly including person names, place names, institution names, proper nouns, and the like. Simply stated, it identifies the boundaries and categories of entities in natural text. Here, person-name entity words may be removed. Then, entity words already present in the target entity word set are deleted from the identified entity words to obtain the candidate entity word set. The target entity word set may be a maintained set of already-mined entity words.
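A minimal sketch of this NER-based candidate generation, assuming spaCy's Chinese pipeline as the NER toolkit and an assumed maintained target entity word set (the label name and example set are illustrative assumptions):

```python
# Sketch of NER-based candidate generation for Chinese text. spaCy's Chinese
# pipeline is one possible NER toolkit; TARGET_ENTITY_WORDS is an assumed
# already-mined entity word set.
import spacy

nlp_zh = spacy.load("zh_core_web_sm")            # assumed Chinese NER model
TARGET_ENTITY_WORDS = {"字节跳动"}               # assumed maintained entity word set

def chinese_candidates(text: str) -> set[str]:
    doc = nlp_zh(text)
    candidates = set()
    for ent in doc.ents:
        if ent.label_ == "PERSON":               # drop person-name entities
            continue
        if ent.text in TARGET_ENTITY_WORDS:      # drop already-mined entity words
            continue
        candidates.add(ent.text)
    return candidates
```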
In some optional implementations, the executing body may perform preset processing on the text to be processed to obtain a candidate entity word set by: the execution body may perform language identification on the text to be processed. If the text to be processed is English text, the executing body may identify the entity words in the text to be processed by using a text matching technique. Here, a regular-expression matching technique may be used. For example, given the matching pair (sentence to be matched: "I am writing an article related to a business entity word", word to be matched: "business entity word"), regular matching can determine whether "business entity word" exists in the original sentence and the position where it appears, for example, returning [5,9] as the position of the word to be matched in the sentence to be matched. Here, person-name entity words may be removed. Then, entity words already present in the target entity word set are deleted from the identified entity words to obtain the candidate entity word set. The target entity word set may be a maintained set of already-mined entity words.
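A minimal sketch of the text-matching step, assuming plain regular-expression matching as described above (the example sentence and word are illustrative):

```python
# Sketch of the text-matching step: locate a word to be matched inside a
# sentence and return its character span(s), mirroring the [start, end]
# example given in the text.
import re

def match_positions(sentence: str, word: str) -> list[tuple[int, int]]:
    # re.escape guards against regex metacharacters inside the word
    return [(m.start(), m.end()) for m in re.finditer(re.escape(word), sentence)]

# e.g. match_positions("I am writing an article related to a business entity word",
#                      "business entity word")
```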
With further reference to fig. 3, a flow 300 of yet another embodiment of a method of entity word extraction is shown. The entity word extraction method flow 300 includes the following steps:
Step 301, obtaining a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set.
Step 302, extracting word characteristics of each candidate entity word in the candidate entity word set.
In this embodiment, steps 301-302 may be performed in a similar manner to steps 201-202, and will not be described again.
Step 303, determining the word weight of each candidate entity word in the candidate entity word set based on the position information of the entity word in the text to be processed.
In this embodiment, the execution body of the entity word extraction method (for example, the server shown in fig. 1) may determine the word weight of each candidate entity word in the candidate entity word set based on the location information of the entity word in the text to be processed.
Specifically, for each candidate entity word in the candidate entity word set, the execution subject may determine the locations of the candidate entity word in the text to be processed. The execution subject may store a correspondence between locations and weights; if the candidate entity word appears in a plurality of locations in the text to be processed, the execution subject may add the weights corresponding to those locations to obtain the word weight of the candidate entity word.
As an example, suppose the correspondence between word position and weight is as follows: text: 1; title: 0.5; h1 tag: 0.3; h2 tag: 0.2; h3 tag: 0.1. If the candidate entity word TCE appears in the text, the title, an h1 tag and an h3 tag at the same time, the word weight of the candidate entity word TCE is 1.9.
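A minimal sketch of this word-weight computation, using the position/weight correspondence from the example above (the dictionary keys are assumed names for the structural positions):

```python
# Sketch of the word-weight computation: sum the weights of every structural
# position in which a candidate entity word appears.
POSITION_WEIGHTS = {"text": 1.0, "title": 0.5, "h1": 0.3, "h2": 0.2, "h3": 0.1}

def word_weight(positions: set[str]) -> float:
    """positions: structural positions where the candidate word occurs."""
    return sum(POSITION_WEIGHTS.get(p, 0.0) for p in positions)

# word_weight({"text", "title", "h1", "h3"}) == 1.9, as for the candidate "TCE" above
```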
Step 304, for each candidate entity word in the set of candidate entity words, determining a score of the candidate entity word based on the word characteristics and the word weights of the candidate entity word.
In this embodiment, for each candidate entity word in the set of candidate entity words, the execution entity may determine the score of the candidate entity word based on the word feature and the word weight of the candidate entity word. Specifically, the execution subject may input the word characteristics and the word weights of the candidate entity words into a pre-trained second score prediction model to obtain the score of the candidate entity word.
Step 305, selecting a target entity word from the candidate entity word set based on the score of each candidate entity word in the candidate entity word set, and outputting the target entity word.
In this embodiment, the execution entity may select a target entity word from the candidate entity word set based on the score of each candidate entity word in the candidate entity word set and output the target entity word.
As an example, the execution body may select a preset second number of candidate entity words from the candidate entity word set as the target entity word in order of scores from high to low.
As another example, the execution body may select, as the target entity word, a candidate entity word having a score greater than a preset first score threshold from the candidate entity word set.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the flow 300 of the entity word extraction method in this embodiment represents the step of scoring candidate entity words using the word characteristics of the words and the weights of the words. Therefore, the scheme described in the embodiment can further improve the accuracy of entity word extraction.
In some alternative implementations, the text to be processed may be Chinese text or mixed Chinese-English text. In this case, the term characteristics may include the inverse document frequency of the term, the word frequency-inverse document frequency of the term, the ratio of the N-Gram scores of the term in the text to be processed and in the target corpus, and the ratio of the confusion degree of the term in the text to be processed and in the target corpus. The term frequency of a term generally refers to the number of times the term appears in the text to be processed. The inverse document frequency (Inverse Document Frequency, IDF) of a term is generally used to describe the category discrimination capability of the term and is a measure of the general importance of the term: if few documents contain the term, its IDF is large; if many documents contain the term, its IDF is small. The inverse document frequency of a term can be obtained by dividing the total number of documents (the number of documents in a preset external corpus) by the number of documents containing the term, and taking the base-10 logarithm of the resulting quotient. The word frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) of a term generally refers to the product of the term frequency and the inverse document frequency. The main idea of TF-IDF is: if a term appears with a high frequency TF in one article but rarely appears in other articles, the term is considered to have good category discrimination capability and is suitable for classification. The ratio of N-Gram scores generally refers to the ratio of the N-Gram score of the term in the text to be processed to the N-Gram score of the term in the target corpus (a corpus in a preset external corpus such as Wikipedia). Here, the N-Gram score is computed by running the input text (here, the term) through an N-Gram language model; it indicates how common the term is in the corpus: the smaller the value, the rarer the term (e.g., -100); the larger the value, the more common the term (e.g., -1.0). The calculation of the N-Gram score may be supported by the KenLM toolkit, which first trains a model on a specified corpus; after training, the score of an input term can be computed from the model. The ratio of the confusion degree generally refers to the ratio of the confusion degree of the term in the text to be processed to the confusion degree of the term in the target corpus. The confusion degree (perplexity) is typically used to measure how well a probability distribution or model predicts a sample. The confusion degree can be obtained from a pre-trained confusion degree prediction model, i.e., the confusion degree of a term in a text is obtained by inputting the term and the text into the confusion degree prediction model.
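The following sketch illustrates, under stated assumptions, how the importance and specificity features described above might be computed; the corpus paths and document collection are placeholders, and the KenLM Python binding (kenlm.Model(...).score(...)) returns a log10 language-model score for the input string.

```python
# Sketch of the Chinese word-feature computation: IDF, TF-IDF and the
# N-Gram score ratio between the text to be processed and a target corpus.
# The model paths are assumptions for illustration.
import math
import kenlm

lm_doc = kenlm.Model("doc_corpus.arpa")        # assumed: LM trained on the processed texts
lm_target = kenlm.Model("wiki_corpus.arpa")    # assumed: LM trained on the external corpus

def idf(word: str, documents: list[str]) -> float:
    containing = sum(1 for d in documents if word in d)
    return math.log10(len(documents) / (containing + 1))   # +1 guards against division by zero

def tf_idf(word: str, text: str, documents: list[str]) -> float:
    tf = text.count(word)                                   # term frequency in the text
    return tf * idf(word, documents)

def ngram_score_ratio(word: str) -> float:
    # ratio of the word's N-Gram score on the processed texts to its score on the target corpus
    return lm_doc.score(word) / lm_target.score(word)
```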
It should be noted that, the inverse document frequency of the word and the word frequency inverse document frequency of the word are generally used to measure the importance of the word, and the ratio of the N-Gram score of the word in the text to be processed and the target corpus and the ratio of the confusion degree of the word in the text to be processed and the target corpus are generally used to measure the specificity of the word.
In some alternative implementations, the executing entity may determine the score of the candidate entity word based on the word characteristics and the word weights of the candidate entity word by: the execution body may perform weighted summation on the inverse document frequency of the candidate entity word, the word frequency inverse document frequency of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the to-be-processed text and the target corpus, and the ratio of the confusion degree of the candidate entity word in the to-be-processed text and the target corpus, and multiply the summation result with the word weight of the candidate entity word to obtain the score of the candidate entity word. Here, the execution subject may determine the score of the candidate entity word by the following formula (1):
Score = (k1 × tfidf + k2 × idf + k3 × ngram_score + k4 × perplexity_score) × structure_weight (1)
Wherein Score is the score of the candidate entity word, tfidf is the word frequency-inverse document frequency of the candidate entity word, idf is the inverse document frequency of the candidate entity word, ngram_score is the ratio of the N-Gram scores of the candidate entity word in the text to be processed and the target corpus, perplexity_score is the ratio of the confusion degree of the candidate entity word in the text to be processed and the target corpus, k1 is the coefficient corresponding to tfidf, k2 is the coefficient corresponding to idf, k3 is the coefficient corresponding to ngram_score, k4 is the coefficient corresponding to perplexity_score, and structure_weight is the word weight of the candidate entity word.
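Formula (1) transcribes directly into code; the coefficient values below are placeholders rather than values taken from the disclosure.

```python
# Direct transcription of formula (1): a weighted sum of the four word features
# multiplied by the structural word weight. k1..k4 are tunable placeholders.
def chinese_candidate_score(tfidf: float, idf: float,
                            ngram_score: float, perplexity_score: float,
                            structure_weight: float,
                            k1: float = 1.0, k2: float = 1.0,
                            k3: float = 1.0, k4: float = 1.0) -> float:
    return (k1 * tfidf + k2 * idf
            + k3 * ngram_score + k4 * perplexity_score) * structure_weight
```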
In some alternative implementations, the text to be processed may be English text. In this case, the term characteristics may include the keyword extraction score of the term, the ratio of the N-Gram scores of the term in the text to be processed and the target corpus, and the ratio of the confusion degree of the term in the text to be processed and the target corpus. The keyword extraction score of a term generally refers to the RAKE score of the entity word calculated using the RAKE algorithm. The RAKE algorithm is used to extract keywords, which are in fact key phrases and tend to be longer; in English, keywords typically consist of multiple words but rarely contain punctuation or stop words such as "and", "the", "of", or other words that carry no semantic information. The RAKE algorithm first uses punctuation marks (e.g., half-width periods, question marks, exclamation marks, commas, etc.) to divide a document into several clauses; then, for each clause, it uses stop words as separators to divide the clause into several phrases, which are the candidates for the finally extracted keywords. The ratio of N-Gram scores generally refers to the ratio of the N-Gram score of the term in the text to be processed to the N-Gram score of the term in the target corpus. The ratio of the confusion degree generally refers to the ratio of the confusion degree of the term in the text to be processed to the confusion degree of the term in the target corpus. The confusion degree (perplexity) is typically used to measure how well a probability distribution or model predicts a sample.
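A hedged sketch of the keyword extraction score, using the rake-nltk package as one possible RAKE implementation (its use here is an assumption; the disclosure only names the RAKE algorithm):

```python
# Sketch of the keyword-extraction (RAKE) score for English candidates.
# Note: rake-nltk requires the NLTK stopwords/punkt data to be downloaded.
from rake_nltk import Rake

def rake_scores(text: str) -> dict[str, float]:
    r = Rake()                                   # default English stop words and punctuation
    r.extract_keywords_from_text(text)
    # get_ranked_phrases_with_scores() returns (score, phrase) pairs, highest first
    return {phrase: score for score, phrase in r.get_ranked_phrases_with_scores()}
```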
It should be noted that, the keyword extraction score of a word is generally used to measure the importance of the word, and the ratio of the N-Gram score of the word in the text to be processed and the target corpus and the ratio of the confusion degree of the word in the text to be processed and the target corpus are generally used to measure the specificity of the word.
In some alternative implementations, the executing entity may determine the score of the candidate entity word based on the word characteristics and the word weights of the candidate entity word by: the execution body may perform weighted summation on the keyword extraction score of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the to-be-processed text and the target corpus, and the ratio of the confusion degree of the candidate entity word in the to-be-processed text and the target corpus, and multiply the summation result with the word weight of the candidate entity word to obtain the score of the candidate entity word. Here, the execution subject may determine the score of the candidate entity word by the following formula (2):
Score = (k1 × rake_score + k2 × ngram_score + k3 × perplexity_score) × structure_weight (2)
Wherein Score is the score of the candidate entity word, rake_score is the keyword extraction score of the candidate entity word, ngram_score is the ratio of the N-Gram scores of the candidate entity word in the text to be processed and the target corpus, perplexity_score is the ratio of the confusion degree of the candidate entity word in the text to be processed and the target corpus, k1 is the coefficient corresponding to rake_score, k2 is the coefficient corresponding to ngram_score, k3 is the coefficient corresponding to perplexity_score, and structure_weight is the word weight of the candidate entity word.
In some alternative implementations, the text to be processed may be Chinese text or mixed Chinese-English text. The execution body may select the target entity word from the set of candidate entity words based on the score of each candidate entity word as follows: the execution body may update the set of candidate entity words based on the score and the word characteristics of each candidate entity word in the set. Specifically, the executing body may select the entity words in the candidate entity word set whose ratio of N-Gram scores in the text to be processed and the target corpus is greater than a preset first ratio threshold, whose ratio of confusion degree in the text to be processed and the target corpus is greater than a preset second ratio threshold, and whose score is greater than a preset score threshold, to generate an updated candidate entity word set; thereafter, a target entity word may be selected from the updated set of candidate entity words. Here, a preset third number of candidate entity words may be selected from the updated candidate entity word set as target entity words in descending order of the word frequency of the words in the text to be processed.
With further reference to FIG. 4, a flow 400 of one embodiment of updating a set of candidate entity-words in an entity-word extraction method is illustrated. The updating process 400 for updating the candidate entity word set includes the following steps:
Step 401, based on the candidate entity word set, executing the following entity word selection steps: selecting entity words meeting preset conditions from the candidate entity word set, and combining the entity words meeting the conditions to obtain at least one word combination; determining word combinations appearing in the text to be processed in at least one word combination as candidate compound entity words; determining, for each candidate compound entity word, a score for the candidate compound entity word based on the score of the candidate entity word and the word characteristics of the candidate entity word that make up the candidate compound entity word; updating the candidate entity word set based on the score of the candidate compound entity word, the word characteristics of the candidate compound entity word, the score and the word characteristics of each candidate entity word in the candidate entity word set; and determining whether the updated candidate entity word set is identical to the candidate entity word set.
In this embodiment, step 401 may include sub-steps 4011, 4012, 4013, 4014, and 4015. Wherein:
Step 4011, selecting entity words meeting preset conditions from the candidate entity word set, and combining the entity words meeting the conditions to obtain at least one word combination.
In this embodiment, the execution body (for example, the server shown in fig. 1) of the entity word extraction method may select entity words meeting preset conditions from the candidate entity word set, and combine the entity words meeting the conditions to obtain at least one word combination. The condition may include not exceeding a predetermined character length threshold, in which case the execution body may select candidate entity words whose character length does not exceed the character length threshold from the candidate entity word set. For example, if the candidate entity words are Chinese, the character length threshold may be set to 4, and the execution entity may select candidate entity words containing no more than 4 characters from the candidate entity word set. If the candidate entity words are English, the character length threshold may be set to 15, and the execution entity may select candidate entity words containing no more than 15 letters from the candidate entity word set. Then, the selected candidate entity words are combined pairwise. For example, if the selected candidate entity words are a and b, two combinations, ab and ba, can be obtained.
Step 4012, determining a word combination appearing in the text to be processed in the at least one word combination as a candidate compound entity word, and adding the candidate compound entity word to the candidate entity word set.
In this embodiment, the execution body may determine, as the candidate compound entity word, a word combination that appears in the text to be processed in the at least one word combination. As an example, if the word combination is ab and ba, the execution body may determine whether ab or ba appears in the text to be processed. If the word combination ab appears in the text to be processed, and the word combination ba does not appear in the text to be processed, the word combination ab can be determined to be a candidate compound entity word.
It should be noted that, if the words a and b are English, the execution entity needs to determine whether a+" "+b and b+" "+a (i.e., the two words joined by a space) appear in the text to be processed.
The execution body may then add the candidate compound entity word to a set of candidate entity words.
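A minimal sketch of this combination step, under the length thresholds given in the example above (joining English words with a space, Chinese words directly):

```python
# Sketch of the compound-candidate step: pair up short candidate words and keep
# the combinations that actually occur in the text to be processed.
from itertools import permutations

def compound_candidates(candidates: set[str], text: str, is_english: bool) -> set[str]:
    limit = 15 if is_english else 4              # length thresholds from the example above
    short = [w for w in candidates if len(w) <= limit]
    compounds = set()
    for a, b in permutations(short, 2):          # ordered pairs, e.g. "ab" and "ba"
        combined = f"{a} {b}" if is_english else a + b
        if combined in text:                     # keep only combinations present in the text
            compounds.add(combined)
    return compounds
```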
Step 4013, for each candidate compound entity word, determining a score for the candidate compound entity word based on the scores of the candidate entity words that make up the candidate compound entity word.
In this embodiment, for each candidate compound entity word, the execution body may determine the score of the candidate compound entity word based on the scores of the candidate entity words that constitute the candidate compound entity word. Specifically, the execution subject may determine the average value of the scores of the two candidate entity words constituting the candidate compound entity word as the score of the candidate compound entity word.
Step 4014, updating the added candidate entity word set based on the score of the candidate compound entity word, the word characteristics of the candidate compound entity word, the score and the word characteristics of each candidate entity word in the candidate entity word set.
In this embodiment, the execution body may update the added candidate entity word set based on the score of the candidate compound entity word, the word feature of the candidate compound entity word, the score and the word feature of each candidate entity word in the candidate entity word set.
The executing body may select candidate entity words with scores greater than a preset third score threshold from the added candidate entity word set, and then may select a preset fourth number of candidate entity words from the selected candidate entity words as target entity words according to the order of the word frequencies of the words in the text to be processed from large to small.
Step 4015, determining whether the updated set of candidate entity words is the same as the set of candidate entity words.
In this embodiment, the execution body may determine whether the updated candidate entity word set is the same as the candidate entity word set. I.e., determining whether no new candidate compound entity words have been added to the set of candidate entity words.
If the updated candidate entity word set is not the same as the candidate entity word set, the execution body may execute step 402.
Step 402, if the updated candidate entity word set is different from the candidate entity word set, using the updated candidate entity word set as the candidate entity word set, and continuing to execute the entity word selection step.
In this embodiment, if it is determined in step 4015 that the updated candidate entity word set is different from the candidate entity word set, the execution body may use the updated candidate entity word set as the candidate entity word set, and continue to execute the entity word selection step (i.e., steps 4011-4015).
According to the method provided by the embodiment of the disclosure, the candidate compound entity words are obtained by recombining the candidate entity words, the scores of the candidate compound entity words are determined, and the candidate entity word set is updated based on the word characteristics of the candidate entity words and the candidate compound entity words and the scores of the candidate entity words and the candidate compound entity words, so that the situation that the entity words cannot be mined due to word segmentation can be avoided.
In some optional implementations, if it is determined in step 4015 that the updated candidate entity word set is the same as the candidate entity word set, the executing entity may select the target entity word from the updated candidate entity word set based on the score of each candidate entity word in the updated candidate entity word set. Here, the execution subject may select a preset fifth number of candidate entity words from the updated candidate entity word set in order of the scores from the higher score to the lower score as the target entity word.
In some alternative implementations, the term characteristics may include term frequencies of the terms. Here, the word frequency of the candidate entity word may be the number of occurrences of the candidate entity word in the text to be processed. The execution body may determine the score of the candidate compound entity word based on the scores of the candidate entity words constituting the candidate compound entity word by: the execution body may weight and sum the scores of the two candidate entity words that constitute the candidate compound entity word to obtain the score of the candidate compound entity word. Here, for each of the two candidate entity words that make up the candidate compound entity word, the weight corresponding to the candidate entity word may be a ratio of a word frequency of the candidate entity word to a total word frequency, which may be a sum of word frequencies of the two candidate entity words that make up the candidate compound entity word.
Here, the execution subject may determine the score of the candidate compound entity word by the following formula (3):
Score = (tf_a / (tf_a + tf_b)) × Score_a + (tf_b / (tf_a + tf_b)) × Score_b (3)
Wherein Score is the score of the candidate compound entity word, tf_a and tf_b are the word frequencies of the two words (a and b) constituting the candidate compound entity word, and Score_a and Score_b are the scores of the two words (a and b) constituting the candidate compound entity word, respectively.
Here, the weight of the word a is the ratio of the word frequency of the word a to the total word frequency (the sum of the word frequency of the word a and the word frequency of the word b), and the weight of the word b is the ratio of the word frequency of the word b to the total word frequency (the sum of the word frequency of the word a and the word frequency of the word b).
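Formula (3) likewise transcribes directly into code:

```python
# Direct transcription of formula (3): the compound word's score is the
# word-frequency-weighted average of the scores of its two constituent words.
def compound_score(tf_a: float, tf_b: float, score_a: float, score_b: float) -> float:
    total = tf_a + tf_b
    return (tf_a / total) * score_a + (tf_b / total) * score_b
```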
In some optional implementations, the executing entity may update the added candidate entity word set based on the score of the candidate compound entity word, the word feature of the candidate compound entity word, and the score and the word feature of each candidate entity word in the candidate entity word set, as follows: the execution body may filter the candidate entity words in the added candidate entity word set based on their scores to generate a candidate entity word subset. Specifically, the execution body may rank the candidate entity words other than the compound entity words in the candidate entity word set in ascending order of score (that is, only the fine-grained entity words are processed here). Then, the candidate entity words in the last preset proportion of the ranking (for example, 20%) may be deleted from the candidate entity word set to obtain the candidate entity word subset. Then, for each candidate compound entity word, the execution body may determine whether at least two candidate entity words constituting the candidate compound entity word exist in the subset of candidate entity words. If so, the execution subject may delete, from the subset of candidate entity words, the lower-scoring of the at least two candidate entity words that form the candidate compound entity word.
As an example, for a candidate compound entity word ab, the execution body may determine whether there are two fine-grained entity words a and b that make up ab in the subset of candidate entity words. If the subset of candidate entity words includes the fine-grained entity words a and b, the execution body may compare the scores of the fine-grained entity words a and b, and if the score of the fine-grained entity word b is lower than the score of the fine-grained entity word a, the execution body may delete the fine-grained entity word b from the subset of candidate entity words.
In some alternative implementations, the term characteristics may include a ratio of N-Gram scores of terms in the text to be processed and the target corpus and a ratio of confusion of terms in the text to be processed and the target corpus. After determining, for each candidate compound entity word, whether at least two candidate entity words constituting the candidate compound entity word exist in the subset of candidate entity words, if one candidate entity word constituting the candidate compound entity word exists in the subset of candidate entity words, the execution body may determine whether a first ratio corresponding to the candidate compound entity word is greater than a first ratio corresponding to the candidate entity word constituting the candidate compound entity word, and determine whether a second ratio corresponding to the candidate compound entity word is greater than a second ratio corresponding to the candidate entity word constituting the candidate compound entity word. The first ratio may be a ratio of N-Gram scores of the words in the text to be processed and the target corpus, and the second ratio may be a ratio of confusion of the words in the text to be processed and the target corpus.
If the first ratio corresponding to the candidate compound entity word is greater than the first ratio corresponding to the candidate entity word composing the candidate compound entity word and/or the second ratio corresponding to the candidate compound entity word is greater than the second ratio corresponding to the candidate entity word composing the candidate compound entity word, the executing body may delete the candidate entity word composing the candidate compound entity word from the subset of candidate entity words.
If the first ratio corresponding to the candidate compound entity word is smaller than the first ratio corresponding to the candidate entity word composing the candidate compound entity word and the second ratio corresponding to the candidate compound entity word is smaller than the second ratio corresponding to the candidate entity word composing the candidate compound entity word, the executing body may simultaneously reserve the candidate entity word composing the candidate compound entity word and the candidate compound entity word.
For example, if the subset of candidate entity words is { a, b, c, d, ab }, then { a, b, ab } needs to be integrated. The scores of a, b and ab, their ratios of N-Gram scores, and their ratios of confusion degree are shown in the following table:
The execution body determines that the two fine-grained entity words a and b constituting ab exist in the candidate entity word subset, and then compares the scores of the fine-grained entity words a and b; because the score of the fine-grained entity word b is lower than the score of the fine-grained entity word a by 10.0, the execution body may delete the fine-grained entity word b from the candidate entity word subset. The execution body may then compare the ratio of N-Gram scores corresponding to the fine-grained entity word a with the ratio of N-Gram scores corresponding to the candidate compound entity word ab, and compare the ratio of confusion degree corresponding to the fine-grained entity word a with the ratio of confusion degree corresponding to the candidate compound entity word ab. Because the ratio of N-Gram scores and the ratio of confusion degree corresponding to the candidate compound entity word ab are both larger than those corresponding to the fine-grained entity word a, the fine-grained entity word a can be deleted from the candidate entity word subset, and the updated candidate entity word set is { c, d, ab }.
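A hedged sketch of this integration logic for a single compound candidate and its two constituents; the feature dictionary layout (score, ngram_ratio, perplexity_ratio) is an assumed representation, not the disclosure's data structure.

```python
# Sketch of the integration step for a compound candidate (e.g. "ab") and its
# fine-grained constituents (a, b), following the rules described above.
def integrate_compound(subset: set[str], compound: str, parts: tuple[str, str],
                       feats: dict[str, dict[str, float]]) -> set[str]:
    a, b = parts
    present = [w for w in (a, b) if w in subset]
    if len(present) == 2:
        # both constituents remain: drop the lower-scoring one first
        lower = min(present, key=lambda w: feats[w]["score"])
        subset.discard(lower)
        present.remove(lower)
    if len(present) == 1:
        w = present[0]
        if (feats[compound]["ngram_ratio"] > feats[w]["ngram_ratio"]
                or feats[compound]["perplexity_ratio"] > feats[w]["perplexity_ratio"]):
            subset.discard(w)        # the compound is more specific; drop the constituent
        # otherwise both the constituent and the compound are kept
    return subset
```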
With continued reference to FIG. 5, a schematic diagram of one embodiment of the entity word extraction method is shown. In fig. 5, the execution body of the entity word extraction method may obtain the user authorization data and then split the user authorization data into sentences to obtain a plurality of texts to be processed, where the plurality of texts to be processed include Chinese/English mixed text and English text.
For the Chinese/English mixed text, the execution body may perform Chinese preprocessing on it, such as word segmentation, part-of-speech tagging, stop-word filtering and part-of-speech screening. Then, the Chinese features of the words can be extracted, and the candidate entity words can be scored and ranked using the Chinese features. Next, the candidate entity words can be recombined to obtain candidate compound entity words, the compound-word features of the candidate compound entity words can be extracted, and the candidate compound entity words can be scored and ranked using the compound-word features. Finally, the candidate entity words and the candidate compound entity words may be ranked together using the scores of the candidate entity words and the scores of the candidate compound entity words.
For the English text, the execution body may perform English preprocessing on it, such as normalizing punctuation and abbreviations in the text, parsing the normalized text to be processed, and extracting noun blocks meeting preset rules according to part of speech to obtain a candidate entity word set. Then, the English features of the candidate entity words in the candidate entity word set are extracted, and the candidate entity words are scored and ranked using the English features.
Finally, the candidate entity words obtained by the Chinese/English entity word mining module and the candidate entity words obtained by the pure English entity mining module can be integrated to obtain target entity words.
With further reference to fig. 6, a flow 600 of yet another embodiment of a method of entity word extraction is shown, comprising the steps of:
And 601, acquiring a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set.
Step 602, extracting word characteristics of each candidate entity word in the candidate entity word set.
In this embodiment, steps 601-602 may be performed in a similar manner to steps 201-202, and will not be described again.
In step 603, text features of the text to be processed are extracted.
In this embodiment, the execution subject (e.g., the server shown in fig. 1) of the entity word extraction method may extract the text features of the text to be processed. Here, the text to be processed is typically a search term (query) input by a user, and in this case the text features may include, but are not limited to, at least one of the following: the text retrieval frequency, the number of related users of the text, the user viscosity of the text, and the user penetration of the text. The text retrieval frequency generally refers to the number of times the text is searched, the number of related users of the text generally refers to the number of users who searched for the text, the user viscosity of the text generally refers to the ratio of the number of times the text is searched to the number of users who searched for it, and the user penetration of the text generally refers to the ratio of the number of users who searched for the text to the total number of searching users.
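A minimal sketch of these query-level text features, assuming an available search log of (query, user_id) pairs (the log format is an assumption):

```python
# Sketch of the query-level text features described above, computed from an
# assumed search log of (query, user_id) pairs.
def text_features(query: str, search_log: list[tuple[str, str]]) -> dict[str, float]:
    freq = sum(1 for q, _ in search_log if q == query)              # retrieval frequency
    users = {u for q, u in search_log if q == query}                # related users
    all_users = {u for _, u in search_log}                          # all searching users
    return {
        "retrieval_frequency": float(freq),
        "related_users": float(len(users)),
        "user_viscosity": freq / len(users) if users else 0.0,      # searches per user
        "user_penetration": len(users) / len(all_users) if all_users else 0.0,
    }
```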
Step 604, extracting word features of entity words in the target entity word set.
In this embodiment, the execution body may extract word features of the entity words in the target entity word set. Here, the target entity word set may be a maintained mined entity word set. The term features may include, but are not limited to, at least one of the following: the word length, the word part of speech, the word vector of the word, whether the word is a common word, the word frequency of the word in a preset internal corpus, the word vector of the word in a preset internal corpus, the word frequency of the word in a preset external corpus and the word vector of the word in a preset external corpus.
It should be further noted that the text corresponding to an entity word in the target entity word set may be the entity word itself. In this case, the word characteristics may further include: the retrieval frequency of the term, the number of related users of the term, the user viscosity of the term, and the user penetration of the term. The retrieval frequency of the term generally refers to the number of times the term is searched, the number of related users of the term generally refers to the number of users who searched for the term, the user viscosity of the term generally refers to the ratio of the number of times the term is searched to the number of users who searched for it, and the user penetration of the term generally refers to the ratio of the number of users who searched for the term to the total number of searching users.
Step 605, a feature space is constructed by using the word features of the entity words in the candidate entity word set, the word features and text features of the entity words in the target entity word set.
In this embodiment, the execution body may construct the feature space by using the word features of the entity words in the candidate entity word set, the word features of the entity words in the target entity word set, and the text features. The feature space is the space in which the feature vector is located, and in the feature space, each feature corresponds to one-dimensional coordinates in the feature space. Here, the coordinate system of the feature space may be constructed by a word feature and a text feature, i.e., the physical meaning of the coordinate axes of the feature space and the word feature or the text feature form a one-to-one correspondence.
As an example, if the word features include the eight features listed above and the text features include the four features listed above, the execution subject may construct 12 coordinate axes to form a 12-dimensional feature space. The execution subject may then set, in the constructed feature space, location points representing the entity words in the candidate entity word set and location points representing the entity words in the target entity word set, according to the word features and text features of each entity word in the candidate entity word set and the word features of each entity word in the target entity word set.
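A minimal sketch of how such a feature space might be assembled is given below, representing each entity word as a point whose coordinates follow a fixed feature order. The axis list, the omission of part of speech and word vectors (which would need their own numeric encoding), and the dictionary-based bookkeeping are all simplifying assumptions of this sketch.

```python
# Assumed ordering of the scalar word features and text features used as axes;
# part of speech and word vectors are left out of this simplified sketch.
FEATURE_AXES = [
    "length", "is_common_word", "internal_frequency", "external_frequency",
    "retrieval_frequency", "related_user_count",
    "user_stickiness", "user_penetration",
]

def to_point(features):
    """Map a feature dict to a point with one coordinate per feature axis."""
    return tuple(float(features[name]) for name in FEATURE_AXES)

def build_feature_space(candidate_features, target_features):
    """Place candidate and known (target-set) entity words in the feature space."""
    candidate_points = {w: to_point(f) for w, f in candidate_features.items()}
    target_points = {w: to_point(f) for w, f in target_features.items()}
    return candidate_points, target_points
```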
Step 606, selecting a target entity word from the candidate entity word set based on the feature space, and outputting the target entity word.
In this embodiment, the execution subject may select the target entity word from the candidate entity word set based on the feature space, and output the target entity word. Here, for each candidate entity word in the candidate entity word set, the execution subject may search the feature space for the entity word in the target entity word set that is closest to the candidate entity word, determine the distance between that entity word and the candidate entity word, and, if the distance is greater than a target distance, determine the candidate entity word as a target entity word.
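As a sketch only, the nearest-distance rule just described could look as follows; the use of Euclidean distance and the dictionary-of-points representation are assumptions carried over from the earlier sketch.

```python
import math

def select_by_nearest_distance(candidate_points, target_points, target_distance):
    """Select candidates whose nearest known entity word lies farther away than
    `target_distance`, following the rule described above."""
    selected = []
    for word, point in candidate_points.items():
        # Distance to the closest entity word in the target entity word set.
        nearest = min(math.dist(point, t) for t in target_points.values())
        if nearest > target_distance:
            selected.append(word)
    return selected
```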
As can be seen from fig. 6, compared with the embodiment corresponding to fig. 2, the flow 600 of the entity word extraction method in this embodiment embodies the steps of constructing a feature space and selecting a target entity word in the feature space. Therefore, the scheme described in the embodiment can further improve the accuracy of entity word extraction.
In some optional implementations, the execution subject may select the target entity word from the candidate entity word set based on the feature space as follows. The execution subject may classify the words in the feature space using a K-Nearest Neighbor (KNN) classification algorithm. The idea behind the K-nearest neighbor classification algorithm is that, in the feature space, if most of the k samples nearest to a given sample belong to a certain class, then that sample also belongs to that class. For each candidate entity word in the candidate entity word set, the execution subject may determine, in the feature space, the number of entity words in the target entity word set whose distance to the candidate entity word is less than a preset distance threshold; it may then determine whether this number is greater than a preset number threshold (e.g., 10); if so, the execution subject may determine the candidate entity word as a target entity word.
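A sketch of this neighbor-count rule is given below, reusing the point representation from the earlier sketches; the Euclidean metric and the default threshold of 10 are illustrative assumptions.

```python
import math

def select_by_neighbor_count(candidate_points, target_points,
                             distance_threshold, count_threshold=10):
    """KNN-flavoured selection: keep a candidate when more than
    `count_threshold` known entity words lie within `distance_threshold`
    of it in the feature space."""
    selected = []
    for word, point in candidate_points.items():
        close = sum(
            1 for t in target_points.values()
            if math.dist(point, t) < distance_threshold
        )
        if close > count_threshold:
            selected.append(word)
    return selected
```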
With continued reference to fig. 7, a schematic diagram of yet another embodiment of the entity word extraction method is shown. In fig. 7, the execution subject of the entity word extraction method may obtain query data and then classify the query data by language to obtain Chinese queries and English queries.
For a Chinese query, the execution subject may use NER (named entity recognition) to recognize entity words in the Chinese query, and then remove name entity words from the recognized entity words to obtain candidate Chinese entity words.
For an English query, the execution subject may recognize entity words in the English query using an English business entity word recognition method (a text matching technique), and then remove name entity words and common words from the recognized entity words to obtain candidate English entity words.
The execution subject may then merge the candidate Chinese entity words and the candidate English entity words, and filter them against the existing entity word list to remove entity words that have already been recorded. For the candidate entity words not in the existing entity word list, features are extracted, the candidate entity words are identified using the extracted features, and the identified entity words are added to the enterprise entity word list.
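The overall fig. 7 flow can be summarized as the skeleton below. Every injected callable (the NER model, the text matcher, the name-entity and common-word filters, the feature extractor, and the identifier) is an assumed stand-in for a component that the embodiment leaves to the implementer, and the CJK-range language check is a crude placeholder.

```python
def contains_chinese(query):
    """Crude language check: treat a query containing CJK characters as Chinese."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in query)

def mine_entity_words(queries, existing_entity_words, chinese_ner, english_match,
                      is_name_entity, is_common_word, extract_features, identify):
    """Skeleton of the fig. 7 flow with assumed, pluggable components."""
    candidates = set()
    for query in queries:
        if contains_chinese(query):
            words = [w for w in chinese_ner(query) if not is_name_entity(w)]
        else:
            words = [w for w in english_match(query)
                     if not is_name_entity(w) and not is_common_word(w)]
        candidates.update(words)

    # Filter against the existing entity word list.
    candidates -= set(existing_entity_words)

    # Extract features for the remaining candidates and keep those that are
    # identified as entity words.
    return [w for w in candidates if identify(w, extract_features(w))]
```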
Here, when extracting features of the candidate entity words, both the internal corpus and the external corpus need to be used, so as to extract word features such as the word frequency of a candidate entity word in the internal corpus, its word frequency in the external corpus, its word vector in the internal corpus, and its word vector in the external corpus.
With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an entity word extraction apparatus, where an embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 8, the entity word extraction apparatus 800 of the present embodiment includes: an acquisition unit 801, a first extraction unit 802, and a selection unit 803. The obtaining unit 801 is configured to obtain a text to be processed, and perform preset processing on the text to be processed to obtain a candidate entity word set; the first extracting unit 802 is configured to extract word features of each candidate entity word in the candidate entity word set; the selecting unit 803 is configured to select a target entity word from the candidate entity word set based on the word feature, and output the target entity word.
In the present embodiment, specific processes of the acquisition unit 801, the first extraction unit 802, and the selection unit 803 of the entity word extraction apparatus 800 may refer to steps 201, 202, and 203 in the corresponding embodiment of fig. 2.
In some alternative implementations, the selecting unit 803 may be further configured to select the target entity word from the candidate entity word set based on the word characteristics as follows. The selecting unit 803 may determine a word weight for each candidate entity word in the candidate entity word set based on the position information of the candidate entity word in the text to be processed; then, for each candidate entity word in the candidate entity word set, determine the score of the candidate entity word based on the word characteristics and the word weight of the candidate entity word; and then select a target entity word from the candidate entity word set based on the score of each candidate entity word in the candidate entity word set.
In some optional implementations, the text to be processed is a Chinese text or a Chinese-English mixed text, and the word characteristics include the inverse document frequency of the word, the word frequency-inverse document frequency of the word, the ratio of the N-Gram score of the word in the text to be processed to that in a target corpus, and the ratio of the perplexity of the word in the text to be processed to that in the target corpus. The selecting unit 803 may be further configured to determine the score of a candidate entity word based on its word characteristics and word weight as follows: the selecting unit 803 may perform a weighted summation over the inverse document frequency of the candidate entity word, the word frequency-inverse document frequency of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the text to be processed to that in the target corpus, and the ratio of the perplexity of the candidate entity word in the text to be processed to that in the target corpus, and multiply the summation result by the word weight of the candidate entity word to obtain the score of the candidate entity word.
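A minimal sketch of this scoring rule is shown below. The four feature names, the equal default coefficients, and the externally supplied word weight are assumptions of the sketch; the description fixes only the structure (a weighted sum of the features multiplied by the word weight).

```python
def score_candidate(features, word_weight, coeffs=(0.25, 0.25, 0.25, 0.25)):
    """Score = (weighted sum of the four features) * word weight.

    `features` is assumed to hold: idf, tfidf, ngram_ratio, perplexity_ratio.
    The coefficients are illustrative; their values are not fixed here.
    """
    a, b, c, d = coeffs
    weighted_sum = (a * features["idf"]
                    + b * features["tfidf"]
                    + c * features["ngram_ratio"]
                    + d * features["perplexity_ratio"])
    return weighted_sum * word_weight

# Toy usage
print(score_candidate(
    {"idf": 3.2, "tfidf": 0.8, "ngram_ratio": 1.4, "perplexity_ratio": 0.6},
    word_weight=1.2))
```

The English-text variant described next has the same structure, with the keyword extraction score replacing the two document-frequency features.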
In some optional implementations, the text to be processed is an English text, and the word characteristics include the keyword extraction score of the word, the ratio of the N-Gram score of the word in the text to be processed to that in a target corpus, and the ratio of the perplexity of the word in the text to be processed to that in the target corpus. The selecting unit 803 may be further configured to determine the score of a candidate entity word based on its word characteristics and word weight as follows: the selecting unit 803 may perform a weighted summation over the keyword extraction score of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the text to be processed to that in the target corpus, and the ratio of the perplexity of the candidate entity word in the text to be processed to that in the target corpus, and multiply the summation result by the word weight of the candidate entity word to obtain the score of the candidate entity word.
In some alternative implementations, the text to be processed is a Chinese text or a Chinese-English mixed text; and the selecting unit 803 may be further configured to select the target entity word from the candidate entity word set based on the score of each candidate entity word in the candidate entity word set as follows: the selecting unit 803 may update the candidate entity word set based on the score and word characteristics of each candidate entity word in the candidate entity word set, and select the target entity word from the updated candidate entity word set.
In some optional implementations, the selecting unit 803 may be further configured to update the candidate entity word set based on the score and word characteristics of each candidate entity word in the candidate entity word set, and select the target entity word from the updated candidate entity word set, as follows. The selecting unit 803 may perform the following entity word selection step based on the candidate entity word set: selecting entity words meeting preset conditions from the candidate entity word set, and combining the entity words meeting the conditions to obtain at least one word combination; determining, among the at least one word combination, the word combinations that appear in the text to be processed as candidate compound entity words, and adding the candidate compound entity words to the candidate entity word set; determining, for each candidate compound entity word, the score of the candidate compound entity word based on the scores of the candidate entity words that make up the candidate compound entity word; updating the added candidate entity word set based on the score and word characteristics of the candidate compound entity word and the score and word characteristics of each candidate entity word in the candidate entity word set; determining whether the updated candidate entity word set is the same as the candidate entity word set; and if not, taking the updated candidate entity word set as the candidate entity word set and continuing to execute the entity word selection step.
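The iterative selection step can be sketched as the loop below. The preset condition, the compound scoring rule, and the set-update rule are passed in as assumed helpers, and the simple string-concatenation test for whether a combination appears in the text is an assumption of this sketch.

```python
def entity_word_selection_loop(candidates, text, meets_condition,
                               compound_score, update_candidates):
    """Sketch of the iterative selection step.

    `candidates` maps each candidate entity word to a dict holding at least
    its "score"; `meets_condition`, `compound_score`, and `update_candidates`
    are assumed helpers standing in for the preset condition, the compound
    scoring rule, and the set-update rule described in the text.
    """
    while True:
        before = set(candidates)

        # Combine eligible words; combinations that occur in the text become
        # candidate compound entity words (concatenation order is an assumption).
        eligible = [w for w in candidates if meets_condition(candidates[w])]
        for i, w1 in enumerate(eligible):
            for w2 in eligible[i + 1:]:
                for compound in (w1 + w2, w2 + w1):
                    if compound in text and compound not in candidates:
                        candidates[compound] = {
                            "score": compound_score(candidates[w1], candidates[w2]),
                            "compound_of": (w1, w2),
                        }

        candidates = update_candidates(candidates)

        # Stop once the candidate entity word set no longer changes.
        if set(candidates) == before:
            return candidates
```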
In some optional implementations, if the updated candidate entity word set is the same as the candidate entity word set, the selecting unit 803 may select the target entity word from the updated candidate entity word set based on the score of each candidate entity word in the updated candidate entity word set.
In some alternative implementations, the term characteristics include term frequency of the term; and the selecting unit 803 may be further configured to determine the score of the candidate compound entity word based on the scores of the candidate entity words constituting the candidate compound entity word by: the selecting unit 803 may perform weighted summation on the scores of the two candidate entity words that form the candidate compound entity word to obtain the score of the candidate compound entity word, where, for each candidate entity word in the two candidate entity words that form the candidate compound entity word, the weight corresponding to the candidate entity word is the ratio of the word frequency of the candidate entity word to the total word frequency, and the total word frequency is the sum of the word frequencies of the two candidate entity words that form the candidate compound entity word.
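A direct rendering of this frequency-weighted combination, usable as the `compound_score` helper in the loop sketched earlier, might look as follows; representing each candidate entity word as a dict with "score" and "term_frequency" entries is an assumption of the sketch.

```python
def compound_score(word_a, word_b):
    """Frequency-weighted sum of the two component scores.

    Each argument is assumed to be a dict with "score" and "term_frequency".
    """
    total_freq = word_a["term_frequency"] + word_b["term_frequency"]
    if total_freq == 0:
        return 0.0
    return (word_a["score"] * word_a["term_frequency"]
            + word_b["score"] * word_b["term_frequency"]) / total_freq
```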
In some optional implementations, the selecting unit 803 may be further configured to update the added candidate entity word set based on the score of the candidate compound entity word, the word feature of the candidate compound entity word, the score and the word feature of each candidate entity word in the candidate entity word set, as follows: the selecting unit 803 may screen candidate entity words in the added candidate entity word set based on the score of each candidate entity word, to generate a candidate entity word subset; then, for each candidate compound entity word, determining whether at least two candidate entity words forming the candidate compound entity word exist in the candidate entity word subset; if yes, deleting the candidate entity words with low scores in at least two candidate entity words forming the candidate compound entity word from the candidate entity word subset.
In some alternative implementations, the word characteristics include the ratio of the N-Gram score of the word in the text to be processed to that in the target corpus and the ratio of the perplexity of the word in the text to be processed to that in the target corpus; and the selecting unit 803 may be further configured to update the added candidate entity word set based on the score of the candidate compound entity word, the word characteristics of the candidate compound entity word, and the score and word characteristics of each candidate entity word in the candidate entity word set, as follows: if one candidate entity word composing the candidate compound entity word exists in the candidate entity word subset, determining whether a first ratio corresponding to the candidate compound entity word is greater than the first ratio corresponding to that candidate entity word, and determining whether a second ratio corresponding to the candidate compound entity word is greater than the second ratio corresponding to that candidate entity word, wherein the first ratio is the ratio of the N-Gram score of a word in the text to be processed to that in the target corpus, and the second ratio is the ratio of the perplexity of a word in the text to be processed to that in the target corpus; and if the first ratio corresponding to the candidate compound entity word is greater than the first ratio corresponding to the candidate entity word composing it and/or the second ratio corresponding to the candidate compound entity word is greater than the second ratio corresponding to the candidate entity word composing it, deleting that candidate entity word from the candidate entity word subset.
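The ratio-based pruning rule can be sketched as below, and could serve as one piece of the `update_candidates` helper assumed earlier; the dictionary keys ("ngram_ratio", "perplexity_ratio", "compound_of") and the treatment of the and/or condition as a logical OR are assumptions of this sketch.

```python
def prune_component_words(candidate_subset, compound_words):
    """Remove a component word from the candidate subset when its compound has a
    higher N-Gram score ratio and/or a higher perplexity ratio.

    Both arguments map words to feature dicts.
    """
    for info in compound_words.values():
        for component in info["compound_of"]:
            if component not in candidate_subset:
                continue
            higher_ngram = info["ngram_ratio"] > candidate_subset[component]["ngram_ratio"]
            higher_ppl = info["perplexity_ratio"] > candidate_subset[component]["perplexity_ratio"]
            if higher_ngram or higher_ppl:
                del candidate_subset[component]
    return candidate_subset
```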
In some optional implementations, the obtaining unit 801 may be further configured to perform a preset process on the text to be processed to obtain a candidate entity word set by: the obtaining unit 801 may perform language recognition on the text to be processed; if the text to be processed is a Chinese text or a Chinese-English mixed text, identifying entity words in the text to be processed by using a named entity identification technology, and deleting entity words in the target entity word set from the identified entity words to obtain a candidate entity word set.
In some optional implementations, the obtaining unit 801 may be further configured to perform a preset process on the text to be processed to obtain a candidate entity word set by: the obtaining unit 801 may perform language recognition on the text to be processed; if the text to be processed is an English text, identifying entity words in the text to be processed by using a text matching technology, and removing entity words in the target entity word set from the identified entity words to obtain a candidate entity word set.
In some alternative implementations, the entity word extraction apparatus 800 may further include a second extraction unit (not shown in the figure) and a third extraction unit (not shown in the figure). The second extracting unit may extract text features of the text to be processed; the third extraction unit may extract word features of the entity words in the target entity word set. The selecting unit 803 may be further configured to select the target entity word from the candidate entity word set based on the word characteristics by: the selecting unit 803 may construct a feature space by using the word features of the entity words in the candidate entity word set, the word features of the entity words in the target entity word set, and the text features; thereafter, a target entity word may be selected from the set of candidate entity words based on the feature space.
In some alternative implementations, the selecting unit 803 may be further configured to select the target entity word from the candidate entity word set based on the feature space as follows: for each candidate entity word in the candidate entity word set, the selecting unit 803 may determine, in the feature space, the number of entity words in the target entity word set whose distance to the candidate entity word is smaller than a preset distance threshold; thereafter, it may be determined whether the number is greater than a preset number threshold; if so, the candidate entity word may be determined to be the target entity word.
Referring now to fig. 9, a schematic diagram of an electronic device (e.g., server in fig. 1) 900 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 9 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 9, the electronic device 900 may include a processing means (e.g., a central processor, a graphics processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored. The processing device 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
In general, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the electronic device 900 to communicate wirelessly or by wire with other devices to exchange data. While fig. 9 shows an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 9 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When executed by the processing device 901, performs the above-described functions defined in the methods of the embodiments of the present disclosure. It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set; extracting word characteristics of each candidate entity word in the candidate entity word set; and selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, an extraction unit, and a selection unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the first extraction unit may also be described as "a unit that extracts word features of respective candidate entity words in the candidate entity word set".
The foregoing description is only of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the invention. For example, the above-described features may be interchanged with (but are not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (15)

1. An entity word extraction method, characterized by comprising the following steps:
acquiring a text to be processed, and carrying out preset processing on the text to be processed to obtain a candidate entity word set;
Extracting word characteristics of each candidate entity word in the candidate entity word set;
Selecting a target entity word from the candidate entity word set based on the word characteristics, and outputting the target entity word, wherein the method comprises the following steps:
Determining the word weight of each candidate entity word in the candidate entity word set based on the position information of the entity word in the text to be processed;
determining, for each candidate entity word in the set of candidate entity words, a score for the candidate entity word based on the word characteristics and the word weights of the candidate entity word;
Selecting a target entity word from the set of candidate entity words based on the score of each candidate entity word in the set of candidate entity words;
The text to be processed is an English text, and the word characteristics comprise a keyword extraction score of the word, a ratio of the N-Gram score of the word in the text to be processed to that in a target corpus, and a ratio of the perplexity of the word in the text to be processed to that in the target corpus; and
The determining the score of the candidate entity word based on the word characteristics and the word weights of the candidate entity word comprises the following steps:
And carrying out weighted summation on the keyword extraction score of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the text to be processed to that in the target corpus, and the ratio of the perplexity of the candidate entity word in the text to be processed to that in the target corpus, and multiplying the summation result by the word weight of the candidate entity word to obtain the score of the candidate entity word.
2. The method according to claim 1, wherein the text to be processed is a Chinese text or a Chinese-English mixed text, and the word characteristics comprise an inverse document frequency of the word, a word frequency-inverse document frequency of the word, a ratio of the N-Gram score of the word in the text to be processed to that in the target corpus, and a ratio of the perplexity of the word in the text to be processed to that in the target corpus; and
The determining the score of the candidate entity word based on the word characteristics and the word weights of the candidate entity word comprises the following steps:
and carrying out weighted summation on the inverse document frequency of the candidate entity word, the word frequency-inverse document frequency of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the text to be processed to that in the target corpus, and the ratio of the perplexity of the candidate entity word in the text to be processed to that in the target corpus, and multiplying the summation result by the word weight of the candidate entity word to obtain the score of the candidate entity word.
3. The method according to claim 1, wherein the text to be processed is a Chinese text or a Chinese-English mixed text; and
The selecting a target entity word from the candidate entity word set based on the score of each candidate entity word in the candidate entity word set includes:
updating the candidate entity word set based on the score and the word characteristics of each candidate entity word in the candidate entity word set, and selecting a target entity word from the updated candidate entity word set.
4. The method of claim 3, wherein the updating the set of candidate entity words based on the score and word characteristics of each candidate entity word in the set of candidate entity words, and selecting the target entity word from the updated set of candidate entity words, comprises:
Based on the candidate entity word set, the following entity word selection steps are executed: selecting entity words meeting preset conditions from the candidate entity word set, and combining the entity words meeting the conditions to obtain at least one word combination; determining word combinations appearing in the text to be processed in the at least one word combination as candidate compound entity words, and adding the candidate compound entity words into a candidate entity word set; determining, for each candidate compound entity word, a score for the candidate compound entity word based on the scores of the candidate entity words that make up the candidate compound entity word; updating the added candidate entity word set based on the score of the candidate compound entity word, the word characteristics of the candidate compound entity word, the score and the word characteristics of each candidate entity word in the candidate entity word set; determining whether the updated candidate entity word set is the same as the candidate entity word set;
If not, taking the updated candidate entity word set as the candidate entity word set, and continuing to execute the entity word selection step.
5. The method of claim 4, wherein after said determining if the updated set of candidate entity words is the same as the set of candidate entity words, the method further comprises:
if so, selecting a target entity word from the updated candidate entity word set based on the scores of the candidate entity words in the updated candidate entity word set.
6. The method of claim 4, wherein the term characteristics include term frequencies of terms; and
The determining the score of the candidate compound entity word based on the score of the candidate entity word composing the candidate compound entity word comprises:
And carrying out weighted summation on the scores of the two candidate entity words forming the candidate compound entity word to obtain the score of the candidate compound entity word, wherein the weight corresponding to each candidate entity word in the two candidate entity words forming the candidate compound entity word is the ratio of the word frequency of the candidate entity word to the total word frequency, and the total word frequency is the sum of the word frequencies of the two candidate entity words forming the candidate compound entity word.
7. The method of claim 4, wherein updating the added set of candidate entity words based on the score of the candidate compound entity word, the word characteristics of the candidate compound entity word, the score and the word characteristics of each candidate entity word in the set of candidate entity words comprises:
Screening the candidate entity words in the added candidate entity word set based on the scores of the candidate entity words to generate a candidate entity word subset;
Determining, for each candidate compound entity word, whether there are at least two candidate entity words constituting the candidate compound entity word in the subset of candidate entity words;
If yes, deleting the candidate entity words with low scores in at least two candidate entity words forming the candidate compound entity word from the candidate entity word subset.
8. The method of claim 7, wherein the term characteristics include a ratio of the N-Gram score of the term in the text to be processed to that in the target corpus and a ratio of the perplexity of the term in the text to be processed to that in the target corpus; and
After determining, for each candidate compound entity word, whether there are at least two candidate entity words in the subset of candidate entity words that make up the candidate compound entity word, the method further includes:
If one candidate entity word composing the candidate compound entity word exists in the candidate entity word subset, determining whether a first ratio corresponding to the candidate compound entity word is greater than the first ratio corresponding to the candidate entity word composing the candidate compound entity word, and determining whether a second ratio corresponding to the candidate compound entity word is greater than the second ratio corresponding to the candidate entity word composing the candidate compound entity word, wherein the first ratio is the ratio of the N-Gram score of a word in the text to be processed to that in the target corpus, and the second ratio is the ratio of the perplexity of a word in the text to be processed to that in the target corpus;
And deleting the candidate entity word composing the candidate compound entity word from the candidate entity word subset if the first ratio corresponding to the candidate compound entity word is greater than the first ratio corresponding to the candidate entity word composing the candidate compound entity word and/or the second ratio corresponding to the candidate compound entity word is greater than the second ratio corresponding to the candidate entity word composing the candidate compound entity word.
9. The method of claim 1, wherein the performing preset processing on the text to be processed to obtain a candidate entity word set includes:
performing language identification on the text to be processed;
if the text to be processed is a Chinese text or a Chinese-English mixed text, identifying entity words in the text to be processed by using a named entity identification technology, and deleting entity words in a target entity word set from the identified entity words to obtain a candidate entity word set.
10. The method of claim 1, wherein the performing preset processing on the text to be processed to obtain a candidate entity word set includes:
performing language identification on the text to be processed;
If the text to be processed is English text, identifying entity words in the text to be processed by using a text matching technology, and removing entity words in the target entity word set from the identified entity words to obtain a candidate entity word set.
11. The method of claim 1, wherein prior to said selecting a target entity word from said set of candidate entity words based on said word characteristics, said method further comprises:
extracting text characteristics of the text to be processed;
extracting word characteristics of entity words in the target entity word set; and
The selecting, based on the word characteristics, a target entity word from the candidate entity word set, including:
constructing a feature space by utilizing the word features of the entity words in the candidate entity word set, the word features of the entity words in the target entity word set and the text features;
And selecting a target entity word from the candidate entity word set based on the feature space.
12. The method of claim 11, wherein the selecting a target entity word from the set of candidate entity words based on the feature space comprises:
For each candidate entity word in the candidate entity word set, determining the number of entity words existing in the target entity word set whose distance to the candidate entity word is smaller than a preset distance threshold;
determining whether the number is greater than a preset number threshold;
if yes, the candidate entity word is determined to be the target entity word.
13. An entity word extraction device, comprising:
The device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text to be processed, and performing preset processing on the text to be processed to obtain a candidate entity word set;
The first extraction unit is used for extracting the word characteristics of each candidate entity word in the candidate entity word set;
The selecting unit is configured to select a target entity word from the candidate entity word set based on the word feature, and output the target entity word, and includes:
Determining the word weight of each candidate entity word in the candidate entity word set based on the position information of the entity word in the text to be processed;
determining, for each candidate entity word in the set of candidate entity words, a score for the candidate entity word based on the word characteristics and the word weights of the candidate entity word;
Selecting a target entity word from the set of candidate entity words based on the score of each candidate entity word in the set of candidate entity words;
The text to be processed is an English text, and the word characteristics comprise a keyword extraction score of the word, a ratio of the N-Gram score of the word in the text to be processed to that in a target corpus, and a ratio of the perplexity of the word in the text to be processed to that in the target corpus; and
The selecting unit is further configured to determine a score of the candidate entity word based on the word feature and the word weight of the candidate entity word by:
And carrying out weighted summation on the keyword extraction score of the candidate entity word, the ratio of the N-Gram score of the candidate entity word in the text to be processed to that in the target corpus, and the ratio of the perplexity of the candidate entity word in the text to be processed to that in the target corpus, and multiplying the summation result by the word weight of the candidate entity word to obtain the score of the candidate entity word.
14. An electronic device, comprising:
one or more processors;
A storage device having one or more programs stored thereon,
When executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-12.
15. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-12.
CN202111007981.9A 2021-08-30 2021-08-30 Entity word extraction method and device and electronic equipment Active CN113743107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007981.9A CN113743107B (en) 2021-08-30 2021-08-30 Entity word extraction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111007981.9A CN113743107B (en) 2021-08-30 2021-08-30 Entity word extraction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113743107A CN113743107A (en) 2021-12-03
CN113743107B true CN113743107B (en) 2024-06-21

Family

ID=78734050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007981.9A Active CN113743107B (en) 2021-08-30 2021-08-30 Entity word extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113743107B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814486A (en) * 2020-07-10 2020-10-23 东软集团(上海)有限公司 Enterprise client tag generation method, system and device based on semantic analysis
CN112148881A (en) * 2020-10-22 2020-12-29 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN112395395A (en) * 2021-01-19 2021-02-23 平安国际智慧城市科技股份有限公司 Text keyword extraction method, device, equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495B (en) * 2011-04-11 2014-04-02 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN108763196A (en) * 2018-05-03 2018-11-06 上海海事大学 A kind of keyword extraction method based on PMI
CN109062895B (en) * 2018-07-23 2022-06-24 挖财网络技术有限公司 Intelligent semantic processing method
CN110188344A (en) * 2019-04-23 2019-08-30 浙江工业大学 A kind of keyword extracting method of multiple features fusion
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN110472240A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Text feature and device based on TF-IDF
CN110874530B (en) * 2019-10-30 2023-06-13 深圳价值在线信息科技股份有限公司 Keyword extraction method, keyword extraction device, terminal equipment and storage medium
GB2601517A (en) * 2020-12-02 2022-06-08 Silver Bullet Media Services Ltd A method, apparatus and program for classifying subject matter of content in a webpage
CN112905771A (en) * 2021-02-10 2021-06-04 北京邮电大学 Characteristic keyword extraction method based on part of speech and position
CN113268995B (en) * 2021-07-19 2021-11-19 北京邮电大学 Chinese academy keyword extraction method, device and storage medium

Also Published As

Publication number Publication date
CN113743107A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
US20200073996A1 (en) Methods and Systems for Domain-Specific Disambiguation of Acronyms or Homonyms
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
CN112188312B (en) Method and device for determining video material of news
WO2023024975A1 (en) Text processing method and apparatus, and electronic device
Pinto et al. Real time sentiment analysis of political twitter data using machine learning approach
CN114357117A (en) Transaction information query method and device, computer equipment and storage medium
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN113722438A (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN112632285A (en) Text clustering method and device, electronic equipment and storage medium
CN112506864A (en) File retrieval method and device, electronic equipment and readable storage medium
Nasser et al. n-Gram based language processing using Twitter dataset to identify COVID-19 patients
Golpar-Rabooki et al. Feature extraction in opinion mining through Persian reviews
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN109902152B (en) Method and apparatus for retrieving information
CN107766498A (en) Method and apparatus for generating information
Bokolo et al. Cyberbullying detection on social media using machine learning
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
Islam et al. An in-depth exploration of Bangla blog post classification
Hussain et al. A technique for perceiving abusive bangla comments
CN111555960A (en) Method for generating information
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN113743107B (en) Entity word extraction method and device and electronic equipment
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant