CN110837730B - Method and device for determining unknown entity vocabulary - Google Patents

Publication number
CN110837730B
Authority
CN
China
Prior art keywords
entity
vocabulary
candidate
word
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911066527.3A
Other languages
Chinese (zh)
Other versions
CN110837730A (en)
Inventor
付骁弈
徐猛
张�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201911066527.3A priority Critical patent/CN110837730B/en
Publication of CN110837730A publication Critical patent/CN110837730A/en
Application granted granted Critical
Publication of CN110837730B publication Critical patent/CN110837730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/08Auctions

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Accounting & Taxation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method and a device for determining unknown entity words, wherein the method comprises the following steps: acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. The method for determining the unknown entity words can judge whether the unknown words are entity words or not while finding the unknown words, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown words are entity words or not so as to improve the accuracy of determining the unknown entity words.

Description

Method and device for determining unknown entity vocabulary
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for determining an unknown entity vocabulary.
Background
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. Within natural language processing, the discovery of unknown entity words is a basic task: judging which character fragments in a batch of corpora may be unknown entity words.
In industries such as bid monitoring and risk early warning, the discovery of unknown entity vocabulary has broad application prospects and plays a decisive role in the precision of downstream tasks such as word segmentation. However, the prior art often cannot determine whether an unknown word is an entity word, or cannot reach the target accuracy in a specific field.
Therefore, accurately determining unknown entity vocabulary is an urgent problem to be solved.
Disclosure of Invention
In view of this, an object of the present application is to provide a method and an apparatus for determining an unknown entity vocabulary, which determine whether the unknown vocabulary is an entity vocabulary while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while determining whether the unknown vocabulary is an entity vocabulary, so as to improve accuracy of determining the unknown entity vocabulary.
In a first aspect, an embodiment of the present application provides a method for determining an unknown entity vocabulary, including:
acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set;
generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set;
and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies.
In an optional implementation manner, the generating a domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
In an alternative embodiment, the result of the word segmentation process includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
based on the word segmentation processing result, determining a plurality of candidate unknown entity words, including:
based on the term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
any of the candidate unknown entity words belongs to the candidate entity word set.
In an alternative embodiment, the determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies includes:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
If the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
In a second aspect, an embodiment of the present application further provides a device for determining an unknown entity vocabulary, where the device for determining an unknown entity vocabulary includes: the device comprises an acquisition module, a generation module, a processing module and a determination module, wherein:
the acquisition module is used for acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set;
the generation module is used for generating a domain word stock based on a plurality of linguistic data belonging to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
The processing module is used for carrying out word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock, and determining a plurality of candidate unknown entity vocabularies based on the word segmentation processing result; any one of the candidate unknown entity words belongs to the candidate entity word set;
and the determining module is used for determining at least one target unknown entity vocabulary from the candidate unknown entity vocabularies.
In an optional implementation manner, the generating module is specifically configured to, when generating a domain lexicon based on a plurality of corpora that belong to the same domain as the text to be analyzed:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
In an alternative embodiment, the result of the word segmentation process includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
the determining module is specifically configured to, when determining a plurality of candidate unknown entity vocabularies based on a result of the word segmentation process:
based on the term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
any of the candidate unknown entity words belongs to the candidate entity word set.
In an alternative embodiment, the determining module is specifically configured to, when determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
Under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
In a third aspect, embodiments of the present application further provide an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any of the possible implementations of the first aspect.
In a fourth aspect, the embodiments of the present application further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementation manners of the first aspect.
According to the method and the device for determining the unknown entity vocabulary, the text to be analyzed is obtained, the text to be analyzed is input into the entity recognition model trained in advance, and the entity vocabulary in the text to be analyzed is recognized to obtain the candidate entity vocabulary set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. Compared with the technical method for determining the unknown entity vocabulary in the prior art, the method for determining the unknown entity vocabulary can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating a method for determining unknown entity words provided by an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for verifying correctness of unknown entity vocabulary according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a determining device for determining an unknown entity vocabulary according to an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
Research shows that existing methods generally determine unknown words with a new word discovery model based on the aggregation degree of N words (N-gram). Such a method can only build a general word library: although it can judge whether a word is new by comparison with an existing word library, it cannot give further information about the new word, such as whether it is an entity word. In addition, the existing method is a general statistics-based method, and its accuracy in judging whether an unknown word is an entity word is severely limited in a specific field.
Therefore, in the prior art, it is often impossible to determine whether an unknown word is an entity word or not, or the target accuracy cannot be achieved in a specific field.
Based on the above research, the present application provides a method and an apparatus for determining an unknown entity vocabulary, which can obtain a text to be analyzed, input the text to be analyzed into a pre-trained entity recognition model, and recognize the entity vocabulary in the text to be analyzed to obtain a candidate entity vocabulary set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. The method can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
To facilitate understanding of the present embodiment, a method for determining an unknown entity vocabulary disclosed in the embodiments of the present application is first described in detail. The execution subject of the method may be a computer device.
Example 1
Referring to fig. 1, a flowchart of a method for determining an unknown entity vocabulary according to an embodiment of the present application is shown, where the method includes steps S101 to S104, where:
S101: acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set.
S102: generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus.
S103: based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set.
S104: determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies.
The following describes the above-mentioned steps S101 to S104 in detail.
Step 1: In S101, a text to be analyzed is acquired and input into a pre-trained entity recognition model, and the entity vocabularies in the text to be analyzed are recognized to obtain a candidate entity vocabulary set.
The entity recognition model adopted in the present application should be trained in advance on a domain corpus annotated by experts.
By way of example, the entity recognition model may be a neural network model based on long short-term memory (LSTM) and a conditional random field (CRF), but is not limited to this model.
When the entity vocabulary in the text to be analyzed is identified, a BIO labeling scheme can be adopted, in which the letters stand for B (begin), I (inside) and O (other), respectively.
Illustratively, suppose the following text to be analyzed is obtained: "… after successfully taking a 47.24% equity stake in xx Securities, which has a 30-year history, Guangzhou Development Zone xx Holding Group Co., Ltd. devised a name with Guangzhou characteristics: Knowledge Securities."
Here "xx Securities" may be labeled B_subj I_subj, and "Guangzhou Development Zone xx Holding Group Co., Ltd." may be labeled B_subj I_subj I_subj I_subj I_subj, while all other words are labeled O.
In this way, the entity vocabulary in the text to be analyzed is marked, and the remaining operations can be performed.
Unlike named entity recognition on a general corpus, the entity recognition here only marks entity vocabulary and does not distinguish entity types such as person names, place names, organization names and proper nouns.
After the above operations are completed, the candidate entity vocabulary set may be output.
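The step of turning BIO labels into a candidate entity vocabulary set could look roughly like the following. This is a minimal sketch, not the patented implementation: it assumes the model has already produced one tag per token, and the function name `extract_entities`, the `joiner` parameter and the toy tokens are hypothetical.

```python
from typing import List


def extract_entities(tokens: List[str], tags: List[str], joiner: str = " ") -> List[str]:
    """Collect entity strings from a BIO-tagged token sequence.

    A tag starting with "B" opens an entity, a tag starting with "I"
    continues the open entity, and any other tag (e.g. "O") closes it.
    """
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B"):
            if current:
                entities.append(joiner.join(current))
            current = [token]
        elif tag.startswith("I") and current:
            current.append(token)
        else:
            if current:
                entities.append(joiner.join(current))
            current = []
    if current:
        entities.append(joiner.join(current))
    return entities


# Toy input (hypothetical), loosely following the example sentence above.
tokens = ["xx", "Securities", "is", "held", "by", "xx", "Holding", "Group"]
tags = ["B_subj", "I_subj", "O", "O", "O", "B_subj", "I_subj", "I_subj"]
candidate_entity_set = set(extract_entities(tokens, tags))
```

For Chinese text tagged character by character, passing `joiner=""` would rebuild each entity without spaces.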
Step 2: In S102, a domain word stock is generated based on a plurality of corpora belonging to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus.
Illustratively, a new word discovery model based on N-word (N-gram) aggregation is run on a massive domain corpus. The model, also called a Chinese language model, uses collocation information between adjacent words in the context. When a continuous, unspaced string of pinyin, strokes, or digits representing letters or strokes needs to be converted into a Chinese character string (i.e., a sentence), the model can compute the sentence with the highest probability, so that conversion to Chinese characters is automatic, the user does not need to choose manually, and the ambiguity of many Chinese characters sharing the same pinyin (or stroke string or digit string) is avoided. The model assumes that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so the probability of a whole sentence is the product of the conditional occurrence probabilities of its words. These probabilities can be obtained by directly counting, in the corpus, how many times N words occur together.
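The counting-based probability estimate described above can be illustrated for the simplest case N = 2 (a bigram model). The sketch below is only an illustration under stated assumptions: the start marker `<s>`, the function names and the toy corpus are hypothetical, and no smoothing is applied, so unseen word pairs get probability zero.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count bigram contexts and bigrams, padding each sentence with a start marker."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for words in sentences:
        padded = ["<s>"] + words
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def sentence_probability(words: List[str], context_counts: Counter,
                         bigram_counts: Counter) -> float:
    """Multiply the maximum-likelihood estimates P(cur | prev) over the sentence."""
    prob = 1.0
    padded = ["<s>"] + words
    for prev, cur in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in the corpus
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


# Toy, already-segmented corpus (hypothetical).
corpus = [["tender", "notice", "issued"],
          ["tender", "notice", "withdrawn"],
          ["tender", "issued"]]
contexts, bigrams = train_bigram(corpus)
p = sentence_probability(["tender", "notice", "issued"], contexts, bigrams)
```

Here `p` is the product of three ratios of counts: P(tender | start) = 3/3, P(notice | tender) = 2/3 and P(issued | notice) = 1/2.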
For example, the massive domain corpus and the text to be analyzed in S101 should belong to the same domain, and at the same time, the scale of the massive domain corpus should be far greater than that of the text to be analyzed in S101.
In the step, entity recognition processing is carried out on each corpus to obtain entity vocabularies included in each corpus; and forming the domain word library based on the entity words included in each corpus.
After the above operation is completed, a domain word stock corresponding to the text to be analyzed in the above S101 may be output.
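Forming the domain word stock from per-corpus entity recognition amounts to taking the union of the recognized entity words. The following is a minimal sketch under that reading; `build_domain_lexicon` and `toy_recognizer` are hypothetical names, and the toy recognizer (which treats capitalized words as entities) merely stands in for whatever entity recognition is actually used.

```python
from typing import Callable, Iterable, List, Set


def build_domain_lexicon(corpora: Iterable[str],
                         recognize: Callable[[str], List[str]]) -> Set[str]:
    """Union of the entity words recognized in each corpus."""
    lexicon: Set[str] = set()
    for text in corpora:
        lexicon.update(recognize(text))
    return lexicon


def toy_recognizer(text: str) -> List[str]:
    """Stand-in recognizer (an assumption): capitalized words count as entities."""
    return [word for word in text.split() if word[:1].isupper()]


corpora = ["Acme won the bid",
           "the bid went to Globex",
           "Acme and Initech merged"]
domain_lexicon = build_domain_lexicon(corpora, toy_recognizer)
```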
Step 3: In S103, word segmentation is performed on the corpus based on the candidate entity vocabulary set and the domain word stock, and a plurality of candidate unknown entity vocabularies are determined based on the result of the word segmentation; any one of the candidate unknown entity words belongs to the candidate entity word set.
The candidate entity vocabulary set output in S101 and the domain word stock output in S102 are combined to construct a vocabulary, the constructed vocabulary is input into a Chinese word segmentation tool, and word segmentation is performed to obtain a word segmentation result. Because written Chinese does not separate words explicitly, a continuous character sequence must be recombined into a word sequence according to a certain specification; this process is called word segmentation.
For example, the constructed vocabulary may be input into the Chinese word segmentation tool Jieba. Jieba is a widely used toolkit for Chinese natural language processing that supports functions such as word segmentation, part-of-speech tagging and keyword extraction.
Wherein, the word segmentation processing result comprises: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock.
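To illustrate segmentation restricted to the combined vocabulary, the sketch below swaps in plain forward maximum matching instead of calling Jieba, so that it stays self-contained. It is an assumption about how such a restriction could be realized, not the tool's actual algorithm; the function name and the toy Chinese strings are hypothetical.

```python
from typing import List, Set


def segment_with_vocabulary(text: str, vocabulary: Set[str]) -> List[str]:
    """Forward maximum matching that keeps only words found in the vocabulary."""
    max_len = max((len(word) for word in vocabulary), default=0)
    result: List[str] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible piece first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocabulary:
                match = piece
                break
        if match:
            result.append(match)
            i += len(match)
        else:
            i += 1  # skip a character not covered by the vocabulary
    return result


# Toy data (hypothetical): "知识证券" = "Knowledge Securities", "广州" = "Guangzhou".
candidate_entity_set = {"知识证券"}
domain_lexicon = {"证券", "广州"}
vocabulary = candidate_entity_set | domain_lexicon
words = segment_with_vocabulary("广州知识证券成立", vocabulary)
```

Every returned word belongs to the candidate entity vocabulary set and/or the domain word stock, which matches the property stated above; characters outside the vocabulary are simply dropped.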
After the word segmentation result is obtained, a plurality of candidate unknown entity words can be determined based on the word segmentation result.
Word frequency statistics can be performed on each word segmentation vocabulary appearing in each corpus based on the term frequency-inverse document frequency (TF-IDF) method, to obtain the frequency with which each word segmentation vocabulary appears in each corpus.
TF-IDF is a statistical method for evaluating how important a word is to one document in a document collection or corpus; here, the document is the text to be analyzed.
The importance of a word increases in proportion to the number of times it appears in the text to be analyzed, but decreases with the frequency with which it appears across the corpus. For example, a word may appear many times in the text to be analyzed, but if it also appears with high frequency across the corpus, its importance is low.
A plurality of candidate unknown entity words are then determined from the word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus.
Specifically, the word segmentation vocabularies are sorted by the result of the TF-IDF word frequency statistics, and the several vocabularies with the lowest word frequency are selected as candidate unknown entity vocabularies.
After the above steps are completed, the candidate unknown entity vocabularies can be output.
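The scoring and selection just described could be sketched as follows. Several details are assumptions rather than statements from the text: TF is taken as the count divided by document length, IDF as the natural log of (number of documents / document frequency), scores are summed over documents, and the cut-off `k` and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_scores(segmented_docs: List[List[str]]) -> Dict[str, float]:
    """Sum of per-document TF-IDF values for every segmented word."""
    n_docs = len(segmented_docs)
    doc_freq: Counter = Counter()
    for doc in segmented_docs:
        doc_freq.update(set(doc))  # each document counts once per word
    scores: Dict[str, float] = {}
    for doc in segmented_docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = scores.get(word, 0.0) + tf * idf
    return scores


def select_candidates(scores: Dict[str, float],
                      candidate_entity_set: Set[str], k: int) -> List[str]:
    """Keep words in the candidate entity set and return the k lowest-scoring ones."""
    in_set = [word for word in scores if word in candidate_entity_set]
    return sorted(in_set, key=lambda word: (scores[word], word))[:k]


# Toy segmented corpora (hypothetical).
docs = [["A", "B", "B"], ["A", "C"], ["A", "B"]]
scores = tfidf_scores(docs)
candidates = select_candidates(scores, {"B", "C"}, k=1)
```

Filtering by the candidate entity set before ranking keeps the stated property that every candidate unknown entity word belongs to that set.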
Step 4: In S104, at least one target unknown entity vocabulary is determined from the plurality of candidate unknown entity vocabularies.
The correctness of the candidate unknown entity vocabularies output in S103 is judged.
Illustratively, the candidate unknown entity vocabularies output in S103 are sampled by the bootstrap method (random sampling with replacement), and the correctness of the sampled candidate unknown entity vocabularies is judged.
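If the sampling step above is read as bootstrap sampling, that is, drawing uniformly at random with replacement, it could be sketched as below. The sample size (equal to the number of candidates), the seed and the names are assumptions made for illustration only.

```python
import random
from typing import List, Sequence


def bootstrap_sample(candidates: Sequence[str], rng: random.Random) -> List[str]:
    """Draw len(candidates) items uniformly at random, with replacement."""
    return rng.choices(candidates, k=len(candidates))


# Toy candidate list (hypothetical); a seeded generator makes the draw repeatable.
candidates = ["word_a", "word_b", "word_c", "word_d"]
sample = bootstrap_sample(candidates, random.Random(42))
```

Because the draw is with replacement, a candidate may appear more than once in `sample` and some candidates may not appear at all.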
Referring to fig. 2, a flowchart of a method for verifying the correctness of unknown entity vocabulary provided in an embodiment of the present application is shown. The method verifies the correctness of the candidate unknown entity vocabularies sampled from the output of the method of fig. 1, and includes steps S201 to S205, where:
S201: randomly determining at least one candidate unknown entity vocabulary from the plurality of candidate unknown entity vocabularies as a verification vocabulary.
S202: if the verification vocabulary does not constitute an entity vocabulary, re-selecting sample data to retrain the entity recognition model, and returning to the step of inputting the text to be analyzed into the pre-trained entity recognition model.
S203: if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of performing word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock.
S204: if the verification vocabulary is an unknown vocabulary, completing the current round of verification, and returning to the step of randomly determining at least one candidate unknown entity vocabulary from the plurality of candidate unknown entity vocabularies as the verification vocabulary.
S205: after multiple rounds of verification, taking the candidate unknown entity vocabulary obtained in the last round as the target unknown entity vocabulary.
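The control flow of steps S201 to S205 can be sketched as a loop. This is a simplified illustration under several assumptions: `judge` is a hypothetical callback standing in for the correctness judgment (for example, by a domain expert), `regenerate` is a hypothetical callback standing in for retraining the model or repeating the word segmentation, and the stopping rule (a fixed number of consecutive successful rounds, bounded by `max_steps`) is not specified in the text.

```python
import random
from enum import Enum


class Verdict(Enum):
    NOT_ENTITY = "not_entity"  # S202: the word does not constitute an entity word
    KNOWN = "known"            # S203: the word is an entity word but already known
    UNKNOWN = "unknown"        # S204: the word is an unknown entity word


def verify_candidates(candidates, judge, regenerate, rounds=3, max_steps=100, seed=None):
    """Repeat verification rounds and return the candidates of the last round (S205)."""
    rng = random.Random(seed)
    current = list(candidates)
    passed = 0
    for _ in range(max_steps):
        if passed >= rounds or not current:
            break
        word = rng.choice(current)  # S201: pick a verification word at random
        verdict = judge(word)
        if verdict is Verdict.UNKNOWN:
            passed += 1             # S204: this round is complete
        else:
            # S202 / S203: go back to an earlier step and obtain a new candidate list.
            current = list(regenerate(word, verdict, current))
            passed = 0              # assumption: earlier passes no longer count
    return current
```

In a fuller implementation, `regenerate` would retrain the entity recognition model when the verdict is `NOT_ENTITY`, and would remove the word from the candidate entity vocabulary set and the domain word stock before re-segmenting when the verdict is `KNOWN`.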
In the embodiment of the application, a text to be analyzed is obtained, the text to be analyzed is input into a pre-trained entity recognition model, and entity words in the text to be analyzed are recognized to obtain a candidate entity word set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. Furthermore, the method can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.
Based on the same inventive concept, the embodiment of the present application further provides a device for determining an unknown entity vocabulary corresponding to the above method for determining an unknown entity vocabulary. Since the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated description is omitted.
Example two
Referring to fig. 3, a schematic structural diagram of a device for determining an unknown entity vocabulary according to a second embodiment of the present application is shown, where the device includes: the acquisition module 31, the generation module 32, the processing module 33, and the determination module 34:
the obtaining module 31 is configured to obtain a text to be analyzed, input the text to be analyzed to a pre-trained entity recognition model, and recognize entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set;
a generating module 32, configured to generate a domain word stock based on a plurality of corpora that belong to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
the processing module 33 is configured to perform word segmentation on the corpus based on the candidate entity vocabulary set and the domain vocabulary library, and determine a plurality of candidate unknown entity vocabularies based on a result of the word segmentation; any one of the candidate unknown entity words belongs to the candidate entity word set;
A determining module 34, configured to determine at least one target unknown entity vocabulary from a plurality of the candidate unknown entity vocabularies.
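The way the four modules hand data to one another can be sketched as a small container class. This is only an illustration of the data flow; the class name `UnknownEntityDevice`, the field names, and the callable signatures are assumptions and do not describe the actual structure of the device.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class UnknownEntityDevice:
    # Each field stands in for one module; the signatures are illustrative only.
    acquire: Callable[[str], Set[str]]                              # acquisition module 31
    generate: Callable[[List[str]], Set[str]]                       # generation module 32
    process: Callable[[List[str], Set[str], Set[str]], List[str]]   # processing module 33
    determine: Callable[[List[str]], List[str]]                     # determination module 34

    def run(self, text: str, corpora: List[str]) -> List[str]:
        candidate_set = self.acquire(text)   # candidate entity vocabulary set
        lexicon = self.generate(corpora)     # domain word stock
        candidates = self.process(corpora, candidate_set, lexicon)
        # Every candidate unknown entity word must belong to the candidate set.
        candidates = [w for w in candidates if w in candidate_set]
        return self.determine(candidates)    # target unknown entity vocabulary
```

Passing the modules in as callables keeps the sketch independent of any particular entity recognition model or word segmentation tool.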
In the embodiment of the application, a text to be analyzed is obtained, the text to be analyzed is input into a pre-trained entity recognition model, and entity words in the text to be analyzed are recognized to obtain a candidate entity word set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. Furthermore, the method can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.
In a possible implementation manner, the generating module 32 is specifically configured to, when generating a domain word stock based on a plurality of corpora that belong to the same domain as the text to be analyzed:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
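The two steps above amount to taking the union of the entity words recognized in each corpus. A minimal sketch follows; `build_domain_lexicon` is a hypothetical name, and `toy_recognizer` is a deliberately naive stand-in (it treats capitalized tokens as entities) used only so the example can run without a trained entity recognition model.

```python
def toy_recognizer(text):
    """Hypothetical stand-in for an entity recognition model: capitalized tokens."""
    return {token for token in text.split() if token[:1].isupper()}


def build_domain_lexicon(corpora, recognize_entities):
    """Union of the entity words recognized in each corpus of the same domain."""
    lexicon = set()
    for corpus in corpora:
        lexicon.update(recognize_entities(corpus))
    return lexicon
```

For instance, calling `build_domain_lexicon(["Acme makes Widgets", "Widgets ship to Berlin"], toy_recognizer)` collects `Acme`, `Widgets`, and `Berlin` into one set.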
In a possible implementation manner, the word segmentation processing result includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
the determining module 34 is specifically configured to, when determining a plurality of candidate unknown entity vocabularies based on the result of the word segmentation process:
based on a term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
Any of the candidate unknown entity words belongs to the candidate entity word set.
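The word segmentation result described above, in which every word segmentation vocabulary belongs to the candidate entity vocabulary set and/or the domain word stock, can be illustrated with a dictionary-based segmenter. The text does not name a segmentation algorithm; forward maximum matching is swapped in here purely as an assumption, and the function name `segment` is hypothetical.

```python
def segment(text, candidate_entities, domain_lexicon):
    """Forward maximum matching against the union of the two vocabularies.

    Only dictionary words are emitted, so every returned token belongs to the
    candidate entity set and/or the domain lexicon.
    """
    dictionary = set(candidate_entities) | set(domain_lexicon)
    dictionary.discard("")
    if not dictionary:
        return []
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                tokens.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return tokens
```

The resulting token lists, one per corpus, are the kind of input on which the word frequency statistics would then be computed.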
In a possible implementation manner, the determining module 34 is specifically configured to, when determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
And through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
Example III
The embodiment of the application further provides a computer device 400, as shown in fig. 4, which is a schematic structural diagram of the computer device 400 provided in the embodiment of the application, including:
a processor 41, a memory 42, and a bus 43; the memory 42 is used to store execution instructions and includes an internal memory 421 and an external memory 422; the internal memory 421 is used for temporarily storing operation data in the processor 41 and data exchanged with the external memory 422, such as a hard disk; the processor 41 exchanges data with the external memory 422 through the internal memory 421; and when the computer device 400 operates, the processor 41 and the memory 42 communicate through the bus 43, so that the processor 41 executes the following instructions in a user mode:
acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set;
generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
Based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set;
and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies.
In a possible implementation manner, in the instructions executed by the processor 41, the generating a domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
In a possible implementation manner, in the instruction executed by the processor 41, the result of the word segmentation processing includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
based on the word segmentation processing result, determining a plurality of candidate unknown entity words, including:
based on a term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
any of the candidate unknown entity words belongs to the candidate entity word set.
In a possible implementation manner, in the instructions executed by the processor 41, the determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies includes:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
Under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
The present application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the method for determining an unknown entity vocabulary described in the above method embodiment.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application and are not intended to limit its scope; the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that, within the technical scope disclosed in the present application, any person skilled in the art may modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method for determining unknown entity words is characterized by comprising the following steps:
acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set;
Generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set;
determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies;
the word segmentation processing result comprises: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
based on the word segmentation processing result, determining a plurality of candidate unknown entity words, including:
based on a term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
Any of the candidate unknown entity words belongs to the candidate entity word set;
the determining at least one target unknown entity vocabulary from the plurality of candidate unknown entity vocabularies comprises the following steps:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
and selecting at least one candidate unknown entity word from a plurality of candidate unknown entity words according to the verification word, and taking the candidate unknown entity word as a target unknown entity word.
2. The method according to claim 1, wherein the generating a domain lexicon based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
3. The method of claim 1, wherein selecting at least one of the candidate unknown entity words from a plurality of the candidate unknown entity words as a target unknown entity word based on the verification word comprises:
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
If the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
4. A device for determining an unknown entity vocabulary, comprising:
the acquisition module is used for acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set;
the generation module is used for generating a domain word stock based on a plurality of linguistic data belonging to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;
The processing module is used for carrying out word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock, and determining a plurality of candidate unknown entity vocabularies based on the word segmentation processing result; any one of the candidate unknown entity words belongs to the candidate entity word set;
the determining module is used for determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies;
the word segmentation processing result comprises: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;
the determining module is specifically configured to, when determining a plurality of candidate unknown entity vocabularies based on a result of the word segmentation process:
based on a term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;
determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;
Any of the candidate unknown entity words belongs to the candidate entity word set;
the determining at least one target unknown entity vocabulary from the plurality of candidate unknown entity vocabularies comprises the following steps:
randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
and selecting at least one candidate unknown entity word from a plurality of candidate unknown entity words according to the verification word, and taking the candidate unknown entity word as a target unknown entity word.
5. The apparatus of claim 4, wherein the generating module, when generating the domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed, is specifically configured to:
performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;
and forming the domain word library based on the entity words included in each corpus.
6. The apparatus of claim 4, wherein the determining module is configured to select, from a plurality of the candidate unknown entity words, at least one of the candidate unknown entity words as a target unknown entity word based on the verification word, specifically configured to:
Randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;
re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;
if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;
under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;
and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
7. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of any one of claims 1 to 3.
8. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 3.
CN201911066527.3A 2019-11-04 2019-11-04 Method and device for determining unknown entity vocabulary Active CN110837730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911066527.3A CN110837730B (en) 2019-11-04 2019-11-04 Method and device for determining unknown entity vocabulary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911066527.3A CN110837730B (en) 2019-11-04 2019-11-04 Method and device for determining unknown entity vocabulary

Publications (2)

Publication Number Publication Date
CN110837730A CN110837730A (en) 2020-02-25
CN110837730B true CN110837730B (en) 2023-05-05

Family

ID=69576070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066527.3A Active CN110837730B (en) 2019-11-04 2019-11-04 Method and device for determining unknown entity vocabulary

Country Status (1)

Country Link
CN (1) CN110837730B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417876A (en) * 2020-11-23 2021-02-26 北京乐学帮网络技术有限公司 Text processing method and device, computer equipment and storage medium
CN112926319B (en) * 2021-02-26 2024-01-12 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining domain vocabulary
CN114118090B (en) * 2021-11-12 2024-08-06 北京嘉和海森健康科技有限公司 Medical new entity name determining method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN107577671A (en) * 2017-09-19 2018-01-12 中央民族大学 A kind of key phrases extraction method based on multi-feature fusion
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106910501B (en) * 2017-02-27 2019-03-01 腾讯科技(深圳)有限公司 Text entities extracting method and device

Also Published As

Publication number Publication date
CN110837730A (en) 2020-02-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant