CN110837730B - Method and device for determining unknown entity vocabulary

Info

Publication number: CN110837730B
Application number: CN201911066527.3A
Authority: CN
Inventors: 付骁弈; 徐猛; 张�杰
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2023-05-05
Anticipated expiration: 2039-11-04
Also published as: CN110837730A

Abstract

The application provides a method and a device for determining unknown entity words, wherein the method comprises the following steps: acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. The method for determining the unknown entity words can judge whether the unknown words are entity words or not while finding the unknown words, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown words are entity words or not so as to improve the accuracy of determining the unknown entity words.

Description

Method and device for determining unknown entity vocabulary

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for determining an unknown entity vocabulary.

Background

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. Within NLP, discovering unknown entity words is a basic task: determining which character fragments in a batch of corpora may be unknown entity words.

In industries such as bid monitoring and risk early warning, the discovery of unknown entity vocabulary has broad application prospects and strongly affects the precision of downstream tasks such as word segmentation. However, the prior art often cannot determine whether an unknown word is an entity word, or cannot reach the target accuracy in a specific field.

Therefore, accurately determining unknown entity vocabulary is an urgent problem to be solved.

Disclosure of Invention

In view of this, an object of the present application is to provide a method and an apparatus for determining an unknown entity vocabulary, which determine whether the unknown vocabulary is an entity vocabulary while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while determining whether the unknown vocabulary is an entity vocabulary, so as to improve accuracy of determining the unknown entity vocabulary.

In a first aspect, an embodiment of the present application provides a method for determining an unknown entity vocabulary, including:

acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity words in the text to be analyzed to obtain a candidate entity word set;

generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;

based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set;

and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies.

In an optional implementation manner, the generating a domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:

performing entity recognition processing on each corpus to obtain entity vocabularies included in each corpus;

and forming the domain word library based on the entity words included in each corpus.

In an alternative embodiment, the result of the word segmentation process includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;

based on the word segmentation processing result, determining a plurality of candidate unknown entity words, including:

based on a term frequency-inverse document frequency (TF-IDF) method, performing word frequency statistics on each word segmentation vocabulary appearing in each corpus to obtain the frequency of each word segmentation vocabulary appearing in each corpus;

determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus;

any of the candidate unknown entity words belongs to the candidate entity word set.

In an alternative embodiment, the determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies includes:

randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as a verification vocabulary;

re-selecting sample data to re-train the entity recognition model under the condition that the verification vocabulary does not form an entity vocabulary, and returning to the step of inputting the text to be analyzed into a pre-trained entity recognition model;

If the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock;

under the condition that the verification vocabulary is an unknown vocabulary, the round of verification process is completed, and the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies is returned to as the verification vocabulary;

and through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.

In a second aspect, an embodiment of the present application further provides a device for determining an unknown entity vocabulary, where the device for determining an unknown entity vocabulary includes: the device comprises an acquisition module, a generation module, a processing module and a determination module, wherein:

the acquisition module is used for acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set;

the generation module is used for generating a domain word stock based on a plurality of linguistic data belonging to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;

The processing module is used for carrying out word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock, and determining a plurality of candidate unknown entity vocabularies based on the word segmentation processing result; any one of the candidate unknown entity words belongs to the candidate entity word set;

and the determining module is used for determining at least one target unknown entity vocabulary from the candidate unknown entity vocabularies.

In an optional implementation manner, the generating module is specifically configured to, when generating a domain lexicon based on a plurality of corpora that belong to the same domain as the text to be analyzed:

the determining module is specifically configured to, when determining a plurality of candidate unknown entity vocabularies based on a result of the word segmentation process:

In an alternative embodiment, the determining module is specifically configured to, when determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies:

In a third aspect, embodiments of the present application further provide an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any of the possible implementations of the first aspect.

In a fourth aspect, the embodiments of the present application further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementation manners of the first aspect.

According to the method and the device for determining the unknown entity vocabulary, the text to be analyzed is obtained, the text to be analyzed is input into the entity recognition model trained in advance, and the entity vocabulary in the text to be analyzed is recognized to obtain the candidate entity vocabulary set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. Compared with the technical method for determining the unknown entity vocabulary in the prior art, the method for determining the unknown entity vocabulary can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.

In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart illustrating a method for determining unknown entity words provided by an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for verifying correctness of unknown entity vocabulary according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a determining device for determining an unknown entity vocabulary according to an embodiment of the present application;

FIG. 4 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.

Research shows that existing methods generally determine unknown words with a new word discovery model based on the aggregation degree of N-grams. Such a method can only build a general word stock: although comparing against an existing word stock shows whether an unknown word is a new word, it cannot give further information about the new word, such as whether it is an entity word. In addition, the existing method is a general statistics-based method, and its accuracy in judging whether an unknown word is an entity word is greatly limited in a specific field.

Therefore, in the prior art, it is often impossible to determine whether an unknown word is an entity word or not, or the target accuracy cannot be achieved in a specific field.

Based on the above research, the present application provides a method and an apparatus for determining an unknown entity vocabulary, which can obtain a text to be analyzed, input the text to be analyzed into a pre-trained entity recognition model, and recognize the entity vocabulary in the text to be analyzed to obtain a candidate entity vocabulary set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. The method can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. The components of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

To facilitate understanding of the present embodiment, the method for determining an unknown entity vocabulary disclosed in the embodiments of the present application is first described in detail. The execution subject of the method may be any suitable computer device.

Example 1

Referring to fig. 1, a flowchart of a method for determining an unknown entity vocabulary according to an embodiment of the present application is shown, where the method includes steps S101 to S104, where:

S101: acquiring a text to be analyzed, inputting the text to be analyzed into a pre-trained entity recognition model, and recognizing entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set.

S102: generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus.

S103: based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set.

S104: determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies.

The following describes the above-mentioned steps S101 to S104 in detail.

First: in S101, a text to be analyzed is obtained, and the text to be analyzed is input into a pre-trained entity recognition model, and entity vocabularies in the text to be analyzed are recognized to obtain a candidate entity vocabulary set.

The entity recognition model used in the present application should be trained in advance on a domain corpus annotated by experts.

By way of example, the entity recognition model may be a neural network model based on long short-term memory (LSTM) and a conditional random field (CRF), but is not limited to this model.

When the entity vocabulary in the text to be analyzed is recognized, a BIO labeling scheme can be adopted, in which B marks the beginning of an entity, I marks the inside of an entity, and O marks any other token.

Illustratively, suppose the following text to be analyzed is obtained: "… following the successful acquisition of 47.24% of the equity of xx Securities, which has a 30-year history, Guangzhou Development Area xx Holding Group Co., Ltd. has devised a name featuring Guangzhou: knowledge of the securities."

Here "xx Securities" may be labeled B_subj I_subj, and "Guangzhou Development Area xx Holding Group Co., Ltd." may be labeled B_subj I_subj I_subj I_subj I_subj, while all other words are labeled O.

In this way, the entity vocabulary in the text to be analyzed is marked out, and the remaining operations can be performed.
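The BIO labeling described above can be decoded back into entity strings with a few lines of code. The following is a minimal sketch only: the function name extract_entities, the sep parameter, and the token split of the example are assumptions made for illustration and are not taken from the application.

```python
from typing import List


def extract_entities(tokens: List[str], tags: List[str], sep: str = "") -> List[str]:
    """Collect entity strings from parallel token and BIO tag sequences.

    Hypothetical helper: tags starting with "B" open an entity, tags
    starting with "I" extend the open entity, anything else closes it.
    """
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B"):
            if current:  # a new entity starts right after another one
                entities.append(sep.join(current))
            current = [token]
        elif tag.startswith("I") and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open entity
            if current:
                entities.append(sep.join(current))
            current = []
    if current:  # entity that runs to the end of the sequence
        entities.append(sep.join(current))
    return entities


tokens = ["xx", "Securities", "has", "a", "history"]
tags = ["B_subj", "I_subj", "O", "O", "O"]
candidate_entities = extract_entities(tokens, tags, sep=" ")
```

For Chinese text, where tokens are usually single characters, the default empty separator would join them directly.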

Unlike named entity recognition on a general corpus, the named entity recognition used here only marks entity vocabularies and does not distinguish entity types such as person names, place names, organization names, and proper nouns.

After the above operations are completed, the candidate entity vocabulary set may be output.

Second: in S102, a domain word stock is generated based on a plurality of corpora belonging to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus.

Illustratively, a new word discovery model based on N-gram aggregation degree is run on a massive domain corpus. The N-gram model, also called a Chinese language model, uses collocation information between adjacent words in the context. When a continuous, unspaced string of pinyin, strokes, or digits representing letters or strokes has to be converted into a string of Chinese characters (i.e., a sentence), the model can compute the sentence with the highest probability, so that conversion to Chinese characters is automatic, no manual selection by the user is needed, and the ambiguity of many Chinese characters sharing the same pinyin (or stroke string or digit string) is avoided. The model assumes that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained directly from the corpus by counting how often N words occur together.
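The counting-based probability estimate described above can be illustrated for the simplest case, N = 2. This is a sketch under stated assumptions: the function name bigram_probability, the "<s>" start marker, and the toy corpus are hypothetical, and no smoothing is applied, so an unseen word pair yields probability zero.

```python
from collections import Counter
from typing import List


def bigram_probability(corpus: List[List[str]], sentence: List[str]) -> float:
    """Estimate P(sentence) as a product of P(word | previous word).

    Each conditional probability is a ratio of counts taken directly
    from the corpus (maximum likelihood, no smoothing).
    """
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for words in corpus:
        padded = ["<s>"] + words  # "<s>" marks the sentence start
        context_counts.update(padded[:-1])
        pair_counts.update(zip(padded, padded[1:]))

    probability = 1.0
    padded = ["<s>"] + sentence
    for previous, word in zip(padded, padded[1:]):
        if context_counts[previous] == 0:
            return 0.0  # the context was never observed
        probability *= pair_counts[(previous, word)] / context_counts[previous]
    return probability


toy_corpus = [["a", "b"], ["a", "c"]]
p = bigram_probability(toy_corpus, ["a", "b"])  # P(a|<s>) * P(b|a)
```

In the toy corpus every sentence starts with "a" and "b" follows "a" in one of two cases, so the estimate is the product of those two ratios.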

For example, the massive domain corpus and the text to be analyzed in S101 should belong to the same domain, and at the same time, the scale of the massive domain corpus should be far greater than that of the text to be analyzed in S101.

In the step, entity recognition processing is carried out on each corpus to obtain entity vocabularies included in each corpus; and forming the domain word library based on the entity words included in each corpus.

After the above operation is completed, a domain word stock corresponding to the text to be analyzed in the above S101 may be output.
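Building the domain word stock amounts to running entity recognition over every corpus and collecting the results. The sketch below assumes a recognizer is passed in as a function; build_domain_lexicon and toy_recognizer are hypothetical names, and the toy recognizer merely stands in for the trained entity recognition model.

```python
from typing import Callable, Iterable, List, Set


def build_domain_lexicon(
    corpora: Iterable[str],
    recognize: Callable[[str], List[str]],
) -> Set[str]:
    """Union of the entity words recognized in each corpus."""
    lexicon: Set[str] = set()
    for text in corpora:
        lexicon.update(recognize(text))
    return lexicon


def toy_recognizer(text: str) -> List[str]:
    """Stand-in for a trained model: looks up a fixed list of names."""
    known_names = ["xx Securities", "Guangzhou"]
    return [name for name in known_names if name in text]


domain_lexicon = build_domain_lexicon(
    ["xx Securities rose.", "Guangzhou and xx Securities."],
    toy_recognizer,
)
```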

Third: in S103, word segmentation is performed on the corpus based on the candidate entity vocabulary set and the domain vocabulary library, and a plurality of candidate unknown entity vocabularies are determined based on the result of the word segmentation; any one of the candidate unknown entity words belongs to the candidate entity word set.

The candidate entity vocabulary set output in step S101 and the domain word stock output in step S102 are combined to construct a vocabulary; the constructed vocabulary is input into a Chinese word segmentation tool, and word segmentation is performed on the corpus to obtain a word segmentation result. Because written Chinese does not mark word boundaries, a continuous character sequence has to be recombined into a word sequence according to a certain specification; this process is called word segmentation.

For example, the vocabulary constructed as described above may be input into the Chinese word segmentation tool Jieba. Jieba is a widely used open-source toolkit for Chinese natural language processing; it supports word segmentation with user-defined dictionaries, part-of-speech tagging, keyword extraction, and similar functions.

Wherein, the word segmentation processing result comprises: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock.
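To show what dictionary-constrained segmentation looks like, the sketch below uses forward maximum matching over the merged vocabulary. This is a swapped-in, simplified technique chosen so the example has no dependencies; it is not Jieba's algorithm and not necessarily what the application uses. The function name segment and the unspaced ASCII string standing in for Chinese text are assumptions.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest
    lexicon word that starts there; skip characters that start none.

    Every returned word is therefore a member of the lexicon.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                words.append(piece)
                i += size
                break
        else:
            i += 1  # no lexicon word starts at this character
    return words


candidate_set = {"xxsecurities"}             # from the entity recognition model
domain_stock = {"guangzhou", "securities"}   # from the domain corpora
merged = candidate_set | domain_stock
result = segment("guangzhouxxsecuritiesgrows", merged)
```

Longest-match-first means "xxsecurities" is preferred over the shorter "securities" where both would fit.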

After the word segmentation result is obtained, a plurality of candidate unknown entity words can be determined based on the word segmentation result.

Word frequency statistics can be performed on each word segmentation vocabulary appearing in each corpus based on a term frequency-inverse document frequency (TF-IDF) method, so as to obtain the frequency with which each word segmentation vocabulary appears in each corpus.

TF-IDF is a statistical method for evaluating how important a word is to one document in a corpus; here the document is the text to be analyzed.

The importance of a word increases in proportion to the number of times it appears in the text to be analyzed, but decreases with the frequency at which it appears across the corpus. For example, a very common word may appear many times in the text to be analyzed, but it also appears with high frequency throughout the corpus and is therefore of little importance.

And determining a plurality of candidate unknown entity words from a plurality of word segmentation words based on the candidate entity word set and the occurrence frequency of each word segmentation word in each corpus.

The word segmentation vocabularies are sorted by the word frequency statistics obtained with the TF-IDF method, and the several vocabularies with the lowest word frequency are selected as candidate unknown entity vocabularies.
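The scoring and selection step can be sketched as follows. Several details are assumptions, since the text does not fix them: the TF-IDF variant (relative term frequency times the natural log of the inverse document frequency), summing a word's score over all corpora, and reading "lowest word frequency" as the lowest summed score. The names tfidf_scores and pick_candidates are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_scores(segmented: List[List[str]]) -> Dict[str, float]:
    """Sum each word's TF-IDF over all segmented corpora."""
    n_docs = len(segmented)
    doc_freq: Counter = Counter()
    for words in segmented:
        doc_freq.update(set(words))  # count each word once per corpus

    scores: Dict[str, float] = {}
    for words in segmented:
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = scores.get(word, 0.0) + tf * idf
    return scores


def pick_candidates(
    segmented: List[List[str]], candidate_entities: Set[str], k: int
) -> List[str]:
    """Keep only words from the candidate entity set, lowest score first."""
    scores = tfidf_scores(segmented)
    eligible = [word for word in scores if word in candidate_entities]
    eligible.sort(key=lambda word: (scores[word], word))
    return eligible[:k]


docs = [["a", "b", "a"], ["a", "c"], ["a", "b"]]
scores = tfidf_scores(docs)
chosen = pick_candidates(docs, {"b", "c"}, k=1)
```

Filtering by the candidate entity set reflects the requirement that every candidate unknown entity word belongs to that set.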

And after the steps are completed, the candidate unknown entity vocabulary can be output.

Fourth: in S104, at least one target unknown entity vocabulary is determined from among the plurality of candidate unknown entity vocabularies.

The candidate unknown entity vocabularies output in step S103 are then checked for correctness.

Illustratively, the candidate unknown entity vocabularies output in step S103 are sampled by the bootstrap (self-help) method, and the sampled candidates are checked for correctness.

Referring to FIG. 2, a flowchart of a method for verifying the correctness of unknown entity vocabulary provided in an embodiment of the present application is shown. The method checks the correctness of the candidate unknown entity vocabularies sampled from the output of the method of FIG. 1, and includes steps S201 to S205, where:

S201: randomly determining at least one candidate unknown entity word from a plurality of candidate unknown entity words as a verification word.

S202: under the condition that the verification vocabulary does not form an entity vocabulary, re-selecting sample data to re-train the entity recognition model, and returning to the step of inputting the text to be analyzed into the pre-trained entity recognition model.

S203: if the verification vocabulary is a known vocabulary, removing the verification vocabulary from the candidate entity vocabulary set and the domain word stock, and returning to the step of word segmentation processing on the corpus based on the candidate entity vocabulary set and the domain word stock.

S204: under the condition that the verification vocabulary is an unknown vocabulary, completing the verification process of the round, and returning to the step of randomly determining at least one candidate unknown entity vocabulary from a plurality of candidate unknown entity vocabularies as the verification vocabulary.

S205: through multiple rounds of verification processes, the candidate unknown entity vocabulary obtained in the last round is used as a target unknown entity vocabulary.
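The control flow of steps S201 to S205 can be sketched as a loop with three outcomes per sampled word. Everything named here is hypothetical: the Verdict enum, the callback parameters (judge, retrain, remove_known, get_candidates), and in particular the stopping rule of a fixed number of consecutive passing rounds, which is an assumption because the text only says "multiple rounds". In practice the judge would be a domain expert or an expert-maintained resource.

```python
import random
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    NOT_ENTITY = "not_entity"  # S202: retrain and recognize again
    KNOWN = "known"            # S203: remove the word and re-segment
    UNKNOWN = "unknown"        # S204: the round passes


def verify_candidates(
    get_candidates: Callable[[], List[str]],
    judge: Callable[[str], Verdict],
    retrain: Callable[[], None],
    remove_known: Callable[[str], None],
    required_passes: int = 3,
    max_steps: int = 100,
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)
    passes = 0
    candidates = get_candidates()
    for _ in range(max_steps):
        if not candidates or passes >= required_passes:
            break
        word = rng.choice(candidates)          # S201: random verification word
        verdict = judge(word)
        if verdict is Verdict.NOT_ENTITY:      # S202
            retrain()
            candidates, passes = get_candidates(), 0
        elif verdict is Verdict.KNOWN:         # S203
            remove_known(word)
            candidates, passes = get_candidates(), 0
        else:                                  # S204
            passes += 1
    return candidates                          # S205: last round's candidates


# Toy usage: every sampled word turns out to be already known.
pool = ["a", "b"]
final = verify_candidates(
    get_candidates=lambda: list(pool),
    judge=lambda word: Verdict.KNOWN,
    retrain=lambda: None,
    remove_known=pool.remove,
)
```

The callbacks keep the sketch independent of any particular model or segmenter; a real system would rerun recognition or segmentation inside get_candidates.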

In the embodiment of the application, a text to be analyzed is obtained, the text to be analyzed is input into a pre-trained entity recognition model, and entity words in the text to be analyzed are recognized to obtain a candidate entity word set; generating a domain word stock based on a plurality of corpus which belong to the same domain with the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus; based on the candidate entity vocabulary set and the domain word stock, word segmentation is carried out on the corpus, and a plurality of candidate unknown entity vocabularies are determined based on the word segmentation result; any one of the candidate unknown entity words belongs to the candidate entity word set; and determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies. Furthermore, the method can judge whether the unknown vocabulary is the entity vocabulary or not while finding the unknown vocabulary, and introduce expert guiding knowledge in the corresponding field while judging whether the unknown vocabulary is the entity vocabulary or not so as to improve the accuracy of determining the unknown entity vocabulary.

Based on the same inventive concept, the embodiment of the present application further provides a device for determining an unknown entity vocabulary corresponding to the method for determining an unknown entity vocabulary, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the method for determining an unknown entity vocabulary in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.

Example two

Referring to fig. 3, a schematic structural diagram of a device for determining an unknown entity vocabulary according to a second embodiment of the present application is shown, where the device includes: the acquisition module 31, the generation module 32, the processing module 33, and the determination module 34:

the obtaining module 31 is configured to obtain a text to be analyzed, input the text to be analyzed to a pre-trained entity recognition model, and recognize entity vocabularies in the text to be analyzed to obtain a candidate entity vocabulary set;

a generating module 32, configured to generate a domain word stock based on a plurality of corpora that belong to the same domain as the text to be analyzed; the domain word stock comprises a plurality of entity words appearing in each corpus;

the processing module 33 is configured to perform word segmentation on the corpus based on the candidate entity vocabulary set and the domain vocabulary library, and determine a plurality of candidate unknown entity vocabularies based on a result of the word segmentation; any one of the candidate unknown entity words belongs to the candidate entity word set;

A determining module 34, configured to determine at least one target unknown entity vocabulary from a plurality of the candidate unknown entity vocabularies.

In a possible implementation manner, the generating module 32 is specifically configured to, when generating a domain word stock based on a plurality of corpora that belong to the same domain as the text to be analyzed:

In a possible implementation manner, the word segmentation processing result includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;

the determining module 34 is specifically configured to, when determining a plurality of candidate unknown entity vocabularies based on the result of the word segmentation process:

In a possible implementation manner, the determining module 34 is specifically configured to, when determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies:

Example III

An embodiment of the application further provides a computer device 400. Fig. 4 is a schematic structural diagram of the computer device 400, which includes:

a processor 41, a memory 42, and a bus 43. The memory 42 is used to store execution instructions and includes an internal memory 421 and an external memory 422. The internal memory 421 temporarily stores operation data of the processor 41 and data exchanged with the external memory 422, such as a hard disk; the processor 41 exchanges data with the external memory 422 through the internal memory 421. When the computer device 400 operates, the processor 41 and the memory 42 communicate through the bus 43, so that the processor 41 executes the following instructions in a user mode:

In a possible implementation manner, in the instructions executed by the processor 41, the generating a domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:

In a possible implementation manner, in the instructions executed by the processor 41, the result of the word segmentation processing includes: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;

In a possible implementation manner, in the instructions executed by the processor 41, the determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies includes:

The present application further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the method for determining an unknown entity vocabulary described in the above method embodiment.

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described system and apparatus may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device, or unit, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application and are not intended to limit its scope; the protection scope of the present application is not limited thereto. Although the foregoing examples are described in detail, those skilled in the art will appreciate that any person skilled in the art may, within the technical scope disclosed in the present application, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for determining unknown entity words is characterized by comprising the following steps:

determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies;

the word segmentation processing result comprises: a plurality of word segmentation vocabularies corresponding to each corpus respectively; any word segmentation vocabulary belongs to the candidate entity vocabulary set and/or belongs to the domain word stock;

any of the candidate unknown entity words belongs to the candidate entity word set;

the determining at least one target unknown entity vocabulary from the plurality of candidate unknown entity vocabularies comprises the following steps:

and selecting at least one candidate unknown entity word from a plurality of candidate unknown entity words according to the verification word, and taking the candidate unknown entity word as a target unknown entity word.

2. The method according to claim 1, wherein the generating a domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed includes:

3. The method of claim 1, wherein selecting at least one of the candidate unknown entity words from a plurality of the candidate unknown entity words as a target unknown entity word based on the verification word comprises:

4. A device for determining an unknown entity vocabulary, comprising:

the determining module is used for determining at least one target unknown entity vocabulary from a plurality of candidate unknown entity vocabularies;

5. The apparatus of claim 4, wherein the generating module, when generating the domain word stock based on a plurality of corpora belonging to the same domain as the text to be analyzed, is specifically configured to:

6. The apparatus of claim 4, wherein the determining module, when selecting at least one of the candidate unknown entity words from a plurality of the candidate unknown entity words as a target unknown entity word based on the verification word, is specifically configured to:

7. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of any one of claims 1 to 3.

8. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 3.