CN107885719B - Vocabulary category mining method and device based on artificial intelligence and storage medium - Google Patents

Vocabulary category mining method and device based on artificial intelligence and storage medium

Info

Publication number
CN107885719B
CN107885719B CN201710854428.6A CN201710854428A CN107885719B CN 107885719 B CN107885719 B CN 107885719B CN 201710854428 A CN201710854428 A CN 201710854428A CN 107885719 B CN107885719 B CN 107885719B
Authority
CN
China
Prior art keywords
vocabulary
category
sentence
subject
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710854428.6A
Other languages
Chinese (zh)
Other versions
CN107885719A (en)
Inventor
赵岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710854428.6A
Publication of CN107885719A
Application granted
Publication of CN107885719B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an artificial-intelligence-based vocabulary category mining method, device and storage medium. The method comprises: mining subject sentences (sentences containing a subject) from a corpus to be mined, taking the subject of each subject sentence as a vocabulary item, and establishing a correspondence between each vocabulary item and the subject sentence in which it occurs; screening subject description sentences out of the mined subject sentences, where a subject description sentence is a subject sentence that reveals the category to which the corresponding vocabulary item belongs; and, for each vocabulary item, analyzing its corresponding subject description sentences to determine the category to which it belongs. The scheme saves labor cost, improves mining efficiency, and is generally applicable.

Description

Vocabulary category mining method and device based on artificial intelligence and storage medium
[ technical field ]
The invention relates to a computer application technology, in particular to a vocabulary category mining method, a vocabulary category mining device and a storage medium based on artificial intelligence.
[ background of the invention ]
Artificial intelligence, abbreviated AI, is a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. It is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Research in the field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others.
With the development of artificial intelligence, text understanding technology has become increasingly important. A key part of text understanding is understanding the meaning of each word in a text, and establishing the higher-level category of a word is the basis of word understanding. For example, in an information recommendation system, vocabulary categories can support a precise understanding of information topics, and in a dialogue system they can support a precise understanding of user intent.
Vocabulary category mining is therefore required. Two approaches are in common use:
1) Text relation extraction: the vocabulary relation is extracted directly from a sentence that states it explicitly. For example, from the sentence "balsam pear is a vegetable" it can be extracted that the category of the vocabulary item "balsam pear" is "vegetable".
2) Domain vocabulary construction: domain vocabulary relations are constructed manually, or domain vocabulary is mined from domain text. For example, vocabulary for food ingredients and dishes can be mined from menus.
However, both approaches have problems in practice. Approach 1) requires the corpus to be mined to state the vocabulary relation explicitly in a sentence and is otherwise inapplicable, which is a severe limitation. Approach 2) requires manual work, which increases labor cost and lowers efficiency.
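To make approach 1) and its limitation concrete, the following is a minimal sketch of explicit relation extraction. The regular expression and the function name are illustrative assumptions, not part of the patent; a real extractor would be considerably richer.

```python
import re

# Hypothetical pattern for explicit "X is a/an Y" statements (an assumption
# made for illustration only).
IS_A = re.compile(
    r"^(?P<word>[\w ]+?) is an? (?P<category>[\w ]+?)\.?$", re.IGNORECASE
)

def extract_explicit_category(sentence: str):
    """Return (word, category) if the sentence states the relation explicitly."""
    match = IS_A.match(sentence.strip())
    if match is None:
        return None  # the relation is not stated explicitly
    return match.group("word"), match.group("category")

print(extract_explicit_category("Balsam pear is a vegetable"))
print(extract_explicit_category("Balsam pear tastes bitter"))
```

The second call finds nothing because the sentence never names a category, which is exactly the limitation described for this approach.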
[ summary of the invention ]
In view of this, the invention provides a vocabulary category mining method, device and storage medium based on artificial intelligence, which can save labor cost, improve mining efficiency and have general applicability.
The specific technical scheme is as follows:
An artificial-intelligence-based vocabulary category mining method comprises the following steps:
mining subject sentences, that is, sentences containing a subject, from the corpus to be mined, taking the subject of each subject sentence as a vocabulary item, and establishing a correspondence between each vocabulary item and the subject sentence in which it occurs;
screening subject description sentences out of the mined subject sentences, where a subject description sentence is a subject sentence that reveals the category to which the corresponding vocabulary item belongs;
and, for each vocabulary item, analyzing its corresponding subject description sentences to determine the category to which it belongs.
According to a preferred embodiment of the present invention, screening the subject description sentences out of the mined subject sentences comprises:
for each subject sentence, determining whether it is a subject description sentence by using a preset rule set;
or, for each subject sentence, determining whether it is a subject description sentence by using a pre-trained binary classification model.
According to a preferred embodiment of the present invention, determining, for each vocabulary item, the category to which it belongs by analyzing its corresponding subject description sentences comprises:
performing the following processing for each vocabulary item:
performing coarse-grained classification according to the subject description sentences corresponding to the vocabulary item, and determining the coarse-grained category to which it belongs;
refining the coarse-grained category according to the subject description sentences corresponding to the vocabulary item, and determining the fine-grained category to which it belongs;
where the fine-grained category is a lower-level category of the coarse-grained category.
According to a preferred embodiment of the present invention, performing coarse-grained classification according to the subject description sentences corresponding to the vocabulary item and determining the coarse-grained category to which it belongs comprises:
determining the coarse-grained category to which the vocabulary item belongs from its corresponding subject description sentences by means of a pre-trained first classification model.
According to a preferred embodiment of the present invention, refining the coarse-grained category according to the subject description sentences corresponding to the vocabulary item and determining the fine-grained category to which it belongs comprises:
forming a set A from the lower-level categories of the coarse-grained category;
determining whether a subject description sentence corresponding to the vocabulary item contains a category name in set A;
if so, extracting the fine-grained category to which the vocabulary item belongs from that subject description sentence by text relation extraction;
if not, determining the fine-grained category to which the vocabulary item belongs from its corresponding subject description sentences by means of a pre-trained second classification model.
An artificial-intelligence-based vocabulary category mining apparatus comprises an acquisition unit, a screening unit and a classification unit;
the acquisition unit is configured to mine subject sentences, that is, sentences containing a subject, from the corpus to be mined, take the subject of each subject sentence as a vocabulary item, and establish a correspondence between each vocabulary item and the subject sentence in which it occurs;
the screening unit is configured to screen subject description sentences out of the mined subject sentences, where a subject description sentence is a subject sentence that reveals the category to which the corresponding vocabulary item belongs;
and the classification unit is configured to analyze, for each vocabulary item, its corresponding subject description sentences and determine the category to which it belongs.
According to a preferred embodiment of the present invention,
the screening unit determines, for each subject sentence, whether it is a subject description sentence by using a preset rule set;
or the screening unit determines, for each subject sentence, whether it is a subject description sentence by using a pre-trained binary classification model.
According to a preferred embodiment of the present invention, the classification unit comprises a first classification subunit and a second classification subunit;
the first classification subunit is configured to, for each vocabulary item, perform coarse-grained classification according to its corresponding subject description sentences and determine the coarse-grained category to which it belongs;
the second classification subunit is configured to, for each vocabulary item, refine the coarse-grained category according to its corresponding subject description sentences and determine the fine-grained category to which it belongs;
where the fine-grained category is a lower-level category of the coarse-grained category.
According to a preferred embodiment of the present invention, the first classification subunit determines the coarse-grained category to which the vocabulary item belongs from its corresponding subject description sentences by means of a pre-trained first classification model.
According to a preferred embodiment of the present invention, the second classification subunit forms a set A from the lower-level categories of the coarse-grained category and determines whether a subject description sentence corresponding to the vocabulary item contains a category name in set A. If so, it extracts the fine-grained category to which the vocabulary item belongs from that subject description sentence by text relation extraction; if not, it determines the fine-grained category from the subject description sentences by means of a pre-trained second classification model.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Based on the above, the scheme of the invention first mines subject sentences from the corpus to be mined, takes the subject of each subject sentence as a vocabulary item, and establishes a correspondence between each vocabulary item and the subject sentence in which it occurs. Subject description sentences are then screened out of the mined subject sentences, and for each vocabulary item the corresponding subject description sentences are analyzed to determine the category to which it belongs.
[ description of the drawings ]
FIG. 1 is a flowchart of a first embodiment of the vocabulary category mining method based on artificial intelligence according to the present invention.
FIG. 2 is a flowchart illustrating a second embodiment of the vocabulary category mining method based on artificial intelligence according to the present invention.
FIG. 3 is a schematic diagram of the general category system of the present invention.
FIG. 4 is a schematic structural diagram illustrating an embodiment of an artificial intelligence-based vocabulary category mining apparatus according to the present invention.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
To address the problems in the prior art, the invention provides a general and efficient vocabulary category mining method: subject description sentences related to a vocabulary item are first obtained, and the category of the vocabulary item, that is, its superordinate category, is then determined by analyzing those subject description sentences.
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a first embodiment of the vocabulary category mining method based on artificial intelligence according to the present invention. As shown in fig. 1, the following detailed implementation is included.
At 101, subject sentences, that is, sentences containing a subject, are mined from the corpus to be mined; the subject of each subject sentence is taken as a vocabulary item, and a correspondence is established between each vocabulary item and the subject sentence in which it occurs.
At 102, subject description sentences are screened out of the mined subject sentences, where a subject description sentence is a subject sentence that reveals the category to which the corresponding vocabulary item belongs.
At 103, for each vocabulary item, the category to which it belongs is determined by analyzing its corresponding subject description sentences.
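The three steps can be read as a small pipeline. The sketch below is only an outline under stated assumptions: the function name `mine_categories` and the three pluggable callables are hypothetical, and the toy stand-ins at the bottom replace the syntactic parser, the screening logic and the classifier that the embodiment leaves to existing techniques.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional

def mine_categories(
    sentences: Iterable[str],
    find_subject: Callable[[str], Optional[str]],
    is_description: Callable[[str, str], bool],
    classify: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    # Step 101: keep sentences that have a subject; map subject -> sentences.
    by_word: Dict[str, List[str]] = defaultdict(list)
    for sentence in sentences:
        subject = find_subject(sentence)
        if subject is not None:
            by_word[subject].append(sentence)
    # Step 102: keep only sentences that reveal the subject's category.
    described = {
        word: [s for s in sents if is_description(word, s)]
        for word, sents in by_word.items()
    }
    # Step 103: classify each word from its remaining sentences.
    return {w: classify(w, sents) for w, sents in described.items() if sents}

# Toy stand-ins (assumptions for illustration, not the patent's components).
def toy_subject(sentence: str) -> Optional[str]:
    words = sentence.split()
    return words[0] if len(words) > 1 else None

def toy_is_description(word: str, sentence: str) -> bool:
    return " is a" in sentence or "starred" in sentence

def toy_classify(word: str, sents: List[str]) -> str:
    hit = any("actor" in s or "starred" in s for s in sents)
    return "person" if hit else "thing"

print(mine_categories(
    ["S is an actor", "S went home", "Run"],
    toy_subject, toy_is_description, toy_classify,
))
```

Passing the three stages in as callables keeps the outline independent of any particular parser or model.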
To implement the scheme of this embodiment, the corpus to be mined must first be obtained. Its sources may include encyclopedia text, document data, news, trusted web pages, and the like.
In principle any text resource can be used, but to ensure the accuracy of the subsequent mining results, text resources of higher reliability should be preferred, or corpora related to the task at hand, such as dialogue corpora.
After the corpus to be mined is obtained, it can be processed in a series of steps to mine subject sentences from it.
For example, the text in the corpus can be split into sentences and each sentence syntactically parsed; sentences containing a subject are extracted and used as subject sentences, the subject of each subject sentence is taken as a vocabulary item, and a correspondence is established between each vocabulary item and the subject sentence in which it occurs.
Subject description sentences, that is, subject sentences that reveal the category to which the corresponding vocabulary item belongs, can then be screened out of the mined subject sentences.
Taking a subject A as an example, a subject description sentence is a sentence that describes A and from which the category of A can be determined without relying on external information. It may be an explicit definition of A, such as "A is an actor", or it may not be an explicit definition, such as "A starred in a movie", as long as the category of A can be inferred from it by common sense.
After the subject description sentences are screened out, the category to which each vocabulary item belongs can be determined by analyzing its corresponding subject description sentences.
Preferably, for each vocabulary item, coarse-grained classification is first performed according to its corresponding subject description sentences to determine the coarse-grained category to which it belongs; the coarse-grained category is then refined according to the same subject description sentences to determine the fine-grained category, which is a lower-level category of the coarse-grained category.
For example, the coarse-grained category may be "person" and the fine-grained category may be "actor", "singer", and so on; "actor" and "singer" are both lower-level categories of "person".
Based on the above description, fig. 2 is a flowchart of a second embodiment of the vocabulary category mining method based on artificial intelligence according to the present invention. As shown in fig. 2, the following detailed implementation is included.
At 201, a corpus to be mined is obtained.
The sources of the corpus to be mined may include encyclopedia text, document data, news, trusted web pages, and so on.
In principle any text resource can be used, but to ensure the accuracy of the subsequent mining results, text resources of higher reliability should be preferred, or corpora related to the task at hand, such as dialogue corpora.
At 202, subject sentences, that is, sentences containing a subject, are mined from the corpus to be mined; the subject of each subject sentence is taken as a vocabulary item, and a correspondence is established between each vocabulary item and the subject sentence in which it occurs.
For example, the text in the corpus can be split into sentences and each sentence syntactically parsed; sentences containing a subject are extracted and used as subject sentences.
The subject sentences in this embodiment are generally sentences containing only one subject.
In addition, the subject of each subject sentence can be taken as a vocabulary item, and a correspondence established between each vocabulary item and the subject sentence in which it occurs.
For example, a triple <vocabulary, subject sentence, source> may be generated for each vocabulary item.
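As an illustration of step 202, the sketch below builds such triples from sentences that are assumed to be already parsed. The input format, a list of (token, dependency label) pairs with the label "nsubj" marking the subject, is an assumption borrowed from common dependency-parsing conventions; the patent does not name a parser or label set, and `Triple` and `mine_triples` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class Triple(NamedTuple):
    vocabulary: str        # the subject, used as the vocabulary item
    subject_sentence: str  # the sentence in which it is the subject
    source: str            # where the sentence came from

# One parsed sentence: (token, dependency label) pairs (assumed format).
ParsedSentence = List[Tuple[str, str]]

def mine_triples(parsed: List[ParsedSentence], source: str) -> List[Triple]:
    triples = []
    for tokens in parsed:
        subjects = [tok for tok, dep in tokens if dep == "nsubj"]
        # Keep only sentences with exactly one subject.
        if len(subjects) != 1:
            continue
        text = " ".join(tok for tok, _ in tokens)
        triples.append(Triple(subjects[0], text, source))
    return triples

parsed_corpus = [
    [("S", "nsubj"), ("sings", "root")],
    [("Go", "root"), ("home", "advmod")],  # no subject, so it is skipped
]
print(mine_triples(parsed_corpus, "encyclopedia"))
```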
At 203, subject description sentences are screened out of the mined subject sentences.
As described above, a subject description sentence is a subject sentence that reveals the category to which the corresponding vocabulary item belongs.
For each triple <vocabulary, subject sentence, source>, it is determined whether the subject sentence is a subject description sentence for the corresponding vocabulary item, so as to distinguish ordinary declarative sentences from subject description sentences that carry a category relation for the subject.
Specifically, for each subject sentence, a preset rule set can be used to determine whether it is a subject description sentence.
The rule set may specify in advance which syntactic rules a subject sentence must satisfy, or which content elements it must contain, to count as a subject description sentence.
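A rule set of this kind could be sketched as a list of patterns. The two rules below are invented for illustration (one syntactic pattern for explicit definitions, one content cue); the patent does not list its actual rules, and `RULE_TEMPLATES` and `is_subject_description` are hypothetical names.

```python
import re

# Hypothetical rules; {word} is replaced by the escaped vocabulary item.
RULE_TEMPLATES = [
    r"^{word} is an? \w+",               # explicit definition: "A is an actor"
    r"^{word} (starred|played) in\b",    # content cue hinting at a category
]

def is_subject_description(word: str, sentence: str) -> bool:
    """Return True if any rule in the rule set matches the sentence."""
    escaped = re.escape(word)
    return any(
        re.search(template.format(word=escaped), sentence)
        for template in RULE_TEMPLATES
    )

print(is_subject_description("A", "A is an actor"))
print(is_subject_description("A", "A went home"))
```

Rules are cheap to write and inspect, but every new phrasing needs a new rule, which is one reason the embodiment also offers a trained model as an alternative.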
Alternatively, for each subject sentence, a pre-trained binary classification model (a two-class text classification model) can be used to determine whether it is a subject description sentence.
The binary classification model is trained on training samples. After training, a subject sentence is input into the model and a classification result is output, which is either "subject description sentence" or "not a subject description sentence". How to train such a model is prior art.
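The patent does not say which binary model is used. Purely as an illustration of the train-then-predict flow, the sketch below uses a tiny bag-of-words Naive Bayes classifier with add-one smoothing; the class name, the six training sentences and the choice of Naive Bayes are all assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple

class NaiveBayesBinary:
    """Toy two-class text classifier (illustrative stand-in only)."""

    def fit(self, samples: List[Tuple[str, bool]]) -> "NaiveBayesBinary":
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter()
        for text, label in samples:
            self.docs[label] += 1
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tokens: List[str], label: bool) -> float:
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in tokens:
            # Add-one smoothing over the training vocabulary.
            score += math.log(
                (self.counts[label][tok] + 1) / (total + len(self.vocab))
            )
        return score

    def predict(self, text: str) -> bool:
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        return self._log_score(tokens, True) > self._log_score(tokens, False)

# Made-up training samples: True = subject description sentence.
SAMPLES = [
    ("Ann is an actor", True),
    ("Bob is a singer", True),
    ("Cid starred in a movie", True),
    ("Ann went home yesterday", False),
    ("Bob likes long walks", False),
    ("Cid arrived late", False),
]

model = NaiveBayesBinary().fit(SAMPLES)
print(model.predict("Dee is a dancer"), model.predict("Dee went home"))
```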
At 204, for each vocabulary item, coarse-grained classification is performed according to its corresponding subject description sentences to determine the coarse-grained category to which it belongs.
A general category system can be constructed in advance. FIG. 3 is a schematic diagram of the general category system of the present invention: a hierarchical category network with a directed acyclic graph structure, in which each node is a category and the edges between nodes represent superordinate-subordinate relations between categories. As shown in FIG. 3, "actor" and "singer" are each lower-level categories of "person", and "person" is a lower-level category of "thing".
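One simple way to hold such a category network in memory is an adjacency mapping from each category to its direct lower-level categories. The sketch below assumes that representation and uses only the four categories named in the example, writing the top-level human category as "person"; `CHILDREN` and `lower_categories` are hypothetical names.

```python
from typing import Dict, List, Set

# Parent category -> direct lower-level categories (assumed representation).
CHILDREN: Dict[str, List[str]] = {
    "thing": ["person"],
    "person": ["actor", "singer"],
    "actor": [],
    "singer": [],
}

def lower_categories(category: str, hierarchy: Dict[str, List[str]] = CHILDREN) -> Set[str]:
    """All categories reachable below `category`, direct or indirect."""
    seen: Set[str] = set()
    stack = list(hierarchy.get(category, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # a DAG node may be reachable by several paths
            seen.add(node)
            stack.extend(hierarchy.get(node, []))
    return seen

print(sorted(lower_categories("person")))
```

The `seen` set matters because, in a directed acyclic graph, a category can have more than one superordinate category and would otherwise be visited repeatedly.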
The subject description sentences are classified at coarse granularity, and a supervised classification model can be used for this. To reduce the difficulty of constructing training samples, the classification categories can be upper-level categories in the general category system; for example, a coarse-grained category can be "person".
Correspondingly, for each vocabulary item, the coarse-grained category to which it belongs can be determined from its corresponding subject description sentences by means of a pre-trained first classification model.
The first classification model may be a commonly used text classification model, such as a support vector machine model or a convolutional neural network model; how to train it is prior art.
Through the above processing, a four-tuple <vocabulary, subject description sentence, source, category> can be obtained for each vocabulary item.
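The step from triples to four-tuples can be sketched as follows. The first classification model is represented by an arbitrary callable; `keyword_model` is a deliberately crude stand-in for a trained SVM or CNN, and all names here are hypothetical.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple

class Quad(NamedTuple):
    vocabulary: str
    description_sentence: str
    source: str
    category: str  # coarse-grained category

def attach_coarse_category(
    triples: Iterable[Tuple[str, str, str]],
    first_model: Callable[[str], str],
) -> List[Quad]:
    """Extend each <vocabulary, sentence, source> triple with a category."""
    return [
        Quad(word, sentence, source, first_model(sentence))
        for word, sentence, source in triples
    ]

def keyword_model(sentence: str) -> str:
    # Stand-in for the trained first classification model (assumption).
    if "actor" in sentence or "singer" in sentence:
        return "person"
    return "thing"

print(attach_coarse_category(
    [("S", "S is an actor", "encyclopedia")], keyword_model
))
```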
At 205, for each vocabulary item, the coarse-grained category determined for it is refined according to its corresponding subject description sentences to determine the fine-grained category to which it belongs.
After the coarse-grained category of a vocabulary item is determined, it needs to be refined further into a fine-grained category. The fine-grained category is drawn from the set formed by the lower-level categories, in the general category system, of the coarse-grained category to which the vocabulary item belongs, which helps ensure the accuracy of the refined result.
In addition, when the coarse-grained category is refined according to the subject description sentence corresponding to the vocabulary item, different processing can be applied depending on the content of that subject description sentence.
For example, when the subject description sentence contains a category name from the set, an explicit relation extraction method can be used to determine the fine-grained category to which the vocabulary item belongs; when it does not, an implicit relation discrimination method can be used.
Based on the above, the following processing can be performed for each vocabulary item:
forming a set A from the lower-level categories of the coarse-grained category to which the vocabulary item belongs;
determining whether the subject description sentence corresponding to the vocabulary item contains a category name in set A;
if so, determining the fine-grained category by explicit relation extraction, for example by using an existing text relation extraction method to extract the fine-grained category from the subject description sentence;
if not, determining the fine-grained category by implicit relation discrimination, that is, by applying a pre-trained second classification model to the subject description sentence.
For example, if the coarse-grained category to which the vocabulary item belongs is "person", the set A of its lower-level categories includes "actor", "singer", and so on.
Suppose the subject description sentence is "S is an actor". It directly contains the category name "actor" from set A, so explicit relation extraction can be used to determine that the fine-grained category of the vocabulary item S is "actor".
Suppose instead that the subject description sentence is "S has starred in a movie". It does not directly contain any category name from set A, so implicit relation discrimination can be used to determine that the fine-grained category of S is "actor".
The second classification model may be a commonly used text classification model, such as a convolutional neural network model; how to train it is prior art.
In this way, the fine-grained category to which each vocabulary item belongs can be obtained.
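The two branches of step 205 can be sketched as below. The explicit extractor only understands "<word> is a/an <category>", and falling back to the model when a category name is mentioned but cannot be extracted is an added assumption; the patent specifies only the two main branches. The function names and the stand-in `second_model` are hypothetical.

```python
import re
from typing import Callable, Optional, Set

def extract_explicit(word: str, sentence: str, set_a: Set[str]) -> Optional[str]:
    # Hypothetical extractor: handles only "<word> is a/an <category>".
    match = re.match(rf"^{re.escape(word)} is an? (\w+)", sentence)
    if match and match.group(1) in set_a:
        return match.group(1)
    return None

def refine(
    word: str,
    sentence: str,
    set_a: Set[str],
    second_model: Callable[[str], str],
) -> str:
    # Does the sentence mention any category name from set A?
    mentions = any(
        re.search(rf"\b{re.escape(name)}\b", sentence) for name in set_a
    )
    if mentions:
        # Explicit branch: extract the category stated in the sentence.
        found = extract_explicit(word, sentence, set_a)
        if found is not None:
            return found
    # Implicit branch (also the assumed fallback): ask the second model.
    return second_model(sentence)

def second_model(sentence: str) -> str:
    # Stand-in for the trained second classification model (assumption).
    return "actor" if "movie" in sentence else "singer"

set_a = {"actor", "singer"}
print(refine("S", "S is an actor", set_a, second_model))
print(refine("S", "S has starred in a movie", set_a, second_model))
```

Restricting both branches to names in set A mirrors the requirement that the fine-grained category come from the lower-level categories of the already determined coarse-grained category.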
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In short, with the schemes of the above method embodiments, subject sentences are first mined from the corpus to be mined, the subject of each subject sentence is taken as a vocabulary item, and a correspondence is established between each vocabulary item and the subject sentence in which it occurs. Subject description sentences are then screened out of the mined subject sentences, and for each vocabulary item the category to which it belongs is determined by analyzing its corresponding subject description sentences.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
FIG. 4 is a schematic structural diagram of an embodiment of the artificial-intelligence-based vocabulary category mining apparatus according to the present invention. As shown in FIG. 4, the apparatus includes an acquisition unit 401, a screening unit 402, and a classification unit 403.
The obtaining unit 401 is configured to dig out main sentences including a subject from a corpus to be dug, use the subject in each main sentence as a vocabulary, and establish a corresponding relationship between the vocabulary and the main sentence in which the vocabulary is located.
A screening unit 402, configured to screen a subject description sentence from the mined subject sentence, where the subject description sentence is a subject sentence that can represent a category to which a corresponding vocabulary belongs.
The classifying unit 403 is configured to determine, for each vocabulary, a category to which the vocabulary belongs by analyzing the main language description sentence corresponding to the vocabulary.
It can be seen that, to implement the solution of this embodiment, the acquisition unit 401 first needs to obtain the corpus to be mined, whose sources may include encyclopedia texts, document data, news information, trusted web pages, and the like.
In principle, any text resource can be used; however, to guarantee the accuracy of the subsequent mining results, text resources with higher reliability, or corpora related to the task at hand (such as dialogue corpora), should be preferred where possible.
After acquiring the corpus to be mined, the acquisition unit 401 may perform a series of processes thereon so as to mine the main sentence therefrom.
For example, the acquisition unit 401 may split the text in the corpus to be mined into sentences, perform syntactic analysis on each sentence, and extract the sentences that contain a subject as main sentences; it then uses the subject in each main sentence as a vocabulary and establishes a correspondence between the vocabulary and the main sentence in which the vocabulary is located.
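As a non-limiting illustration, the mining step described above might be sketched as follows. The description does not specify a parser, so `find_subject` is a hypothetical stand-in that treats the words before the first known verb as the subject; a real system would presumably use a proper syntactic (for example, dependency) parser. The names `VERBS`, `split_sentences`, `find_subject`, and `mine_main_sentences` are assumptions introduced only for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical verb list used by the toy subject finder below.
VERBS = {"is", "was", "are", "has", "starred", "sang"}

def split_sentences(text):
    """Split raw text into sentences at ., !, ? or their Chinese equivalents."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]

def find_subject(sentence):
    """Toy stand-in for syntactic analysis: words before the first known verb."""
    words = sentence.rstrip(".!?。！？").split()
    for i, word in enumerate(words):
        if word.lower() in VERBS:
            return " ".join(words[:i]) if i > 0 else None
    return None  # no subject found, so this is not a main sentence

def mine_main_sentences(corpus):
    """Map each subject (vocabulary) to the main sentences in which it occurs."""
    vocab_to_sentences = defaultdict(list)
    for text in corpus:
        for sentence in split_sentences(text):
            subject = find_subject(sentence)
            if subject is not None:
                vocab_to_sentences[subject].append(sentence)
    return dict(vocab_to_sentences)

corpus = ["S is an actor. S has played a movie.", "Raining again!"]
mapping = mine_main_sentences(corpus)
```

Here `mapping` records the correspondence between the vocabulary `S` and its two main sentences, while the subjectless fragment is discarded.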
Then, the screening unit 402 may further screen out subject description sentences from the mined main sentences, where a subject description sentence is a main sentence that can represent the category to which the corresponding vocabulary belongs.
Taking a subject A as an example, a subject description sentence is a sentence describing the subject A from which the category of A can be determined without relying on external information. It may be an explicit definition of A, such as "A is an actor", or a sentence that is not an explicit definition, such as "A starred in a movie", as long as the category of A can be inferred from it based on common sense.
Specifically, the screening unit 402 may determine, for each main sentence, whether the main sentence is a subject description sentence by using a preset rule set, or by using a pre-trained binary classification model.
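The rule-set variant of the screening step might look roughly like the sketch below. The two regular-expression rules are hypothetical examples chosen for illustration; the description does not disclose the contents of the preset rule set, and the pre-trained binary classification model variant is not shown.

```python
import re

# Hypothetical rule set: a main sentence matching any rule is kept.
RULES = [
    re.compile(r"\b(is|was) an? \w+"),                        # explicit definition
    re.compile(r"\b(played|starred|sang|wrote|directed)\b"),  # category-revealing verb
]

def is_subject_description(sentence, rules=RULES):
    """Return True if any rule in the rule set matches the sentence."""
    return any(rule.search(sentence) for rule in rules)

def screen(vocab_to_sentences, rules=RULES):
    """Keep, for each vocabulary, only its subject description sentences."""
    screened = {}
    for vocab, sentences in vocab_to_sentences.items():
        kept = [s for s in sentences if is_subject_description(s, rules)]
        if kept:
            screened[vocab] = kept
    return screened

main_sentences = {
    "S": ["S is an actor.", "S went home yesterday."],
    "T": ["T smiled."],
}
screened = screen(main_sentences)
```

A vocabulary with no surviving sentence, such as `T` here, is simply dropped from the result in this sketch; how such vocabularies are handled in practice is not stated in the description.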
After the subject description sentences are screened out, the classification unit 403 may determine, for each vocabulary, the category to which the vocabulary belongs by analyzing the subject description sentences corresponding to the vocabulary.
Preferably, for each vocabulary, the classification unit 403 may first perform coarse-grained classification according to the subject description sentences corresponding to the vocabulary to determine the coarse-grained category to which the vocabulary belongs, and then refine that coarse-grained category according to the same subject description sentences to determine the fine-grained category to which the vocabulary belongs, where the fine-grained category is a lower-level category of the coarse-grained category.
Accordingly, as shown in FIG. 4, the classification unit 403 may further include: a first classification subunit 4031 and a second classification subunit 4032.
The first classification subunit 4031 is configured to perform, for each vocabulary, coarse-grained classification according to the subject description sentences corresponding to the vocabulary, and determine the coarse-grained category to which the vocabulary belongs.
The second classification subunit 4032 is configured to, on the basis of the coarse-grained classification, refine the coarse-grained category according to the subject description sentences corresponding to the vocabulary, and determine the fine-grained category to which the vocabulary belongs.
Specifically, the first classification subunit 4031 may determine the coarse-grained category to which the vocabulary belongs from the subject description sentences corresponding to the vocabulary by means of a pre-trained first classification model.
The first classification model may be a commonly used text classification model, such as a support vector machine model, a convolutional neural network model, or the like.
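For illustration only, the role of the first classification model can be sketched with a tiny multinomial Naive Bayes classifier written in plain Python. This is a deliberately swapped-in stand-in, not the support vector machine or convolutional neural network named above, and the training sentences, labels, and class name are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(sentence):
    """Lowercase, drop the trailing period, and split on whitespace."""
    return sentence.lower().rstrip(".").split()

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for word in tokenize(sentence):
                score += math.log((counts[word] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled subject description sentences.
train_sentences = [
    "S is an actor.", "T sang a famous song.",
    "Beijing is a city.", "The river flows through the province.",
]
train_labels = ["person", "person", "place", "place"]
model = NaiveBayesClassifier().fit(train_sentences, train_labels)
coarse = model.predict("U is an actor in a movie.")
```

In this toy setting `coarse` is a coarse-grained category label; any text classifier exposing a comparable `predict` step could take the model's place.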
After the coarse-grained category to which the vocabulary belongs is determined, it needs to be further refined to determine the fine-grained category to which the vocabulary belongs. The fine-grained category is selected from the set formed by the lower-level categories of that coarse-grained category, which helps ensure the accuracy of the refinement result.
In addition, when the coarse-grained category is refined according to the subject description sentences corresponding to the vocabulary, different processing modes can be adopted depending on the content of those sentences.
For example, when a subject description sentence contains a category name in the set, an explicit relation extraction method may be used to determine the fine-grained category to which the vocabulary belongs; when it does not, an implicit relation discrimination method may be used instead.
To this end, the second classification subunit 4032 may perform the following processing for each vocabulary:
forming a set A by using the lower level category of the coarse-grained category to which the vocabulary belongs;
determining whether the subject description sentence corresponding to the vocabulary contains a category name in the set A;
if yes, determining the fine-grained category to which the vocabulary belongs by an explicit relation extraction method, for example by using an existing text relation extraction method to extract the fine-grained category from the subject description sentence corresponding to the vocabulary;
if not, determining the fine-grained category to which the vocabulary belongs by an implicit relation discrimination method, that is, by applying a pre-trained second classification model to the subject description sentence corresponding to the vocabulary.
For example, if the coarse-grained category to which the vocabulary belongs is "person", the set A of lower-level categories of that coarse-grained category includes "actor", "singer", and the like.
Assuming that the subject description sentence is "S is an actor", which directly contains the category name "actor" in the set A, the explicit relation extraction method may be used to determine that the vocabulary S belongs to the fine-grained category "actor".
Assuming that the subject description sentence is "S has played a movie", which does not directly contain any category name in the set A, the implicit relation discrimination method may be used to determine that the vocabulary S belongs to the fine-grained category "actor".
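The two-branch refinement performed by the second classification subunit 4032 can be sketched as follows, under simplifying assumptions. The explicit branch is reduced to a word match against the set A, which is far cruder than a real text relation extraction method (it would, for instance, mishandle negation), and the implicit branch uses hypothetical keyword cues in place of the pre-trained second classification model. `LOWER_CATEGORIES`, `CUES`, and all function names are illustrative only.

```python
# Hypothetical hierarchy: coarse-grained category -> lower-level categories (set A).
LOWER_CATEGORIES = {"person": ["actor", "singer"]}

# Hypothetical keyword cues standing in for the second classification model.
CUES = {"actor": {"played", "starred", "movie"}, "singer": {"sang", "song", "album"}}

def words_of(sentence):
    return sentence.lower().rstrip(".").split()

def extract_explicit(sentences, set_a):
    """Explicit branch: return a category name from set A that a sentence contains."""
    for sentence in sentences:
        words = words_of(sentence)
        for category in set_a:
            if category in words:
                return category
    return None

def classify_implicit(sentences, set_a):
    """Implicit branch: pick the category in set A with the most cue-word hits."""
    scores = {category: 0 for category in set_a}
    for sentence in sentences:
        words = set(words_of(sentence))
        for category in set_a:
            scores[category] += len(words & CUES.get(category, set()))
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def refine(coarse_category, sentences):
    """Refine a coarse-grained category into one of its lower-level categories."""
    set_a = LOWER_CATEGORIES.get(coarse_category, [])
    if not set_a:
        return None  # no lower-level categories known for this coarse category
    explicit = extract_explicit(sentences, set_a)
    if explicit is not None:
        return explicit
    return classify_implicit(sentences, set_a)

fine_explicit = refine("person", ["S is an actor."])
fine_implicit = refine("person", ["S has played a movie."])
```

Because both branches only ever return members of the set A, the result is always a lower-level category of the given coarse-grained category, which mirrors the constraint stated in the description.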
For a specific work flow of the embodiment of the apparatus shown in fig. 4, reference is made to the related descriptions in the foregoing method embodiments, and details are not repeated.
FIG. 5 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 5 is only one example and should not be taken to limit the scope of use or functionality of embodiments of the present invention.
As shown in FIG. 5, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 20. As shown in FIG. 5, the network adapter 20 communicates with the other modules of the computer system/server 12 via the bus 18. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing by running the programs stored in the memory 28, for example implementing the method in the embodiment shown in FIG. 1 or FIG. 2, that is: mining main sentences containing a subject from the corpus to be mined, using the subject in each main sentence as a vocabulary, and establishing a correspondence between the vocabulary and the main sentence in which the vocabulary is located; screening out subject description sentences from the mined main sentences, where a subject description sentence is a main sentence that can represent the category to which the corresponding vocabulary belongs; and, for each vocabulary, determining the category to which the vocabulary belongs by analyzing the subject description sentences corresponding to the vocabulary.
For specific implementation, please refer to the related descriptions in the foregoing embodiments, and further description is omitted.
The invention also discloses a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, will carry out the method as in the embodiments of fig. 1 or 2.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (12)

1. A vocabulary category mining method based on artificial intelligence is characterized by comprising the following steps:
main sentences containing the subject are excavated from the corpus to be excavated, the subject in each main sentence is respectively used as a vocabulary, and the corresponding relation between the vocabulary and the main sentence where the vocabulary is located is established;
screening out a subject language description sentence from the excavated subject sentence, wherein the subject language description sentence is a subject sentence capable of showing the category to which the corresponding vocabulary belongs;
for each vocabulary, respectively analyzing a main language description sentence corresponding to the vocabulary to determine the category of the vocabulary, wherein the category comprises the following steps: determining the category to which the vocabulary belongs according to a pre-constructed general category system, wherein the general category system is a category hierarchical network with a directed acyclic graph structure, each node in the network is a category, edges among the nodes represent the upper and lower relations of the categories, and the category to which the vocabulary belongs is the category in the general category system.
2. The method of claim 1,
the step of screening out the subject descriptive sentences from the mined subject sentences comprises the following steps:
for each main sentence, respectively determining whether the main sentence is a main language description sentence by using a preset rule set;
or, for each main sentence, determining whether the main sentence is a main language descriptive sentence or not by using a pre-trained binary model.
3. The method of claim 1,
for each vocabulary, respectively analyzing the main language description sentences corresponding to the vocabulary, and determining the category of the vocabulary comprises the following steps:
for each vocabulary, the following processing is performed:
performing coarse-grained classification according to the subject language description sentence corresponding to the vocabulary, and determining the coarse-grained category to which the vocabulary belongs;
refining the coarse-grained category according to a main language descriptive sentence corresponding to the vocabulary, and determining a fine-grained category to which the vocabulary belongs;
the fine-grained category is a lower-level category of the coarse-grained category.
4. The method of claim 3,
the coarse-grained classification according to the subject language description sentence corresponding to the vocabulary, and the determining of the coarse-grained classification of the vocabulary comprises the following steps:
and determining the coarse-grained category of the vocabulary according to the main language description sentence corresponding to the vocabulary and through a first classification model obtained by pre-training.
5. The method of claim 3,
the step of refining the coarse-grained category according to the subject language descriptive sentence corresponding to the vocabulary, and the step of determining the fine-grained category to which the vocabulary belongs comprises the following steps:
forming a set A by utilizing the lower classes of the coarse-grained classes;
determining whether a main language description sentence corresponding to the vocabulary contains a category name in the set A;
if so, extracting the fine-grained category of the vocabulary from the subject language description sentence corresponding to the vocabulary in a text relation extraction mode;
if not, determining the fine-grained category of the vocabulary according to the subject language description sentence corresponding to the vocabulary and a second classification model obtained through pre-training.
6. A vocabulary category mining device based on artificial intelligence is characterized by comprising: the device comprises an acquisition unit, a screening unit and a classification unit;
the acquisition unit is used for excavating main sentences containing the subject from the corpus to be excavated, taking the subject in each main sentence as a vocabulary, and establishing the corresponding relation between the vocabulary and the main sentence where the vocabulary is located;
the screening unit is used for screening out a subject language description sentence from the excavated subject sentence, wherein the subject language description sentence is a subject sentence capable of showing the category to which the corresponding vocabulary belongs;
the classifying unit is configured to determine, for each vocabulary, a category to which the vocabulary belongs by analyzing a subject description sentence corresponding to the vocabulary, and includes: determining the category to which the vocabulary belongs according to a pre-constructed general category system, wherein the general category system is a category hierarchical network with a directed acyclic graph structure, each node in the network is a category, edges among the nodes represent the upper and lower relations of the categories, and the category to which the vocabulary belongs is the category in the general category system.
7. The apparatus of claim 6,
the screening unit determines whether the main sentences are main language description sentences or not by utilizing a preset rule set aiming at each main sentence;
or, the screening unit determines whether the main sentence is a main language descriptive sentence or not by using a pre-trained binary model for each main sentence.
8. The apparatus of claim 6,
the classification unit comprises: a first classification subunit and a second classification subunit;
the first classification subunit is configured to, for each vocabulary, perform coarse-grained classification according to the subject description sentence corresponding to the vocabulary, and determine a coarse-grained category to which the vocabulary belongs;
the second classification subunit is configured to, for each vocabulary, refine the coarse-grained category according to the subject description sentence corresponding to the vocabulary on the basis of the coarse-grained classification, and determine a fine-grained category to which the vocabulary belongs;
the fine-grained category is a lower-level category of the coarse-grained category.
9. The apparatus of claim 8,
and the first classification subunit determines the coarse-grained classification of the vocabulary according to the main language description sentence corresponding to the vocabulary through a first classification model obtained by pre-training.
10. The apparatus of claim 8,
and the second classification subunit forms a set A by using the lower category of the coarse-grained category, determines whether the main language description sentence corresponding to the vocabulary contains the category name in the set A, if so, extracts the fine-grained category to which the vocabulary belongs from the main language description sentence corresponding to the vocabulary by using a text relation extraction mode, and if not, determines the fine-grained category to which the vocabulary belongs according to the main language description sentence corresponding to the vocabulary through a second classification model obtained by pre-training.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 5.
12. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710854428.6A CN107885719B (en) 2017-09-20 2017-09-20 Vocabulary category mining method and device based on artificial intelligence and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710854428.6A CN107885719B (en) 2017-09-20 2017-09-20 Vocabulary category mining method and device based on artificial intelligence and storage medium

Publications (2)

Publication Number Publication Date
CN107885719A CN107885719A (en) 2018-04-06
CN107885719B true CN107885719B (en) 2021-06-11

Family

ID=61780776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710854428.6A Active CN107885719B (en) 2017-09-20 2017-09-20 Vocabulary category mining method and device based on artificial intelligence and storage medium

Country Status (1)

Country Link
CN (1) CN107885719B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287318B (en) * 2019-06-06 2021-09-17 秒针信息技术有限公司 Service operation detection method and device, storage medium and electronic device
CN110263342A (en) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Method for digging and device, the electronic equipment of the hyponymy of entity
CN110888971B (en) * 2019-11-29 2022-05-24 支付宝(杭州)信息技术有限公司 Multi-round interaction method and device for robot customer service and user
CN112966109B (en) * 2021-03-09 2023-04-18 北京邮电大学 Multi-level Chinese text classification method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7970766B1 (en) * 2007-07-23 2011-06-28 Google Inc. Entity type assignment
CN103034693A (en) * 2012-12-03 2013-04-10 哈尔滨工业大学 Open-type entity and type identification method thereof
CN105912625A (en) * 2016-04-07 2016-08-31 北京大学 Linked data oriented entity classification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding;Xiang Ren et al.;《Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing》;20161101;第1369-1378页 *

Also Published As

Publication number Publication date
CN107885719A (en) 2018-04-06

Similar Documents

Publication Publication Date Title
CN110287278B (en) Comment generation method, comment generation device, server and storage medium
CN107908635B (en) Method and device for establishing text classification model and text classification
US10691766B2 (en) Analyzing concepts over time
CN113807098B (en) Model training method and device, electronic equipment and storage medium
CN107885719B (en) Vocabulary category mining method and device based on artificial intelligence and storage medium
US20180341698A1 (en) Method and apparatus for parsing query based on artificial intelligence, and storage medium
US9766868B2 (en) Dynamic source code generation
US9619209B1 (en) Dynamic source code generation
US11157444B2 (en) Generating index entries in source files
CN106970993B (en) Mining model updating method and device
US20200125671A1 (en) Altering content based on machine-learned topics of interest
US10360280B2 (en) Self-building smart encyclopedia
US10546063B2 (en) Processing of string inputs utilizing machine learning
Shruthi et al. A prior case study of natural language processing on different domain
US10372816B2 (en) Preprocessing of string inputs in natural language processing
KR101713612B1 (en) Intelligent Storytelling Support System
CN112181429A (en) Information processing method and device and electronic equipment
US11520839B2 (en) User based network document modification
US20230305863A1 (en) Self-Supervised System for Learning a User Interface Language
Ali Grammatical aspects of codeswitching in Farsi-English bilingual speech: a case study of Iranian immigrants in the UK
Swamy A prior case study of natural language processing on different domain.
US20210073335A1 (en) Methods and systems for semantic analysis of table content
CN115934079A (en) Interface element capturing method, electronic device and storage medium
CN116226375A (en) Training method and device for classification model suitable for text auditing
CN118052228A (en) Domain word determining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant