CN106649819B - Method and device for extracting entity words and hypernyms - Google Patents

Info

Publication number: CN106649819B
Authority: CN (China)
Prior art keywords: words, feature, hypernyms, webpage data, entity
Legal status: Active
Application number: CN201611247066.6A
Other languages: Chinese (zh)
Other versions: CN106649819A
Inventors: 庞伟, 陈进平, 苏文杰
Current Assignee: Beijing Qihoo Technology Co Ltd
Original Assignee: Beijing Qihoo Technology Co Ltd
Events:
Application filed by Beijing Qihoo Technology Co Ltd
Priority to CN201611247066.6A
Publication of CN106649819A
Application granted
Publication of CN106649819B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Abstract

The invention discloses a method for extracting entity words and hypernyms, which comprises the following steps: constructing a first training sample based on first webpage data; training a first deep neural network model based on the first training sample; and extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprises the first webpage data and the hypernyms correspond to the entity words. The invention solves the technical problem in the prior art that extracting entity words and hypernyms from webpage information is inefficient, and achieves the technical effect of extracting them efficiently from webpage information. The invention also discloses a device for extracting entity words and hypernyms.

Description

Method and device for extracting entity words and hypernyms
Technical Field
The invention relates to the technical field of search, in particular to a method and a device for extracting entity words and hypernyms.
Background
In a search engine, entity words and hypernyms are important basic data: they are used to analyze the concept category of the user's intention, shorten the semantic distance between a user query (Query) and a document, and help the search engine retrieve results related to the latent semantics. For example, a user queries whether new employees enjoy welfare benefits, and a certain webpage is titled "can new employees enjoy paid leave"; because the hypernym of "leave" is "welfare", the query and the webpage are semantically related. This example illustrates that hypernyms can solve part of the semantically related search problem. Entity words and hypernyms are also basic data for constructing knowledge graphs, describing concepts, entities, and the hypernym-hyponym relations among them. Therefore, researching efficient mining methods for entity words and hypernyms has great application value; it is a key technology in the field of information retrieval and a basic problem in the field of natural language processing.
Entity words and hypernyms in vertical domains are generally mined manually; the accuracy is high, a domain can basically be covered, and practical application needs are met. In the field of webpage information retrieval, however, the number of entity words and hypernyms is huge and the time cost of manual mining is prohibitive, so the extraction efficiency of entity words and hypernyms is very low.
Disclosure of Invention
In view of the above, the present invention has been made to provide a method and an apparatus for extracting entity words and hypernyms that overcome the above problems or at least partially solve them.
In one aspect of the present invention, a method for extracting entity words and hypernyms is provided, which includes:
constructing a first training sample based on the first webpage data;
training a first deep neural network model based on the first training samples;
and extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words.
Preferably, the first webpage data is encyclopedic webpage data.
Preferably, the constructing a first training sample based on the first web page data includes:
classifying the encyclopedia webpage data to obtain U-type encyclopedia webpage data, wherein U is a positive integer;
and constructing the first training sample based on the U-class encyclopedia webpage data.
Preferably, the classifying the encyclopedic webpage data to obtain U-class encyclopedic webpage data includes:
extracting partial encyclopedic webpage data from the encyclopedic webpage data;
constructing a second training sample based on the partial encyclopedic webpage data;
training a second deep neural network model based on the second training samples;
and classifying the encyclopedic webpage data by utilizing the second deep neural network model to obtain the U-type encyclopedic webpage data.
Preferably, the constructing a second training sample based on the partial encyclopedic webpage data comprises:
extracting preset information from each encyclopedia webpage in the part of encyclopedia webpage data;
classifying each encyclopedia webpage based on the preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer;
and constructing the second training sample based on the M-class encyclopedia webpage data.
Preferably, the preset information includes:
one or more of: entry titles, entry subtitles, entry abstracts, information in entry infoboxes, and entry section headings.
Preferably, the constructing the second training sample based on the M-class encyclopedia webpage data includes:
extracting a group of feature words from each of the M types of encyclopedia webpages to obtain M groups of feature words, wherein each group of feature words in the M groups of feature words comprises N feature words, the feature words are used for representing the categories of the encyclopedia webpages, and N is a positive integer;
and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are the second training samples.
Preferably, the constructing the first training sample based on the U-class encyclopedia webpage data includes:
generating feature statement vectors corresponding to each type of encyclopedic web pages based on each type of encyclopedic web pages in the U-type encyclopedic web page data, and obtaining U feature statement vectors, wherein the U feature statement vectors correspond to the U-type encyclopedic web pages one to one, and the U feature statement vectors are the first training samples.
Preferably, the generating a feature statement vector corresponding to each type of encyclopedic web page based on each type of encyclopedic web page in the U-type encyclopedic web page data includes:
extracting feature sentences from first-class encyclopedia webpages, wherein the feature sentences comprise entity words and hypernyms, and the first-class encyclopedia webpages are any one of the U-class encyclopedia webpages;
marking the positions of the entity words and the hypernyms in the characteristic sentences;
and generating a feature statement vector corresponding to the first type of encyclopedia webpage based on the marked feature statement.
Preferably, the extracting feature sentences from the first-class encyclopedia webpages comprises:
extracting the entry abstract from the first-class encyclopedia webpage;
performing sentence segmentation on the entry abstract;
and screening out, from the segmented sentences, the sentences containing the entry title, wherein the sentences containing the entry title are the feature sentences.
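As a minimal sketch of the three steps above, feature-sentence screening might look like the following; the helper name, the splitting punctuation, and the example abstract are illustrative assumptions, since the patent does not specify a segmentation rule:

```python
import re

def extract_feature_sentences(abstract: str, entry_title: str) -> list[str]:
    """Split an entry abstract into sentences and keep those containing the title."""
    # Split on common Chinese and Western sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", abstract) if s.strip()]
    # Sentences containing the entry title are the feature sentences.
    return [s for s in sentences if entry_title in s]

abstract = ("Example Film is a 2000 martial arts film. It won four awards. "
            "Example Film was directed by a famous director.")
print(extract_feature_sentences(abstract, "Example Film"))
```

The first and last sentences survive the screen because they contain the entry title; the middle one is dropped.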
Preferably, the marking the positions of the entity words and the hypernyms in the feature sentences comprises:
detecting whether the characteristic sentence contains a first preset character and a second preset character;
if yes, marking words in front of the first preset character in the characteristic sentence as entity word components, and marking words behind the second preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
Preferably, the marking the positions of the entity words and the hypernyms in the feature sentences comprises:
detecting whether the feature sentence contains a third preset character and a fourth preset character;
if yes, marking words in front of the third preset character in the characteristic sentence as entity word components, and marking words between the third preset character and a fourth preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
Preferably, the marking the positions of the entity words and the hypernyms in the feature sentences comprises:
and marking the positions of the entity words and the hypernyms in the characteristic sentences based on the regular expressions.
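A minimal sketch of regular-expression-based position marking; the pattern, the preset character "是" (roughly "is a"), the closed set of hypernym endings, and the example sentence are all illustrative assumptions, since the excerpt does not disclose the actual expressions:

```python
import re

# Hypothetical pattern: "<entity>是[一部/一种/...]<modifier><hypernym>".
PATTERN = re.compile(
    r"^(?P<entity>.+?)是(?:一[部种个名位款])?(?P<modifier>.*?)"
    r"(?P<hypernym>电影|小说|游戏|学校|动物|植物)$"
)

def mark_positions(sentence: str):
    """Mark the character positions of the entity word and hypernym, if any."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return {
        "entity": m.group("entity"),
        "hypernym": m.group("hypernym"),
        "entity_span": m.span("entity"),      # position in the feature sentence
        "hypernym_span": m.span("hypernym"),
    }

print(mark_positions("卧虎藏龙是李安执导的一部电影"))
```

For the example sentence ("Crouching Tiger, Hidden Dragon is a film directed by Ang Lee"), the words before the preset character are marked as the entity word and the trailing category word as the hypernym.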
Preferably, the training a first deep neural network model based on the first training sample includes:
extracting each feature statement in each feature statement vector in the U feature statement vectors;
extracting the entity words and the hypernyms from each feature sentence based on the positions of the entity words and the hypernyms in the feature sentences;
generating U entity words and hypernym vectors based on the extracted entity words and hypernyms;
and taking the U feature statement vectors as input data of the first deep neural network model, taking the U entity words and the hypernym vectors as output data of the first deep neural network model, and training the first deep neural network model.
Preferably, the extracting entity words and hypernyms from the second webpage data by using the first deep neural network model includes:
extracting the text content in the second webpage;
sentence segmentation is carried out on the text content in the second webpage to obtain L sentences, wherein L is a positive integer;
and sequentially inputting the L sentences into the first deep neural network model, thereby extracting entity words and hypernyms from second webpage data.
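The three extraction steps above can be sketched as a pipeline; `predict_stub` is a hypothetical stand-in for the trained first deep neural network model (a trivial rule, so the pipeline shape rather than the model is the point):

```python
import re

def predict_stub(sentence: str):
    """Stand-in for the trained first deep neural network model: a trivial
    rule so the pipeline runs end to end (hypothetical, for illustration)."""
    m = re.match(r"^(.+?) is a (?:kind of )?(\w+)", sentence)
    return (m.group(1), m.group(2)) if m else None

def extract_from_page(body_text: str, model=predict_stub):
    """Extract (entity word, hypernym) pairs from a page's text content."""
    # Steps 1-2: take the text content and sentence-segment it into L sentences.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", body_text) if s.strip()]
    # Step 3: input the L sentences into the model sequentially.
    pairs = []
    for sentence in sentences:
        result = model(sentence)
        if result is not None:
            pairs.append(result)
    return pairs

page = "Python is a language. It was created in 1991. Go is a language."
print(extract_from_page(page))  # [('Python', 'language'), ('Go', 'language')]
```

Swapping `predict_stub` for a real model keeps the rest of the pipeline unchanged.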
In another aspect of the present invention, an apparatus for extracting entity words and hypernyms is provided, including:
the construction unit is used for constructing a first training sample based on the first webpage data;
a training unit, configured to train a first deep neural network model based on the first training sample;
and the extraction unit is used for extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words.
Preferably, the first webpage data is encyclopedic webpage data.
Preferably, the construction unit includes:
the classification subunit is used for classifying the encyclopedic webpage data to obtain U-class encyclopedic webpage data, wherein U is a positive integer;
and the constructing subunit is used for constructing the first training sample based on the U-class encyclopedia webpage data.
Preferably, the classification subunit is specifically configured to:
extracting partial encyclopedic webpage data from the encyclopedic webpage data; constructing a second training sample based on the partial encyclopedic webpage data; training a second deep neural network model based on the second training samples; and classifying the encyclopedic webpage data by utilizing the second deep neural network model to obtain the U-type encyclopedic webpage data.
Preferably, the classification subunit is specifically configured to:
extracting preset information from each encyclopedia webpage in the part of encyclopedia webpage data; classifying each encyclopedia webpage based on the preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer; and constructing the second training sample based on the M-class encyclopedia webpage data.
Preferably, the preset information includes:
one or more of: entry titles, entry subtitles, entry abstracts, information in entry infoboxes, and entry section headings.
Preferably, the classification subunit is specifically configured to:
extracting a group of feature words from each of the M types of encyclopedia webpages to obtain M groups of feature words, wherein each group of feature words in the M groups of feature words comprises N feature words, the feature words are used for representing the categories of the encyclopedia webpages, and N is a positive integer; and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are the second training samples.
Preferably, the constructing subunit is specifically configured to:
generating feature statement vectors corresponding to each type of encyclopedic web pages based on each type of encyclopedic web pages in the U-type encyclopedic web page data, and obtaining U feature statement vectors, wherein the U feature statement vectors correspond to the U-type encyclopedic web pages one to one, and the U feature statement vectors are the first training samples.
Preferably, the constructing subunit is specifically configured to:
extracting feature sentences from first-class encyclopedia webpages, wherein the feature sentences comprise entity words and hypernyms, and the first-class encyclopedia webpages are any one of the U-class encyclopedia webpages; marking the positions of the entity words and the hypernyms in the characteristic sentences; and generating a feature statement vector corresponding to the first type of encyclopedia webpage based on the marked feature statement.
Preferably, the constructing subunit is specifically configured to:
extracting the entry abstract from the first-class encyclopedia webpage; performing sentence segmentation on the entry abstract; and screening out, from the segmented sentences, the sentences containing the entry title, wherein the sentences containing the entry title are the feature sentences.
Preferably, the constructing subunit is specifically configured to:
detecting whether the characteristic sentence contains a first preset character and a second preset character; if yes, marking words in front of the first preset character in the characteristic sentence as entity word components, and marking words behind the second preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
Preferably, the constructing subunit is specifically configured to:
detecting whether the feature sentence contains a third preset character and a fourth preset character; if yes, marking words in front of the third preset character in the characteristic sentence as entity word components, and marking words between the third preset character and a fourth preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
Preferably, the constructing subunit is specifically configured to:
and marking the positions of the entity words and the hypernyms in the characteristic sentences based on the regular expressions.
Preferably, the training unit comprises:
a first extraction subunit, configured to extract each feature statement in each feature statement vector of the U feature statement vectors;
a second extraction subunit, configured to extract, based on positions of the entity words and hypernyms in the feature sentences, the entity words and hypernyms from each feature sentence;
the generating subunit is used for generating U entity words and hypernym vectors based on the extracted entity words and hypernyms;
and the training subunit is used for taking the U feature statement vectors as input data of the first deep neural network model, taking the U entity words and the hypernym vectors as output data of the first deep neural network model, and training the first deep neural network model.
Preferably, the extraction unit includes:
the third extraction subunit is used for extracting the text content in the second webpage;
a dividing subunit, configured to perform sentence division on the text content in the second web page to obtain L sentences, where L is a positive integer;
and the input subunit is used for sequentially inputting the L sentences into the first deep neural network model so as to extract entity words and hypernyms from the second webpage data.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
in the embodiments of the present invention, a method for extracting entity words and hypernyms is disclosed, which comprises: constructing a first training sample based on first webpage data; training a first deep neural network model based on the first training sample; and extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprises the first webpage data and the hypernyms correspond to the entity words. This solves the technical problem in the prior art that extracting entity words and hypernyms from webpage information is inefficient, and achieves the technical effect of extracting them efficiently from webpage information.
The foregoing is only an overview of the technical solutions of the present invention. Embodiments of the invention are described below so that the technical means of the invention can be understood more clearly and its above and other objects, features, and advantages become more comprehensible.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram illustrating a method of extracting entity words and hypernyms according to one embodiment of the invention;
fig. 2 is a block diagram illustrating an apparatus for extracting entity words and hypernyms according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of an encyclopedia web page in accordance with one embodiment of the invention.
Detailed Description
The embodiment of the invention provides a method and a device for extracting entity words and hypernyms, which are used for solving the technical problem of low efficiency in extracting the entity words and hypernyms from webpage information in the prior art.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
The embodiment provides a method for extracting entity words and hypernyms, as shown in fig. 1, including:
step S101: based on the first web page data, a first training sample is constructed.
In a specific implementation process, the first webpage data is encyclopedic webpage data.
For example, the first webpage data may be "360 Encyclopedia" webpage data. "360 Encyclopedia" is an online encyclopedia that covers a large number of knowledge domains and contains over 8 million entries; its webpages carry rich, manually edited structured information and therefore provide a high-quality mining corpus. As shown in fig. 3, the webpage information of one entry, "horizontal tiger dragon", in "360 Encyclopedia" is given. This embodiment mainly uses the "360 Encyclopedia" webpage data to train a first deep neural network model, and uses the first deep neural network model to mine and extract entity words and hypernyms.
As an alternative embodiment, step S101 includes: classifying the encyclopedia webpage data to obtain U-type encyclopedia webpage data, wherein U is a positive integer; and constructing a first training sample based on the U-type encyclopedia webpage data.
In a specific implementation process, since encyclopedic web pages of the same category generally have the same or similar characteristics, and the distribution positions of entity words and hypernyms have the same or similar rules, in order to improve the extraction efficiency of the entity words and hypernyms, encyclopedic web page data can be firstly classified, and then a first training sample is constructed based on the classified encyclopedic web page data.
In the specific implementation process, encyclopedic webpage data can be divided into the following categories: the video category, the book category, the character category, the place category, the company category, the game category, the school category, the myth story category, the website category, the animation category, the plant category, the country category, the disease category, the food category, the magazine category, the animal category, the language category, the station category, the idiom category, and the like. Here, the encyclopedic web page data is classified, that is, the entries on the encyclopedic web page are classified.
In a specific implementation process, the classifying encyclopedic webpage data to obtain U-class encyclopedic webpage data includes: extracting partial encyclopedic webpage data from encyclopedic webpage data; constructing a second training sample based on part of encyclopedic webpage data; training a second deep neural network model based on a second training sample; and classifying the encyclopedic webpage data by using the second deep neural network model to obtain U-type encyclopedic webpage data.
For example, a part of the encyclopedia webpage data may be extracted from all the encyclopedia webpage data to train the second deep neural network model, after which all the encyclopedia webpage data can be classified by the second deep neural network model. Here, the more encyclopedia webpage data is extracted, the better the resulting second training sample and the higher the classification accuracy of the finally trained second deep neural network model.
In a specific implementation process, the constructing a second training sample based on part of encyclopedic webpage data includes: extracting preset information from each encyclopedia webpage in part of encyclopedia webpage data; classifying each encyclopedia webpage in the encyclopedia webpage data based on preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer; and constructing a second training sample based on the M-type encyclopedia webpage data.
The preset information comprises feature words used for representing categories of encyclopedic webpages.
For example, a film-and-television encyclopedia webpage corresponds to feature words such as "movie", "director", "leading actor", "screenwriter", "release date", "runtime", "dialogue", "production", "plot", "actor", and "character"; a game encyclopedia webpage corresponds to feature words such as "game", "online game", "stand-alone game", "e-sports", "player", "game equipment", "game event", "monster fighting", "dungeon running", and "main quest"; a book encyclopedia webpage corresponds to feature words such as "author", "publishing house", "autobiography", "novel", "book title", "book", "literature", "binding", "printed sheets", and "table of contents"; a school encyclopedia webpage corresponds to feature words such as "school", "university", "middle school", "primary school", "specialty", "undergraduate", "ministry of education", "doctoral program", "master's program", "college", "campus", "school address", "school song", "school motto", "enrollment", "teaching", "faculty strength", "admission score", "scientific research", "learning", "alumni", "discipline", "educational administration", "student union", and "teacher". Identifying these feature words helps determine the category of an encyclopedia webpage.
In a specific implementation process, the preset information includes one or more of: entry titles, entry subtitles, entry abstracts, information in entry infoboxes, and entry section headings. These fields usually contain feature words that indicate the category of the webpage. For example, as shown in fig. 3, for the encyclopedia page of the entry "horizontal tiger dragon", the entry subtitle contains the feature word "movie"; the entry abstract contains feature words such as "movie"; the entry infobox contains feature words such as "director", "screenwriter", "release date", "runtime", "dialogue", and "movie"; and the entry section headings contain feature words such as "plot", "actor", "character", and "movie". Recognizing these feature words helps determine that this encyclopedia page belongs to the movie category.
In a specific implementation process, the constructing a second training sample based on the M-class encyclopedia webpage data includes: extracting a group of feature words from each of the M types of encyclopedia webpages to obtain M groups of feature words, wherein each group of feature words in the M groups of feature words comprises N feature words, the feature words are used for representing the categories of the encyclopedia webpages, and N is a positive integer; and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are second training samples.
For example, after classifying part of the encyclopedia webpage data to obtain M classes of encyclopedia webpage data, for each class all corresponding candidate feature words can be extracted; the weight of each feature word is then calculated with the TF-IDF (term frequency-inverse document frequency) algorithm, the feature words are ranked by weight, and the top-N feature words are screened out. The larger a feature word's weight, the more accurately the category of an encyclopedia webpage can be determined from it. Hence, for each class of encyclopedia webpage data, feature words with large weights are kept and feature words with small weights are discarded. The value of N may be set according to the actual situation; the range given here is 50 to 250, so N may be, for example, 50, 100, 150, 200, or 250.
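The TF-IDF weighting and top-N screening described above can be sketched in pure Python. Pooling each class's pages into one token list (so the IDF term counts classes rather than pages), as well as the function name and toy data, are assumptions for illustration:

```python
import math
from collections import Counter

def top_n_feature_words(words_by_class: dict[str, list[str]], n: int) -> dict[str, list[str]]:
    """Rank each class's candidate feature words by TF-IDF and keep the top n."""
    num_classes = len(words_by_class)
    # Document frequency: in how many classes does each word appear?
    df = Counter()
    for words in words_by_class.values():
        df.update(set(words))
    top = {}
    for cls, words in words_by_class.items():
        tf = Counter(words)
        total = len(words)
        # TF-IDF weight per candidate word for this class.
        weights = {w: (tf[w] / total) * math.log(num_classes / df[w]) for w in tf}
        top[cls] = sorted(weights, key=weights.get, reverse=True)[:n]
    return top

words_by_class = {
    "film":   ["movie", "director", "actor", "movie", "school"],
    "school": ["school", "university", "teacher", "school", "movie"],
    "game":   ["game", "player", "quest", "game", "game"],
}
print(top_n_feature_words(words_by_class, 2))
```

Words that appear in several classes ("movie" also shows up on a school page here) get a low IDF and fall out of the top-N list, which matches the patent's goal of discarding low-weight feature words.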
For example, word2vec may be selected to train the M N-dimensional feature word vectors. Word2vec is an efficient tool that represents words as real-valued vectors; drawing on ideas from deep learning, training reduces the processing of text content to vector operations in a multidimensional vector space, where similarity in the vector space can represent similarity in text semantics.
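The idea that vector-space similarity stands in for semantic similarity can be illustrated with cosine similarity over toy vectors; the 3-dimensional embeddings below are made up, standing in for trained word2vec vectors:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: vector-space proximity as a proxy for semantic similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings standing in for trained word2vec vectors.
vectors = {
    "movie":  [0.9, 0.1, 0.0],
    "film":   [0.8, 0.2, 0.1],
    "school": [0.1, 0.9, 0.3],
}
print(cosine(vectors["movie"], vectors["film"]))    # close to 1: semantically near
print(cosine(vectors["movie"], vectors["school"]))  # much smaller
```

In a real word2vec model the vectors are learned from context windows in the corpus rather than written by hand; only the similarity computation is the same.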
In a specific implementation process, the second training sample includes the M N-dimensional feature word vectors and category information of the M types of encyclopedia web page data, where the M N-dimensional feature word vectors correspond to the category information of the M types of encyclopedia web page data one to one.
After obtaining the second training sample, the second deep neural network model may be trained based on the second training sample.
In a specific implementation process, the second deep neural network model may adopt a CNN (Convolutional Neural Network) model provided by the OPTIMUS platform, which improves the generalization capability and extensibility of the second deep neural network model. OPTIMUS is a process integration and optimization design platform: it integrates CAD/CAE simulation tools, automates the simulation workflow, and includes modules such as design of experiments, single-objective/multi-objective optimization, and robustness/reliability design, serving as an auxiliary tool for multidisciplinary simulation design.
In a specific implementation process, when the second deep neural network model is trained based on the second training sample, the M N-dimensional feature word vectors may be used as the standard inputs of the second deep neural network model and the category information of the M classes of encyclopedia webpage data as its standard outputs, so as to train the second deep neural network model. The trained second deep neural network model can then classify any encyclopedia webpage according to the feature word vector formed from the feature words on that webpage.
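The input/output contract above can be sketched with a trivial dot-product scorer standing in for the trained CNN; the vocabulary, the class vectors, and the helper names are illustrative assumptions:

```python
from collections import Counter

VOCAB = ["movie", "director", "actor", "school", "teacher", "game"]  # illustrative

def to_vector(words: list[str], vocab: list[str]) -> list[int]:
    """Bag-of-feature-words vector over a fixed vocabulary, one slot per word."""
    counts = Counter(words)
    return [counts[w] for w in vocab]

def classify(page_vec: list[int], class_vectors: dict[str, list[int]]) -> str:
    """Assign the class whose feature word vector overlaps the page vector most
    (dot product); a trivial stand-in for the trained CNN classifier."""
    return max(class_vectors,
               key=lambda c: sum(a * b for a, b in zip(page_vec, class_vectors[c])))

class_vectors = {
    "film":   to_vector(["movie", "director", "actor"], VOCAB),
    "school": to_vector(["school", "teacher"], VOCAB),
}
page_vec = to_vector(["director", "movie", "movie", "actor"], VOCAB)
print(classify(page_vec, class_vectors))  # film
```

The CNN in the patent learns this mapping from the M feature word vectors to category labels instead of scoring overlap directly, but the vector-in, category-out shape is the same.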
As an alternative embodiment, the second training samples may be divided into multiple parts, each part being used to train a small CNN model, yielding multiple small CNN models; finally, all the second training samples are used to train one large CNN model. After training of the second deep neural network model is completed, all encyclopedic webpage data can be classified using the second deep neural network model, thereby obtaining the U classes of encyclopedic webpage data. For example, the 8 million webpages in "360 encyclopedia" can be classified: each encyclopedia webpage passes through the small CNN models and the large CNN model in turn and is thereby classified, where one encyclopedia webpage may belong to multiple categories.
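The partitioning of the second training samples into parts, one part per small CNN model, can be sketched as follows. The chunking helper and the integer stand-ins for (feature vector, category) samples are illustrative, not taken from the filing.

```python
def split_into_parts(samples, num_parts):
    """Split training samples into num_parts roughly equal slices.

    Each slice would train one small CNN model; the full sample list
    is reused afterwards to train the large CNN model.
    """
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    size, extra = divmod(len(samples), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` slices take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        parts.append(samples[start:end])
        start = end
    return parts

samples = list(range(10))  # stand-ins for (feature vector, category) pairs
parts = split_into_parts(samples, 3)
```

Every sample lands in exactly one slice, so the union of the small-model training sets equals the large-model training set.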
After all encyclopedic webpage data are classified, clustering the encyclopedic webpages according to the classified categories to obtain the U-type encyclopedic webpage data.
After the U-type encyclopedia webpage data is obtained, a first training sample can be constructed based on the U-type encyclopedia webpage data.
As an alternative embodiment, the constructing a first training sample based on the U-class encyclopedia webpage data includes: generating feature statement vectors corresponding to each type of encyclopedic web pages based on each type of encyclopedic web pages in U-type encyclopedic web page data, and obtaining U feature statement vectors, wherein the U feature statement vectors correspond to the U-type encyclopedic web pages one to one, and the U feature statement vectors are first training samples.
Wherein, generating a feature statement vector corresponding to each class of encyclopedia webpage based on each class of encyclopedia webpage in the U-class encyclopedia webpage data comprises: extracting feature sentences from first-class encyclopedia webpages, wherein the feature sentences comprise entity words and hypernyms, and the first-class encyclopedia webpages are any one of the U classes of encyclopedia webpages; marking the positions of the entity words and the hypernyms in the feature sentences; and generating a feature statement vector corresponding to the first class of encyclopedia webpages based on the marked feature sentences. In this way, the feature statement vectors corresponding to the U-class encyclopedia webpage data are obtained. Each feature statement vector comprises a plurality of feature sentences, and the positions of the entity words and hypernyms are marked in each feature sentence.
In the specific implementation process, after the encyclopedic webpages are clustered by classification category, a pattern emerges: in some categories, the sentence in the entry abstract that contains the entry title also contains a hypernym of the entry title. That is, the positions of the entity words and hypernyms in such sentences follow clear, regular patterns.
In a specific implementation process, when feature sentences are extracted from the first-class encyclopedic web pages, the entry abstract in the first-class encyclopedic web pages can be extracted; sentence segmentation is performed on the entry abstract; and sentences containing the entry title are screened out from the segmented sentences, where the sentences containing the entry title are the feature sentences. When the entry abstract is segmented, it may be split on the three punctuation marks "!", "?", and "。".
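The segment-and-screen step can be sketched with a short helper. This is a minimal illustration: the sample abstract text is invented, and the splitter handles both half-width and full-width forms of the three punctuation marks.

```python
import re

def extract_feature_sentences(abstract, entry_title):
    """Split an entry abstract on '!', '?' and '。' (plus full-width
    '！' and '？') and keep only the sentences containing the entry
    title; those are the feature sentences."""
    sentences = [s.strip() for s in re.split(r'[!?。！？]', abstract) if s.strip()]
    return [s for s in sentences if entry_title in s]

# Hypothetical entry abstract for the entry "古代战争" ("ancient war era").
abstract = "《古代战争》是一款以即时战斗模式为核心的神话战争游戏。游戏于2016年发布。"
feature_sentences = extract_feature_sentences(abstract, "古代战争")
```

Only the first sentence mentions the entry title, so only it survives the screening and becomes a feature sentence.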
For example, as shown in Table 1, the left column lists feature sentences from entry abstracts that contain the entry title, and the right column lists the corresponding entity words and hypernyms. The entry abstract of the entry "ancient war era" contains the feature sentence ""Ancient war era" is a mythical war game with the instant-battle mode at its core.", which includes the entity word "ancient war era" and the hypernym "mythical war game". The entry abstract of the entry "how much and garlat" contains the feature sentence ""How much and garlat" is a Spanish football player who plays as a forward", which includes the entity word "how much and garlat" and the hypernym "football player". The entry abstract of the entry "joint-remembered cookie" contains the feature sentence ""Joint-remembered cookie" is a cake-making store in Fushan City", which includes the entity word "joint-remembered cookie" and the hypernym "cake-making store". The entry abstract of the entry "orchid smile" contains the feature sentence ""Orchid smile" is an inspirational short story describing the diligent efforts of the mother cymbidium", which includes the entity word "orchid smile" and the hypernym "short story". It can be seen that such feature sentences contain entity words and hypernyms, so the mining and extraction of entity words and hypernyms can be performed on the basis of these sentences.
| Feature sentence (containing the entry title) | Entity word | Hypernym |
| "Ancient war era" is a mythical war game with the instant-battle mode at its core. | ancient war era | mythical war game |
| "How much and garlat" is a Spanish football player who plays as a forward. | how much and garlat | football player |
| "Joint-remembered cookie" is a cake-making store in Fushan City. | joint-remembered cookie | cake-making store |
| "Orchid smile" is an inspirational short story describing the diligent efforts of the mother cymbidium. | orchid smile | short story |
TABLE 1
In a specific implementation process, the marking of the positions of the entity words and the hypernyms in the feature sentences comprises the following two implementation modes:
the first method is as follows: detecting whether the characteristic sentence contains a first preset character and a second preset character; if yes, marking words in front of the first preset character in the characteristic sentence as entity word components, and marking words behind the second preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form entity words, and the words marked as the hypernym components form hypernyms.
When the feature sentence is component-labeled, e denotes an entity word component, h denotes a hypernym component, and o denotes other components; in general, punctuation marks in the feature sentence are labeled as other components.
Specifically, the first preset character may be "是" ("is") and the second preset character may be "的" ("of"); that is, for the sentence pattern "A is the B of ……" ("A是……的B"), if A matches the entry title, A is the entity word and B is the hypernym.
For example, the feature sentence ""Ancient war era" is a mythical war game with the instant-battle mode at its core." may be component-labeled word by word, with the labeling results as follows:
ancient: e, war era: e, is: o, a: o, with: o, instant: o, battle: o, mode: o, as: o, core: o, of: o, mythical: h, war game: h, 。: o
Wherein, the words "ancient" and "war era" labeled e together constitute the entity word "ancient war era", and the words "mythical" and "war game" labeled h together constitute the hypernym "mythical war game".
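Mode one can be sketched as a small labeling function over a pre-tokenized sentence. This is a simplified illustration: the tokenization is assumed to be done by an external word segmenter, the token list below is hand-written, and the function implements only the "tokens before the first preset character are e, tokens after the last occurrence of the second preset character are h" rule from the text.

```python
def label_components(tokens, first_char="是", second_char="的"):
    """Mode one: tokens before the first '是' ("is") are entity
    components (e), tokens after the last '的' ("of") are hypernym
    components (h), everything else is an other component (o);
    punctuation is always labeled o."""
    punctuation = set("，。！？、；：“”（）")
    if first_char not in tokens or second_char not in tokens:
        return None  # sentence does not match the "A is the B of ..." pattern
    i = tokens.index(first_char)                            # first preset char
    j = len(tokens) - 1 - tokens[::-1].index(second_char)   # last preset char
    labels = []
    for k, tok in enumerate(tokens):
        if tok in punctuation:
            labels.append("o")
        elif k < i:
            labels.append("e")
        elif k > j:
            labels.append("h")
        else:
            labels.append("o")
    return labels

tokens = ["古代", "战争", "是", "一款", "以", "即时", "战斗", "模式",
          "为", "核心", "的", "神话", "战争游戏", "。"]
labels = label_components(tokens)
entity = "".join(t for t, l in zip(tokens, labels) if l == "e")
hypernym = "".join(t for t, l in zip(tokens, labels) if l == "h")
```

Joining the e-labeled tokens recovers the entity word and joining the h-labeled tokens recovers the hypernym, exactly as in the worked example above.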
In addition, the first preset character may be "位于" ("is located at") and the second preset character may be "的" ("of"); that is, for the sentence pattern "A is located at ……, the B of ……" ("A位于……的B"), if A matches the entry title, A is the entity word and B is the hypernym.
For example, for the feature sentence "Yantai National Hotel is located in the golden economic center of Yantai City, Shandong Province, and is the largest small-commodity wholesale market in Jiaodong", each word in the feature sentence may be component-labeled, with the labeling results as follows:
Yantai: e, National Hotel: e, is located at: o, Shandong Province: o, Yantai City: o, golden economic center: o, ，: o, Jiaodong: o, largest: o, of: o, small commodities: h, wholesale market: h
Wherein, the words "Yantai" and "National Hotel" labeled e together constitute the entity word "Yantai National Hotel", and the words "small commodities" and "wholesale market" labeled h together constitute the hypernym "small-commodity wholesale market".
The second method comprises the following steps: detecting whether the feature sentence contains a third preset character and a fourth preset character; if yes, the words in front of the third preset character in the characteristic sentence are marked as entity word components, and the words between the third preset character and the fourth preset character in the characteristic sentence are marked as hypernym components, wherein the words marked as the entity word components form entity words, and the words marked as the hypernym components form hypernyms.
Specifically, the third preset character may be "是" ("is") and the fourth preset character may be "之一" ("one of"); that is, for the sentence pattern "A is one of B" ("A是B之一"), if A matches the entry title, A is the entity word and B is the hypernym.
For example, for the feature sentence "Fat meat paste is one of the hot-dish recipes, made with fat meat and sesame as the main ingredients", each word can be component-labeled, with the labeling results as follows:
fat meat paste: e, is: o, hot dish: h, recipes: h, one of: o, ，: o, with: o, fat meat: o, ，: o, sesame: o, as: o, main ingredients: o
Wherein, the word "fat meat paste" labeled e constitutes the entity word, and the words "hot dish" and "recipes" labeled h together form the hypernym "hot-dish recipes".
Here, to improve labeling efficiency, the positions of the entity words and hypernyms in the feature sentences may first be located with regular expressions, and then labeled. For example:
for feature sentences having a sentence pattern of "a is B of … …," the following regular expression may be utilized:
regx = u'([\u4e00-\u9fa5·-]{1,})是([\u4e00-\u9fa5\w\s-]{1,})的([\u4e00-\u9fa5“”]{1,})[。；]*'
For feature sentences having the sentence pattern "A is located at ……, the B of ……", the following regular expression may be utilized:
regx = u'[，\s]([\u4e00-\u9fa5“”\w\s（）]{1,})位于([\u4e00-\u9fa5\w，]{1,})的([\u4e00-\u9fa5“”]{1,})[。；]*'
For feature sentences having a sentence pattern of "A is one of B", the following regular expression can be utilized:
regx = u'([\u4e00-\u9fa5\s（）]{1,})是([\u4e00-\u9fa5\w\s，]{1,})之一[。；]*'
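A runnable illustration of the regex-locating idea for the "A is the B of ……" pattern is given below. The expression here is a simplified stand-in, not the exact expression from the filing: the character classes are narrowed for readability, and the sample sentence is the "ancient war era" example from earlier in the text.

```python
import re

# Simplified stand-in regex for the pattern "A是……的B"
# ("A is the B of ..."): group 1 is the candidate entity word,
# group 2 is the candidate hypernym.
PATTERN_A_SHI_DE_B = re.compile(
    u'^([\u4e00-\u9fa5·]{1,})是.*的([\u4e00-\u9fa5]{1,})[。！？]?$'
)

def locate_entity_and_hypernym(sentence):
    """Return (entity word, hypernym) if the sentence matches the
    'A is the B of ...' pattern, else None."""
    match = PATTERN_A_SHI_DE_B.match(sentence)
    if match is None:
        return None
    return match.group(1), match.group(2)

result = locate_entity_and_hypernym("古代战争是一款以即时战斗模式为核心的神话战争游戏。")
```

Sentences matching the pattern yield a candidate (entity word, hypernym) pair; non-matching sentences are skipped, which is what makes regex pre-filtering cheap before labeling.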
Based on the above method, a first training sample can be obtained, where the first training sample is specifically the U feature sentence vectors, each feature sentence vector includes a plurality of feature sentences, and the positions of the entity words and the hypernyms are marked in each feature sentence.
Step S102: based on the first training sample, a first deep neural network model is trained.
As an alternative embodiment, step S102 includes: extracting each feature statement in each feature statement vector in the U feature statement vectors; extracting entity words and hypernyms from each characteristic sentence based on the positions of the entity words and hypernyms in each characteristic sentence; generating U entity words and hypernym vectors based on the extracted entity words and hypernyms; and taking the U feature statement vectors as standard input data of a first deep neural network model, taking the U entity words and the hypernym vectors as standard output data of the first deep neural network model, and training the first deep neural network model.
In a specific implementation process, each feature statement vector comprises K feature statements, each entity word and hypernym vector comprises K pairs of entity words and hypernyms, K is a positive integer, the K feature statements and the K pairs of entity words and hypernyms are in one-to-one correspondence, and the U feature statement vectors and the U entity words and hypernym vectors are in one-to-one correspondence. Therefore, when the first deep neural network model obtains one feature sentence from the U feature sentence vectors, the entity words and the hypernyms corresponding to the feature sentence can be correspondingly obtained from the U entity words and the hypernym vectors, and therefore learning of the feature sentence and the corresponding entity words and hypernyms is completed.
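The assembly of the supervision pairs described above (feature sentences as standard inputs, entity-word/hypernym pairs as standard outputs, in one-to-one correspondence) can be sketched from the labeled sentences. The helper name and the labeled token list are hypothetical.

```python
def build_training_pair(labeled_sentences):
    """Given feature sentences whose tokens carry e/h/o labels,
    produce the (input, output) pair: the token sequences are the
    standard input and the (entity word, hypernym) pairs are the
    standard output, kept in one-to-one correspondence."""
    inputs, outputs = [], []
    for tokens_with_labels in labeled_sentences:
        tokens = [t for t, _ in tokens_with_labels]
        entity = "".join(t for t, l in tokens_with_labels if l == "e")
        hypernym = "".join(t for t, l in tokens_with_labels if l == "h")
        inputs.append(tokens)
        outputs.append((entity, hypernym))
    return inputs, outputs

# One hand-labeled feature sentence ("fat meat paste is one of the
# hot-dish recipes"); labels follow the e/h/o scheme from the text.
labeled = [[("肥肉", "e"), ("酱", "e"), ("是", "o"),
            ("热菜", "h"), ("菜谱", "h"), ("之一", "o")]]
X, y = build_training_pair(labeled)
```

Because inputs and outputs are appended in lockstep, the k-th feature sentence always corresponds to the k-th entity-word/hypernym pair, which is the correspondence the model learns from.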
In a specific implementation process, the first deep neural network model may adopt a BLSTM-RNN (Bidirectional Long Short-Term Memory Recurrent Neural Network) model. The trained first deep neural network model can predict the positions of the entity words and hypernyms in any input sentence, and finally extract the entity words and hypernyms.
Step S103: and extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words.
As an alternative embodiment, step S103 includes: extracting the text content in the second webpage; sentence segmentation is carried out on the text content in the second webpage to obtain L sentences, wherein L is a positive integer; and sequentially inputting the L sentences into the first deep neural network model, thereby extracting entity words and hypernyms from the second webpage data.
In the implementation process, the second web page may be any web page on the network, including encyclopedia web pages, or any other web page (e.g., "360 question and answer" web page, forum web page, etc.). Taking encyclopedic web pages as an example, the entity words and hypernyms can be extracted from the vocabulary entry abstract of the encyclopedic web pages, and the entity words and hypernyms can also be extracted from the text.
In a specific implementation process, when the first deep neural network model is used to extract entity words and hypernyms from the second webpage data, the text content in the second webpage first needs to be segmented into sentences; as described above, the text may be split on the punctuation marks "!", "?", and "。". The sentence vector composed of the segmented sentences is input into the first deep neural network model, and the output of the first deep neural network model is also a vector, which comprises the extracted pairs of entity words and hypernyms.
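The segment-then-predict pipeline can be sketched end to end. The `fake_predict` function below is a stub standing in for the trained BLSTM-RNN model (a real implementation would run the network); the page text is invented.

```python
import re

def extract_pairs_from_page(text, model_predict):
    """Segment page text on '!', '?', '。' (and full-width forms) and
    feed each sentence to the model; model_predict returns an
    (entity word, hypernym) pair or None for non-matching sentences."""
    sentences = [s for s in re.split(r'[!?。！？]', text) if s.strip()]
    pairs = []
    for sentence in sentences:
        pair = model_predict(sentence)
        if pair is not None:
            pairs.append(pair)
    return pairs

def fake_predict(sentence):
    # Stub for the first deep neural network model: crude string
    # splitting on '是' ("is") and the last '的' ("of"), illustration only.
    if "是" in sentence and "的" in sentence:
        return (sentence.split("是")[0], sentence.rsplit("的", 1)[1])
    return None

text = "古代战争是一款以即时战斗模式为核心的神话战争游戏。游戏于2016年发布。"
pairs = extract_pairs_from_page(text, fake_predict)
```

Only the sentence matching the pattern contributes a pair; the other sentence is silently skipped, mirroring how the model's output vector contains only extracted pairs.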
In the specific implementation process, a large number of entity words and hypernyms can be obtained through the first deep neural network model in step S103. The confidence of each pair of entity words and hypernyms can be further calculated; pairs with confidence below a certain threshold are filtered out and pairs with higher confidence are retained, further improving the accuracy of entity word and hypernym extraction.
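The confidence-filtering step reduces to a simple threshold comparison. The threshold value and the scored pairs below are illustrative; the filing does not specify either.

```python
def filter_by_confidence(scored_pairs, threshold=0.8):
    """Keep only entity-word/hypernym pairs whose confidence meets
    the threshold (0.8 here is an illustrative value)."""
    return [(e, h) for e, h, conf in scored_pairs if conf >= threshold]

scored = [
    ("古代战争", "神话战争游戏", 0.95),   # high confidence, kept
    ("烟台国际大酒店", "小商品批发市场", 0.42),  # low confidence, dropped
]
kept = filter_by_confidence(scored)
```

Raising the threshold trades recall for precision: fewer pairs survive, but the surviving pairs are more likely to be correct.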
The technical scheme in the embodiment of the application at least has the following technical effects or advantages:
in the embodiment of the invention, a method for extracting entity words and hypernyms is disclosed, which comprises the following steps: constructing a first training sample based on the first webpage data; training a first deep neural network model based on the first training samples; and extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words. The invention solves the technical problem of low efficiency in extracting the entity words and the hypernyms from the webpage information in the prior art, and realizes the technical effect of efficiently extracting the entity words and the hypernyms from the webpage information.
Example two
Based on the same inventive concept, this embodiment provides an apparatus for extracting entity words and hypernyms, as shown in fig. 2, including:
a constructing unit 201, configured to construct a first training sample based on the first web page data;
a training unit 202, configured to train a first deep neural network model based on the first training sample;
an extracting unit 203, configured to extract, by using the first deep neural network model, entity words and hypernyms from second webpage data, where the second webpage data includes the first webpage data, and the hypernyms correspond to the entity words.
As an alternative embodiment, the first web page data is encyclopedic web page data.
As an alternative embodiment, the construction unit 201 includes:
the classification subunit is used for classifying the encyclopedic webpage data to obtain U-class encyclopedic webpage data, wherein U is a positive integer;
and the constructing subunit is used for constructing the first training sample based on the U-class encyclopedia webpage data.
As an optional embodiment, the classification subunit is specifically configured to:
extracting partial encyclopedic webpage data from the encyclopedic webpage data; constructing a second training sample based on the partial encyclopedic webpage data; training a second deep neural network model based on the second training samples; and classifying the encyclopedic webpage data by utilizing the second deep neural network model to obtain the U-type encyclopedic webpage data.
As an optional embodiment, the classification subunit is specifically configured to:
extracting preset information from each encyclopedia webpage in the part of encyclopedia webpage data; classifying each encyclopedia webpage based on the preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer; and constructing the second training sample based on the M-class encyclopedia webpage data.
As an optional embodiment, the preset information includes:
one or more of entry titles, entry subtitles, entry abstracts, information in entry information frames and entry segmentation titles.
As an optional embodiment, the classification subunit is specifically configured to:
extracting a group of feature words from each of the M types of encyclopedia webpages to obtain M groups of feature words, wherein each group of feature words in the M groups of feature words comprises N feature words, the feature words are used for representing the categories of the encyclopedia webpages, and N is a positive integer; and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are the second training samples.
As an alternative embodiment, the constructing subunit is specifically configured to:
generating feature statement vectors corresponding to each type of encyclopedic web pages based on each type of encyclopedic web pages in the U-type encyclopedic web page data, and obtaining U feature statement vectors, wherein the U feature statement vectors correspond to the U-type encyclopedic web pages one to one, and the U feature statement vectors are the first training samples.
As an alternative embodiment, the constructing subunit is specifically configured to:
extracting feature sentences from first-class encyclopedia webpages, wherein the feature sentences comprise entity words and hypernyms, and the first-class encyclopedia webpages are any one of the U-class encyclopedia webpages; marking the positions of the entity words and the hypernyms in the characteristic sentences; and generating a feature statement vector corresponding to the first type of encyclopedia webpage based on the marked feature statement.
As an alternative embodiment, the constructing subunit is specifically configured to:
extracting the vocabulary entry abstract in the first encyclopedia webpage; performing sentence segmentation on the entry abstract; and screening out sentences containing entry titles from the segmented sentences, wherein the sentences containing the entry titles are the characteristic sentences.
As an alternative embodiment, the constructing subunit is specifically configured to:
detecting whether the characteristic sentence contains a first preset character and a second preset character; if yes, marking words in front of the first preset character in the characteristic sentence as entity word components, and marking words behind the second preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
As an alternative embodiment, the constructing subunit is specifically configured to:
detecting whether the feature sentence contains a third preset character and a fourth preset character; if yes, marking words in front of the third preset character in the characteristic sentence as entity word components, and marking words between the third preset character and a fourth preset character in the characteristic sentence as hypernym components, wherein the words marked as the entity word components form the entity words, and the words marked as the hypernym components form the hypernym.
As an alternative embodiment, the construction subunit is specifically configured to:
and marking the positions of the entity words and the hypernyms in the characteristic sentences based on the regular expressions.
As an alternative embodiment, the training unit 202 includes:
a first extraction subunit, configured to extract each feature statement in each feature statement vector of the U feature statement vectors;
a second extraction subunit, configured to extract, based on positions of the entity words and hypernyms in the feature sentences, the entity words and hypernyms from each feature sentence;
the generating subunit is used for generating U entity words and hypernym vectors based on the extracted entity words and hypernyms;
and the training subunit is used for taking the U feature statement vectors as standard input data of the first deep neural network model, taking the U entity words and the hypernym vectors as standard output data of the first deep neural network model, and training the first deep neural network model.
As an alternative embodiment, the extracting unit 203 includes:
the third extraction subunit is used for extracting the text content in the second webpage;
a dividing subunit, configured to perform sentence division on the text content in the second web page to obtain L sentences, where L is a positive integer;
and the input subunit is used for sequentially inputting the L sentences into the first deep neural network model so as to extract entity words and hypernyms from the second webpage data.
The technical scheme in the embodiment of the application at least has the following technical effects or advantages:
since the apparatus for extracting entity words and hypernyms described in this embodiment is an apparatus used for implementing the method for extracting entity words and hypernyms described in this embodiment, based on the method for extracting entity words and hypernyms described in this embodiment, those skilled in the art can understand the specific implementation manner and various variations of the apparatus for extracting entity words and hypernyms described in this embodiment, and therefore, how the apparatus for extracting entity words and hypernyms implements the method in this embodiment is not described in detail here. All devices used by those skilled in the art to implement the methods for extracting entity words and hypernyms in the embodiments of the present application are within the scope of the present application.
The technical scheme in the embodiment of the application at least has the following technical effects or advantages:
in the embodiment of the invention, the invention discloses a device for extracting entity words and hypernyms, which comprises the following steps: the construction unit is used for constructing a first training sample based on the first webpage data; a training unit, configured to train a first deep neural network model based on the first training sample; and the extraction unit is used for extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words. The invention solves the technical problem of low efficiency in extracting the entity words and the hypernyms from the webpage information in the prior art, and realizes the technical effect of efficiently extracting the entity words and the hypernyms from the webpage information.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of an apparatus for extracting entity words and hypernyms according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (14)

1. A method for extracting entity words and hypernyms, comprising:
constructing a first training sample based on the first webpage data; which comprises the following steps:
classifying the encyclopedia webpage data to obtain U-type encyclopedia webpage data, wherein U is a positive integer; the first webpage data are encyclopedic webpage data; extracting entry abstracts in the encyclopedic webpages of the first class, and performing sentence segmentation on the entry abstracts; screening out sentences containing entry titles from the segmented sentences, wherein the sentences containing the entry titles are characteristic sentences, the characteristic sentences contain entity words and hypernyms, and the first-class encyclopedic web pages are any one of the U-class encyclopedic web pages; marking the positions of the entity words and the hypernyms in the characteristic sentences; generating feature statement vectors corresponding to the first type of encyclopedic web pages based on the marked feature statements, and obtaining U feature statement vectors, wherein the U feature statement vectors correspond to the U type of encyclopedic web pages one by one, and the U feature statement vectors are first training samples;
training a first deep neural network model based on the first training samples;
extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words;
the classifying the encyclopedic webpage data to obtain U-type encyclopedic webpage data comprises the following steps:
extracting partial encyclopedic webpage data from the encyclopedic webpage data;
constructing a second training sample based on the partial encyclopedic webpage data;
training a second deep neural network model based on the second training samples;
classifying the encyclopedic webpage data by using the second deep neural network model to obtain U-type encyclopedic webpage data;
wherein the constructing a second training sample based on the portion of encyclopedic web page data comprises:
extracting preset information from each encyclopedia webpage in the part of encyclopedia webpage data;
classifying each encyclopedia webpage based on the preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer;
constructing the second training sample based on the M-class encyclopedia webpage data;
wherein the preset information comprises:
one or more of entry titles, entry subtitles, entry abstracts, information in entry infoboxes, and entry section titles.
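As an illustration of the feature-sentence construction in claim 1, the abstract of an entry can be segmented into sentences and filtered by the entry title. This is a minimal sketch; the function name `extract_feature_sentences` and the sentence delimiters are assumptions for illustration, not part of the claims:

```python
import re

def extract_feature_sentences(title, abstract):
    """Split an entry abstract into sentences and keep those containing
    the entry title. The delimiter set (Chinese and Latin sentence-ending
    punctuation) is an assumption; the claims do not fix it."""
    sentences = [s for s in re.split(r"[。！？!?]", abstract) if s.strip()]
    # A sentence that mentions the entry title is treated as a feature
    # sentence, i.e. a candidate carrier of an entity word and hypernym.
    return [s for s in sentences if title in s]

feats = extract_feature_sentences(
    "Python", "Python is a programming language! It is widely used?")
```

In practice the filtered sentences would then be handed to the marking step before vectorization.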
2. The method of claim 1, wherein constructing the second training sample based on the M-class encyclopedia web page data comprises:
extracting a group of feature words from each of the M types of encyclopedia web pages to obtain M groups of feature words, wherein each of the M groups of feature words comprises N feature words, the feature words are used for representing the category of the encyclopedia web page, and N is a positive integer;
and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are the second training samples.
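The M groups of N feature words in claim 2 might be assembled as follows; selecting feature words by raw frequency is an assumption made here for illustration, since the claims do not fix a selection criterion:

```python
from collections import Counter

def feature_word_vectors(class_docs, n):
    """For each class (mapping class name -> list of tokenized pages),
    take the N most frequent words as that class's feature words and use
    their counts as an N-dimensional vector. Frequency-based selection
    is a hypothetical choice; TF-IDF or chi-square would also fit."""
    vectors = {}
    for cls, docs in class_docs.items():
        counts = Counter(w for doc in docs for w in doc)
        top = counts.most_common(n)          # N feature words per class
        vectors[cls] = [c for _, c in top]   # N-dimensional vector
    return vectors
```

The M resulting vectors would then serve as the second training samples for the classification model.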
3. The method for extracting entity words and hypernyms according to claim 1 or 2, wherein the marking of the positions of the entity words and hypernyms in the feature sentences comprises:
detecting whether a feature sentence contains a first preset character and a second preset character;
if yes, marking the words before the first preset character in the feature sentence as entity word components, and marking the words after the second preset character in the feature sentence as hypernym components, wherein the words marked as entity word components form the entity words and the words marked as hypernym components form the hypernyms.
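The preset-character marking of claim 3 can be sketched with plain string splitting. The concrete characters (for example '是' and '一种' in a pattern such as 'X是一种Y', "X is a kind of Y") are hypothetical; the claims leave them unspecified:

```python
def mark_entity_hypernym(sentence, c1, c2):
    """If the sentence contains both preset characters, return the span
    before c1 as the entity word and the span after c2 as the hypernym.
    c1 and c2 are hypothetical preset characters; returns None when the
    sentence does not match the pattern."""
    if c1 in sentence and c2 in sentence:
        entity = sentence.split(c1, 1)[0]      # words before c1
        hypernym = sentence.rsplit(c2, 1)[1]   # words after c2
        return entity, hypernym
    return None
```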
4. The method of claim 3, wherein the marking the positions of the entity words and hypernyms in the feature sentences comprises:
detecting whether the feature sentence contains a third preset character and a fourth preset character;
if yes, marking the words before the third preset character in the feature sentence as entity word components, and marking the words between the third preset character and the fourth preset character in the feature sentence as hypernym components, wherein the words marked as entity word components form the entity words and the words marked as hypernym components form the hypernyms.
5. The method of claim 4, wherein the marking the positions of the entity words and hypernyms in the feature sentences comprises:
and marking the positions of the entity words and the hypernyms in the characteristic sentences based on the regular expressions.
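The regular-expression marking of claim 5 might look like the following; the concrete pattern 'X是一种Y' and the group names are illustrative assumptions, since the claims do not specify any particular expression:

```python
import re

# Hypothetical pattern for sentences of the form 'X是一种Y'
# ("X is a kind of Y"); named groups mark the two spans.
PATTERN = re.compile(r"^(?P<entity>.+?)是一种(?P<hypernym>.+?)$")

def mark_with_regex(sentence):
    """Return the (entity, hypernym) spans captured by the regular
    expression, or None when the sentence does not match."""
    m = PATTERN.match(sentence)
    if m:
        return m.group("entity"), m.group("hypernym")
    return None
```

A real system would likely maintain a list of such expressions, one per sentence pattern.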
6. The method of claim 5, wherein training a first deep neural network model based on the first training sample comprises:
extracting each feature sentence from each of the U feature sentence vectors;
extracting the entity words and the hypernyms from each feature sentence based on the positions of the entity words and the hypernyms in the feature sentence;
generating U entity word and hypernym vectors based on the extracted entity words and hypernyms;
and taking the U feature sentence vectors as input data of the first deep neural network model, taking the U entity word and hypernym vectors as output data of the first deep neural network model, and training the first deep neural network model.
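One plausible way to turn the marked positions of claim 6 into per-token output labels is a simple tagging scheme. The 'E'/'H'/'O' encoding below is an assumption, since the claims fix only the model's inputs and outputs, not how they are encoded:

```python
def tag_tokens(tokens, entity_words, hypernym_words):
    """Encode entity and hypernym spans as per-token tags: 'E' for
    entity word components, 'H' for hypernym components, 'O' otherwise.
    These tag sequences could serve as the output vectors that
    supervise the first deep neural network."""
    tags = []
    for tok in tokens:
        if tok in entity_words:
            tags.append("E")
        elif tok in hypernym_words:
            tags.append("H")
        else:
            tags.append("O")
    return tags
```

Pairing each feature sentence (input) with its tag sequence (output) yields the supervised samples for training.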
7. The method of claim 6, wherein extracting the entity words and hypernyms from the second webpage data using the first deep neural network model comprises:
extracting the text content of the second web page data;
performing sentence segmentation on the text content of the second web page data to obtain L sentences, wherein L is a positive integer;
and sequentially inputting the L sentences into the first deep neural network model, thereby extracting the entity words and the hypernyms from the second web page data.
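The extraction flow of claim 7, segmenting the page text into L sentences and feeding them to the trained model one by one, can be sketched as follows, with `model_predict` standing in for the first deep neural network:

```python
import re

def extract_from_page(text, model_predict):
    """Split page body text into sentences and run each through the
    trained model, collecting any (entity, hypernym) pairs it returns.
    model_predict is a hypothetical stand-in for the first deep neural
    network; the delimiter set is likewise an assumption."""
    sentences = [s for s in re.split(r"[。！？!?]", text) if s.strip()]
    pairs = []
    for s in sentences:          # sequential input, as in the claim
        result = model_predict(s)
        if result is not None:
            pairs.append(result)
    return pairs
```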
8. An apparatus for extracting entity words and hypernyms, comprising:
a construction unit, configured to construct a first training sample based on first web page data, and specifically configured to:
classifying the encyclopedia web page data to obtain U types of encyclopedia web page data, wherein U is a positive integer, and the first web page data are encyclopedia web page data; extracting entry abstracts from a first-type encyclopedia web page, and performing sentence segmentation on the entry abstracts; screening out, from the segmented sentences, sentences containing the entry title, wherein the sentences containing the entry title are feature sentences, the feature sentences contain entity words and hypernyms, and the first-type encyclopedia web page is any one of the U types of encyclopedia web pages; marking the positions of the entity words and the hypernyms in the feature sentences; and generating feature sentence vectors corresponding to the first-type encyclopedia web page based on the marked feature sentences to obtain U feature sentence vectors, wherein the U feature sentence vectors correspond one to one to the U types of encyclopedia web pages, and the U feature sentence vectors are the first training samples;
a training unit, configured to train a first deep neural network model based on the first training sample;
the extraction unit is used for extracting entity words and hypernyms from second webpage data by using the first deep neural network model, wherein the second webpage data comprise the first webpage data, and the hypernyms correspond to the entity words;
wherein the classifying the encyclopedia web page data to obtain U types of encyclopedia web page data comprises:
extracting partial encyclopedic webpage data from the encyclopedic webpage data; constructing a second training sample based on the partial encyclopedic webpage data; training a second deep neural network model based on the second training samples; classifying the encyclopedic webpage data by using the second deep neural network model to obtain U-type encyclopedic webpage data;
wherein the constructing a second training sample based on the portion of encyclopedic web page data comprises:
extracting preset information from each encyclopedia webpage in the part of encyclopedia webpage data; classifying each encyclopedia webpage based on the preset information to obtain M types of encyclopedia webpage data, wherein M is a positive integer; constructing the second training sample based on the M-class encyclopedia webpage data;
wherein the preset information comprises:
one or more of entry titles, entry subtitles, entry abstracts, information in entry infoboxes, and entry section titles.
9. The apparatus for extracting entity words and hypernyms of claim 8, wherein the constructing the second training sample based on the M-class encyclopedia web page data comprises:
extracting a group of feature words from each of the M types of encyclopedia web pages to obtain M groups of feature words, wherein each of the M groups of feature words comprises N feature words, the feature words are used for representing the category of the encyclopedia web page, and N is a positive integer; and generating M N-dimensional feature word vectors based on the M groups of feature words, wherein the M N-dimensional feature word vectors are the second training samples.
10. The apparatus for extracting entity words and hypernyms according to claim 8 or 9, wherein the marking of the positions of the entity words and hypernyms in the feature sentences comprises:
detecting whether a feature sentence contains a first preset character and a second preset character; if yes, marking the words before the first preset character in the feature sentence as entity word components, and marking the words after the second preset character in the feature sentence as hypernym components, wherein the words marked as entity word components form the entity words and the words marked as hypernym components form the hypernyms.
11. The apparatus for extracting entity words and hypernyms of claim 10, wherein the construction subunit is specifically configured to:
detecting whether the feature sentence contains a third preset character and a fourth preset character; if yes, marking the words before the third preset character in the feature sentence as entity word components, and marking the words between the third preset character and the fourth preset character in the feature sentence as hypernym components, wherein the words marked as entity word components form the entity words and the words marked as hypernym components form the hypernyms.
12. The apparatus for extracting entity words and hypernyms of claim 11, wherein the construction subunit is specifically configured to:
and marking the positions of the entity words and the hypernyms in the characteristic sentences based on the regular expressions.
13. The apparatus for extracting entity words and hypernyms of claim 12, wherein the training unit comprises:
a first extraction subunit, configured to extract each feature sentence from each of the U feature sentence vectors;
a second extraction subunit, configured to extract the entity words and the hypernyms from each feature sentence based on the positions of the entity words and the hypernyms in the feature sentence;
a generating subunit, configured to generate U entity word and hypernym vectors based on the extracted entity words and hypernyms;
and a training subunit, configured to take the U feature sentence vectors as input data of the first deep neural network model, take the U entity word and hypernym vectors as output data of the first deep neural network model, and train the first deep neural network model.
14. The apparatus for extracting entity words and hypernyms of claim 13, wherein the extracting unit comprises:
a third extraction subunit, configured to extract the text content of the second web page data;
a dividing subunit, configured to perform sentence segmentation on the text content of the second web page data to obtain L sentences, wherein L is a positive integer;
and an input subunit, configured to sequentially input the L sentences into the first deep neural network model, thereby extracting entity words and hypernyms from the second web page data.
CN201611247066.6A 2016-12-29 2016-12-29 Method and device for extracting entity words and hypernyms Active CN106649819B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611247066.6A CN106649819B (en) 2016-12-29 2016-12-29 Method and device for extracting entity words and hypernyms


Publications (2)

Publication Number Publication Date
CN106649819A CN106649819A (en) 2017-05-10
CN106649819B true CN106649819B (en) 2021-04-02

Family

ID=58835821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611247066.6A Active CN106649819B (en) 2016-12-29 2016-12-29 Method and device for extracting entity words and hypernyms

Country Status (1)

Country Link
CN (1) CN106649819B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019648B (en) * 2017-12-05 2021-02-02 深圳市腾讯计算机系统有限公司 Method and device for training data and storage medium
CN108038200A (en) * 2017-12-12 2018-05-15 北京百度网讯科技有限公司 Method and apparatus for storing data
CN108304501B (en) * 2018-01-17 2020-09-04 腾讯科技(深圳)有限公司 Invalid hypernym filtering method and device and storage medium
CN110059310B (en) * 2018-01-19 2022-10-28 腾讯科技(深圳)有限公司 Hypernym network expansion method and device, electronic equipment and storage medium
CN108280482B (en) * 2018-01-30 2020-10-16 广州小鹏汽车科技有限公司 Driver identification method, device and system based on user behaviors
CN110196982B (en) * 2019-06-12 2022-12-27 腾讯科技(深圳)有限公司 Method and device for extracting upper-lower relation and computer equipment
CN112560471A (en) * 2019-09-26 2021-03-26 北京国双科技有限公司 Method and system for acquiring related words of professional words
US11501070B2 (en) 2020-07-01 2022-11-15 International Business Machines Corporation Taxonomy generation to insert out of vocabulary terms and hypernym-hyponym pair induction

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103034693A (en) * 2012-12-03 2013-04-10 哈尔滨工业大学 Open-domain entity and entity type identification method
CN104809176A (en) * 2015-04-13 2015-07-29 中央民族大学 Entity relation extraction method for the Tibetan language
CN106126512A (en) * 2016-04-13 2016-11-16 北京天融信网络安全技术有限公司 Web page classification method and device based on ensemble learning

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN103778238B (en) * 2014-01-27 2015-03-04 西安交通大学 Method for automatically building classification tree from semi-structured data of Wikipedia
US9846836B2 (en) * 2014-06-13 2017-12-19 Microsoft Technology Licensing, Llc Modeling interestingness with deep neural networks
CN105808525B (en) * 2016-03-29 2018-06-29 国家计算机网络与信息安全管理中心 Domain concept hyponymy extraction method based on similar concept pairs
CN106055675B (en) * 2016-06-06 2019-10-29 杭州量知数据科技有限公司 Relation extraction method based on convolutional neural networks and distant supervision


Non-Patent Citations (1)

Title
Entity relation extraction from electronic medical records based on deep learning; Wu Jiawei et al.; Intelligent Computer and Applications; 2014-06-30; Vol. 4, No. 3, pp. 35-41 *

Also Published As

Publication number Publication date
CN106649819A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
CN106649819B (en) Method and device for extracting entity words and hypernyms
TWI695277B (en) Automatic website data collection method
Brum et al. Building a sentiment corpus of tweets in Brazilian Portuguese
Ardanuy et al. Structure-based clustering of novels
Tahsin Mayeesha et al. Deep learning based question answering system in Bengali
Velldal et al. NoReC: The norwegian review corpus
CN110442841A (en) Identify method and device, the computer equipment, storage medium of resume
JP5587821B2 (en) Document topic extraction apparatus, method, and program
CN107291694A Automatic composition scoring method and apparatus, storage medium and terminal
Kenny Human and machine translation
Shukla et al. Keyword extraction from educational video transcripts using NLP techniques
Sağlam et al. Developing Turkish sentiment lexicon for sentiment analysis using online news media
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
Karkar et al. Illustrate it! An Arabic multimedia text-to-picture m-learning system
Bleoancă et al. LSI based mechanism for educational videos retrieval by transcripts processing
Šauperl Pinning down a novel: characteristics of literary works as perceived by readers
Park et al. Automatic analysis of thematic structure in written English
CN104778162A (en) Subject classifier training method and system based on maximum entropy
Morie et al. Information extraction model to improve learning game metadata indexing
Viola et al. Machine Learning to Geographically Enrich Understudied Sources: A Conceptual Approach.
Poornima et al. Text preprocessing on extracted text from audio/video using R
Park et al. Text Processing Education Using a Block-Based Programming Language
CN112989068B (en) Knowledge graph construction method for Tang poetry knowledge and Tang poetry knowledge question-answering system
Otlogetswe Text Variability Measures in Corpus Design for Setswana Lexicography
Lampi Looking behind the text-to-be-seen: Analysing Twitter bots as electronic literature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant