CN111950271A - Phrase extraction method and device for unstructured text - Google Patents

Info

Publication number
CN111950271A
Authority
CN
China
Prior art keywords
phrase
word
extraction
text
unstructured text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910365420.2A
Other languages
Chinese (zh)
Inventor
周林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd

Landscapes

  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of language processing, and discloses a phrase extraction method and a device for an unstructured text, wherein the method comprises the following steps: generating phrase extraction rules for each phrase type; acquiring an unstructured text; and extracting phrases from the unstructured text according to the phrase extraction rules. The method effectively solves the problem of extracting the phrases of the required types from the unstructured text by establishing the phrase extraction rule, can obtain a large number of phrases for enriching the composition material library, and has high collection efficiency compared with a manual collection mode.

Description

Phrase extraction method and device for unstructured text
Technical Field
The invention belongs to the technical field of language processing, and particularly relates to a phrase extraction method and device for an unstructured text.
Background
In current Chinese composition teaching, the importance of composition materials is repeatedly emphasized. As the saying goes, even the cleverest housewife cannot cook a meal without rice: to write a composition well, the writer needs a rich stock of composition materials.
At present, students accumulate composition materials mainly by reading extracurricular books and recording phrases from them for flexible use in later compositions. However, the number of books a student reads is limited, so the accumulated phrase materials are not enough to support writing various types of compositions. There is therefore a need to provide students with a library of composition phrase materials that they can study directly. The prior art offers no method for quickly collecting composition phrase materials; they are collected manually, which consumes a large amount of manpower and material resources and is inefficient.
Disclosure of Invention
The invention aims to provide a phrase extraction method and a device for an unstructured text, which effectively solve the problem of extracting phrases of required types from the unstructured text by establishing a phrase extraction rule and have high collection efficiency compared with a manual collection mode.
The technical scheme provided by the invention is as follows:
in one aspect, a phrase extraction method for unstructured text is provided, which includes:
generating phrase extraction rules for each phrase type;
acquiring an unstructured text;
and extracting phrases from the unstructured text according to the phrase extraction rules.
Further preferably, the generating phrase extraction rules for each phrase type specifically includes:
establishing a phrase type library, wherein the phrase type library comprises a plurality of phrase types;
acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
and generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type.
Further preferably, the generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type specifically includes:
segmenting each training text in the training sample set to obtain each word corresponding to each training text, the part of speech of the word and the position sequence of the word;
analyzing and obtaining phrase extraction characteristics corresponding to each phrase type according to the extracted phrases of each training text, wherein the phrase extraction characteristics comprise part-of-speech combination characteristics and word position characteristics;
and generating a phrase extraction rule corresponding to each phrase type according to the obtained extraction features by using a machine learning method.
Further preferably, the method further comprises the following steps:
acquiring basic words;
the extracting phrases from the unstructured text according to the phrase extraction rule specifically includes:
and extracting phrases containing the basic words from the unstructured text according to the phrase extraction rules and the basic words.
Further preferably, the extracting, from the unstructured text, a phrase including the basic word according to the phrase extraction rule and the basic word specifically includes:
finding the base word in the unstructured text;
on the basis of the basic words, finding out target words which accord with extraction characteristics from the unstructured text according to the phrase extraction rules and the parts of speech of the basic words;
and combining the basic words and the target words to obtain phrases containing the basic words.
In another aspect, a phrase extracting apparatus for unstructured text is also provided, including:
the rule generating module is used for generating phrase extraction rules of each phrase type;
the text acquisition module is used for acquiring the unstructured text;
and the phrase extraction module is used for extracting phrases from the unstructured text according to the phrase extraction rules.
Further preferably, the rule generating module includes:
the phrase library establishing unit is used for establishing a phrase type library, and the phrase type library comprises a plurality of phrase types;
the sample set acquisition unit is used for acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
and the rule generating unit is used for generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type.
Further preferably, the rule generating unit includes:
the word segmentation subunit is used for performing word segmentation on each training text in the training sample set to obtain each word corresponding to each training text, the part of speech of the word and the position sequence of the word;
the feature analysis subunit is used for analyzing and obtaining phrase extraction features corresponding to each phrase type according to the extracted phrases of each training text, wherein the phrase extraction features comprise part-of-speech combination features and word position features;
and the rule generating subunit is used for generating a phrase extraction rule corresponding to each phrase type according to the obtained extraction features by using a machine learning method.
Further preferably, the apparatus further includes:
the word acquisition module is used for acquiring basic words;
the phrase extraction module is further configured to extract phrases containing the basic words from the unstructured text according to the phrase extraction rules and the basic words.
Further preferably, the phrase extraction module comprises:
a basic word searching unit, configured to find the basic word in the unstructured text;
the target word searching unit is used for searching a target word which accords with the extraction characteristics from the unstructured text on the basis of the basic word according to the phrase extraction rule and the part of speech of the basic word;
and the word combination unit is used for combining the basic words and the target words to obtain phrases containing the basic words.
Compared with the prior art, the phrase extraction method and the device for the unstructured text have the advantages that: the method effectively solves the problem of extracting the phrases of the required types from the unstructured text by establishing the phrase extraction rule, can obtain a large number of phrases for enriching the composition material library, and has high collection efficiency compared with a manual collection mode.
Drawings
The above features, technical features, advantages and implementations of a method and apparatus for phrase extraction for unstructured text will be further described in the following detailed description of preferred embodiments in a clearly understandable manner, in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of a first embodiment of a phrase extraction method for unstructured text according to the present invention;
FIG. 2 is a flowchart illustrating a phrase extraction method for unstructured text according to a second embodiment of the present invention;
FIG. 3 is a flowchart illustrating a phrase extraction method for unstructured text according to a third embodiment of the present invention;
FIG. 4 is a flowchart illustrating a phrase extraction method for unstructured text according to a fourth embodiment of the present invention;
FIG. 5 is a flow chart of a fifth embodiment of the phrase extraction method for unstructured text according to the present invention;
FIG. 6 is a block diagram illustrating the structure of an embodiment of a phrase extracting apparatus for unstructured text.
Description of the reference numerals
100. rule generation module; 110. phrase library establishing unit;
120. sample set acquisition unit; 130. rule generating unit;
131. word segmentation subunit; 132. feature analysis subunit;
133. rule generating subunit; 200. text acquisition module;
300. phrase extraction module; 310. basic word searching unit;
320. target word searching unit; 330. word combination unit;
400. word acquisition module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
According to a first embodiment provided by the present invention, as shown in fig. 1, a phrase extraction method for unstructured text comprises:
S100, generating phrase extraction rules of each phrase type;
Specifically, the phrase types include modifier-head phrases, complement phrases, subject-predicate phrases, coordinate phrases, verb-object phrases, and so on.
A modifier-head phrase consists of a modifier and a head word, in which one component modifies and the other is modified; it is a phrase in which a verb, noun, or adjective is preceded by a modifier. The types include: attributive + head word (noun/adjective + pronoun) and adverbial + head word (verb + adjective). For example: a beautiful mountain river, a heavy and overlapping dense river, a morning break before courage, a dense cloud evening, etc.
A complement phrase consists of a verb or an adjective and its complement, and the complement provides supplementary explanation. The types include verb + complement and adjective + complement. For example: beautiful and extremely beautiful, very smart and the like.
A subject-predicate phrase consists of a subject and a predicate that stand in a subject-predicate relation. The types include noun + verb and noun + adjective. For example: bright sunshine, crystal dewdrops, brilliant results, joyful mood, etc.
A coordinate phrase is generally formed by combining two or more nouns, verbs, adjectives, or the like that stand in a parallel relation, commonly joined by a conjunction such as "and"; the constituent words of a coordinate phrase are generally required to have the same part of speech. The types include: noun + noun, verb + verb, adjective + adjective, pronoun + pronoun. For example: group mutual aid, coordination and the like.
A verb-object phrase is formed by combining a verb with the component it governs, which follows the verb. The governing component is the verb; the governed component is the object, which denotes the person or thing involved in the action and is commonly a noun, pronoun, or the like. The types include: verb + noun, verb + pronoun, verb + verb, verb + adjective. For example: like swimming, calm back, etc.
A phrase extraction rule is generated for each phrase type according to the characteristics of that type, such as part-of-speech combination features and position features.
S200, acquiring an unstructured text;
Specifically, unstructured text is unstructured data in textual form (characters, digits, punctuation, various printable symbols, and so on).
The obtained unstructured text can be text data on a webpage, various articles stored in the intelligent terminal, and documents in a library database.
S300, extracting phrases from the unstructured text according to the phrase extraction rules.
Specifically, after the corresponding phrase extraction rule is established according to the characteristics of each phrase type, the corresponding phrase of the type can be extracted from the unstructured text according to the phrase extraction rule corresponding to each phrase type.
According to the method and the device, the phrase extraction rule is established, the problem of extracting the phrases of the needed types from the unstructured text is effectively solved, a large number of phrases can be obtained to enrich the composition material library, and the collection efficiency is high compared with a manual collection mode.
According to a second embodiment provided by the present invention, as shown in fig. 2, a phrase extraction method for unstructured text comprises:
S110, establishing a phrase type library, wherein the phrase type library comprises a plurality of phrase types;
Specifically, the phrase types included in the phrase type library are the modifier-head phrases, complement phrases, subject-predicate phrases, coordinate phrases, verb-object phrases, and so on described in the first embodiment.
S120, acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
Specifically, the training text is an unstructured text, such as a sentence taken from a book or a passage taken from an article.
All phrases of the corresponding type are extracted from the training text according to the phrase type to be extracted. The training text together with the phrases extracted from it forms one training sample, and each training sample set contains a plurality of training samples.
For example, the training text is "crossing the overlapped mountains, all things along the way are vigorous, trees are strong and green, green shade is cool, and all the way down the south is surrounded by fragrance".
If the extraction rule for modifier-head phrases is to be obtained, the phrase extracted from the training text is "overlapped mountains". The training text and the extracted modifier-head phrase form a training sample for modifier-head phrases.
If the extraction rule for subject-predicate phrases is to be obtained, the phrases extracted from the training text are "all things are vigorous", "trees are strong", and "green shade is cool". The training text and the extracted subject-predicate phrases form a training sample for subject-predicate phrases.
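The structure of a training sample set described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names TrainingSample and sample_sets and the phrase type keys are hypothetical, and the English strings are glosses standing in for the Chinese training text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TrainingSample:
    """One training sample: a training text plus the phrases extracted from it."""
    text: str
    phrases: list = field(default_factory=list)


# Hypothetical container: phrase type name -> training sample set.
sample_sets = defaultdict(list)

# English gloss of the training text used in this embodiment.
text = ("crossing the overlapped mountains, all things along the way are "
        "vigorous, trees are strong and green, green shade is cool")

# The same text yields a different sample for each phrase type.
sample_sets["modifier-head"].append(
    TrainingSample(text, ["overlapped mountains"]))
sample_sets["subject-predicate"].append(
    TrainingSample(text, ["all things are vigorous",
                          "trees are strong",
                          "green shade is cool"]))
```

Keeping one sample set per phrase type mirrors step S120, where each phrase type has its own training sample set.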
S130, generating phrase extraction rules corresponding to each phrase type according to the training sample set of each phrase type;
Specifically, after the training sample set of each phrase type is obtained, a pre-established phrase extraction model is trained on that training sample set. The pre-established phrase extraction model may be an open-source model algorithm available online. The phrase extraction model obtained by training on a large number of training samples serves as the generated phrase extraction rule.
S200, acquiring an unstructured text;
S300, extracting phrases from the unstructured text according to the phrase extraction rules.
In this embodiment, the phrase extraction model is trained through the training sample set to generate the phrase extraction rule, so that the phrases extracted according to the phrase extraction rule more meet the user requirements, and the extraction accuracy is higher.
According to a third embodiment provided by the present invention, as shown in fig. 3, a phrase extraction method for unstructured text includes:
S110, establishing a phrase type library, wherein the phrase type library comprises a plurality of phrase types;
S120, acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
S131, segmenting each training text in the training sample set to obtain each word corresponding to each training text, the part of speech of the word and the position sequence of the word;
Specifically, an existing word segmentation tool is used to segment the training texts in the training sample set, yielding for each training text a word vector in which the part of speech of each word is marked. The word vector is [word n1, word n2, word n3, ..., word ni], i ∈ N, and the words n1, n2, n3, ..., ni are arranged in the order of their positions in the training text, which is the position sequence of the words.
For example, the training text is "crossing the overlapped mountains, all things along the way are vigorous, trees are strong and green, green shade is cool, and all the way down the south is surrounded by fragrance".
The word vector obtained by segmenting "crossing the overlapped mountains" is [crossing/v, overlapped/a, 的/u, mountain/n]. The word vector obtained by segmenting "all things along the way are vigorous" is [along/n, all things/n, vigorous/v]. The word vector obtained by segmenting "trees are strong and green" is [tree/n, rich/a]. The word vector obtained by segmenting "green shade is cool" is [green shade/n, cooling/a].
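A word vector of this kind, with each word carrying its part of speech and its position, can be represented as in the sketch below. The sketch assumes that the segmentation tool's output is available as a "word/pos" string separated by spaces; Token and parse_tagged are hypothetical names, the English words are glosses of the Chinese tokens, and "de" stands in for the structural particle tagged /u.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str      # surface form of the word
    pos: str       # part-of-speech tag, e.g. "n", "v", "a", "u"
    position: int  # index of the word in the segmented text


def parse_tagged(tagged):
    """Parse 'word/pos word/pos ...' into position-ordered tokens."""
    tokens = []
    for position, item in enumerate(tagged.split()):
        # Split on the last slash so a word may itself contain '/'.
        word, _, pos = item.rpartition("/")
        tokens.append(Token(word, pos, position))
    return tokens


# Assumed serialized segmenter output for "crossing the overlapped mountains".
tokens = parse_tagged("crossing/v overlapped/a de/u mountain/n")
```

The list order is the position sequence of the words, so tokens[3] is the noun "mountain" at position 3.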
S132, analyzing the phrases extracted from each training text to obtain phrase extraction features corresponding to each phrase type, wherein the phrase extraction features comprise part-of-speech combination features and word position features;
Specifically, after the training text has been segmented to obtain each word in the training text, the part of speech of each word, and the position sequence of the words, the phrase extraction features corresponding to each phrase type can be analyzed in combination with the phrases extracted from the training text. The phrase extraction rules for the phrase types may be different or partially the same, and the phrase extraction features may include part-of-speech combination features, word position features, features of the connecting words between words, and the like.
For example, for the training text "crossing the overlapped mountains", the word vector obtained by word segmentation is [crossing/v, overlapped/a, 的/u, mountain/n] and the extracted modifier-head phrase is "overlapped mountains". From the segmentation result, "overlapped" in the extracted phrase is an adjective and "mountain" is a noun. The extraction feature obtained from the training text and the extracted phrase is therefore: a phrase combination of adjective + noun, with the adjective in front and the noun behind, and with a structural particle between the adjective and the noun.
As another example, the training text is "the student carefully checks the homework" and the extracted modifier-head phrase is "carefully check". The training text is segmented to obtain the word vector [student/n, carefully/a, check/v, homework/n]. From the segmentation result, "carefully" in the extracted phrase is an adjective and "check" is a verb, so the extraction feature obtained from the training text and the extracted phrase is: a phrase combination of adjective + verb, with the adjective in front and the verb behind.
Similarly, other types of phrases are also analyzed according to the training text and the extracted phrases to obtain extracted features of the phrases.
In the same training sample set, different extraction features may be obtained according to different training texts, that is, the extraction features include a plurality of features.
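How part-of-speech combination and word position features might be read off a training sample can be sketched as follows. This is only an illustration under stated assumptions: the patent generates rules with a machine learning method that it does not specify, whereas this sketch simply tabulates the ordered tag sequence of each extracted phrase. The function names are hypothetical, the words are English glosses, and "de" stands in for the structural particle tagged /u.

```python
def find_span(tokens, phrase_words):
    """Return the start index where phrase_words occurs contiguously, or -1."""
    words = [word for word, _ in tokens]
    n = len(phrase_words)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == phrase_words:
            return start
    return -1


def pos_pattern(tokens, phrase_words):
    """Ordered part-of-speech tags of the extracted phrase, or None if absent."""
    start = find_span(tokens, phrase_words)
    if start < 0:
        return None
    return tuple(pos for _, pos in tokens[start:start + len(phrase_words)])


def collect_features(samples):
    """Collect the distinct tag patterns seen across one training sample set."""
    features = set()
    for tokens, phrase_words in samples:
        pattern = pos_pattern(tokens, phrase_words)
        if pattern is not None:
            features.add(pattern)
    return features


# Two modifier-head samples: (tagged training text, extracted phrase words).
samples = [
    ([("crossing", "v"), ("overlapped", "a"), ("de", "u"), ("mountain", "n")],
     ["overlapped", "de", "mountain"]),
    ([("student", "n"), ("carefully", "a"), ("check", "v"), ("homework", "n")],
     ["carefully", "check"]),
]
features = collect_features(samples)
```

Each tuple records which parts of speech combine and in what order, so one sample set can yield several extraction features, here ("a", "u", "n") and ("a", "v").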
S133, generating phrase extraction rules corresponding to each phrase type according to the obtained extraction features by using a machine learning method;
Specifically, using an existing machine learning method, a pre-established phrase extraction model is trained on the training sample set and the corresponding extraction features, thereby generating the phrase extraction rule corresponding to each phrase type.
S200, acquiring an unstructured text;
S300, extracting phrases from the unstructured text according to the phrase extraction rules.
Specifically, phrases of corresponding types can be extracted from the unstructured text according to phrase extraction rules corresponding to each phrase type, such as part-of-speech combination characteristics, word position characteristics, word connection characteristics, and the like.
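Applying such rules to segmented text can be sketched as a sliding-window match over part-of-speech tags. The sketch assumes that a rule is a list of ordered tag patterns and that the unstructured text has already been segmented and tagged; the trained model that the patent describes is not reproduced here, and extract_phrases is a hypothetical name. The words are English glosses, with "de" standing in for the structural particle tagged /u.

```python
def extract_phrases(tokens, patterns):
    """Return every contiguous word run whose tags match one of the patterns."""
    tags = [pos for _, pos in tokens]
    found = []
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            # A slice that runs past the end is shorter and cannot match.
            if tuple(tags[start:end]) == tuple(pattern):
                found.append(" ".join(word for word, _ in tokens[start:end]))
    return found


# Assumed modifier-head rule: adjective + particle + noun, adjective + verb.
patterns = [("a", "u", "n"), ("a", "v")]
tokens = [("crossing", "v"), ("overlapped", "a"), ("de", "u"), ("mountain", "n"),
          ("student", "n"), ("carefully", "a"), ("check", "v"), ("homework", "n")]
phrases = extract_phrases(tokens, patterns)
```

With these inputs the matches are "overlapped de mountain" and "carefully check", one for each pattern.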
According to the embodiment, a plurality of extraction features can be obtained according to different training texts, and a plurality of phrases in the same phrase type can be extracted through the extraction features, so that the extracted phrases are richer, the requirements of users can be met, and the use experience of the users is improved.
According to a fourth embodiment provided by the present invention, as shown in fig. 4, a phrase extraction method for unstructured text includes:
S100, generating phrase extraction rules of each phrase type;
S200, acquiring an unstructured text;
S210, acquiring basic words;
S310, extracting phrases containing the basic words from the unstructured text according to the phrase extraction rules and the basic words.
Specifically, when a user needs to search the unstructured text for a phrase containing a certain word, the word (i.e., the basic word) that the user wants to contain can be input, and then the phrase containing the basic word can be extracted from the unstructured text according to the generated phrase extraction rule of each phrase type and the input basic word.
For example, if the user needs to extract a modifier-head phrase containing "mountain", then "mountain" is the basic word, and phrases containing "mountain" are extracted from the unstructured text according to the phrase extraction rule for modifier-head phrases, such as "beautiful mountain", "high mountain", "tall mountain", "deep mountain", "great mountain", and so on.
According to the embodiment, the basic words and the generated phrase extraction rules can help the user to search the phrases containing the basic words, so that the user is helped to complete answers to various question types, the learning of the user is assisted, and the use experience of the user is improved.
According to a fifth embodiment provided by the present invention, as shown in fig. 5, a phrase extraction method for unstructured text comprises:
S100, generating phrase extraction rules of each phrase type;
S200, acquiring an unstructured text;
S210, acquiring basic words;
S311, finding the basic words in the unstructured text;
S312, based on the basic words, finding out target words which accord with extraction characteristics from the unstructured text according to the phrase extraction rules and the parts of speech of the basic words;
S313, combining the basic word and the target word to obtain a phrase containing the basic word.
Specifically, when searching for a phrase containing a basic word in an unstructured text, the basic word is searched in the unstructured text, then a target word which meets the extraction characteristics is found in the unstructured text based on the basic word and according to a phrase extraction rule corresponding to the type of the phrase to be extracted, and then the basic word and the target word are combined to obtain the phrase containing the basic word.
For example, the user needs to extract a modifier-head phrase containing "mountain", which is a noun, and the extraction features of modifier-head phrases include:
First: a phrase combination of adjective + verb, with the adjective in front and the verb behind;
Second: a phrase combination of adjective + noun, with the adjective in front and the noun behind.
Since "mountain" is a noun, phrases containing "mountain" should be extracted from the unstructured text according to the second extraction feature. According to the second extraction feature, the adjective in front of "mountain" is searched for on the basis of "mountain", and that adjective is the target word. Finally, the adjective and "mountain" are combined according to the word position feature, that is, adjective in front and noun behind, to obtain a phrase containing "mountain".
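Steps S311 to S313 can be sketched as follows, assuming that the extraction features are ordered part-of-speech patterns and that the text is already segmented and tagged. The function name, the sample sentence, and the tag set are illustrative assumptions rather than part of the patent.

```python
def extract_with_base_word(tokens, base_word, patterns):
    """Find phrases that match a pattern and contain the base word."""
    results = []
    for index, (word, pos) in enumerate(tokens):
        if word != base_word:          # S311: locate the base word
            continue
        for pattern in patterns:
            # S312: only a pattern slot with the base word's tag can apply.
            for slot, tag in enumerate(pattern):
                if tag != pos:
                    continue
                start = index - slot
                end = start + len(pattern)
                if start < 0 or end > len(tokens):
                    continue
                window = tokens[start:end]
                if tuple(t for _, t in window) == pattern:
                    # S313: combine target word(s) and base word in order.
                    results.append(" ".join(w for w, _ in window))
    return results


# The two modifier-head features from the example above.
patterns = [("a", "v"), ("a", "n")]
tokens = [("we", "r"), ("climb", "v"), ("high", "a"), ("mountain", "n"),
          ("and", "c"), ("see", "v"), ("deep", "a"), ("mountain", "n")]
phrases = extract_with_base_word(tokens, "mountain", patterns)
```

Because "mountain" is tagged as a noun, only the adjective + noun pattern can apply, and the adjective directly in front of each occurrence becomes the target word.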
According to the embodiment, the basic words and the generated phrase extraction rules can help the user to search the phrases containing the basic words, so that the user is helped to complete answers to various question types, the learning of the user is assisted, and the use experience of the user is improved.
According to a sixth embodiment provided by the present invention, as shown in fig. 6, a phrase extracting apparatus for unstructured text comprises:
a rule generating module 100 for generating a phrase extraction rule for each phrase type;
Specifically, the phrase types include modifier-head phrases, complement phrases, subject-predicate phrases, coordinate phrases, verb-object phrases, and so on.
A modifier-head phrase consists of a modifier and a head word, in which one component modifies and the other is modified; it is a phrase in which a verb, noun, or adjective is preceded by a modifier. The types include: attributive + head word (noun/adjective + pronoun) and adverbial + head word (verb + adjective). For example: a beautiful mountain river, a heavy and overlapping dense river, a morning break before courage, a dense cloud evening, etc.
A complement phrase consists of a verb or an adjective and its complement, and the complement provides supplementary explanation. The types include verb + complement and adjective + complement. For example: beautiful and extremely beautiful, very smart and the like.
A subject-predicate phrase consists of a subject and a predicate that stand in a subject-predicate relation. The types include noun + verb and noun + adjective. For example: bright sunshine, crystal dewdrops, brilliant results, joyful mood, etc.
A coordinate phrase is generally formed by combining two or more nouns, verbs, adjectives, or the like that stand in a parallel relation, commonly joined by a conjunction such as "and"; the constituent words of a coordinate phrase are generally required to have the same part of speech. The types include: noun + noun, verb + verb, adjective + adjective, pronoun + pronoun. For example: group mutual aid, coordination and the like.
A verb-object phrase is formed by combining a verb with the component it governs, which follows the verb. The governing component is the verb; the governed component is the object, which denotes the person or thing involved in the action and is commonly a noun, pronoun, or the like. The types include: verb + noun, verb + pronoun, verb + verb, verb + adjective. For example: like swimming, calm back, etc.
A phrase extraction rule is generated for each phrase type according to the characteristics of that type, such as part-of-speech combination features and position features.
A text obtaining module 200, configured to obtain an unstructured text;
Specifically, unstructured text is unstructured data in textual form (characters, digits, punctuation, various printable symbols, and so on).
The obtained unstructured text can be text data on a webpage, various articles stored in the intelligent terminal, and documents in a library database.
A phrase extraction module 300, configured to extract a phrase from the unstructured text according to the phrase extraction rule.
Specifically, after the corresponding phrase extraction rule is established according to the characteristics of each phrase type, the corresponding phrase of the type can be extracted from the unstructured text according to the phrase extraction rule corresponding to each phrase type.
According to the method and the device, the phrase extraction rule is established, the problem of extracting the phrases of the needed types from the unstructured text is effectively solved, a large number of phrases can be obtained to enrich the composition material library, and the collection efficiency is high compared with a manual collection mode.
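The cooperation of modules 100, 200 and 300 can be sketched with three small classes. This is a hypothetical arrangement under stated assumptions: a rule is taken to be a list of ordered part-of-speech patterns, the rules are hard-coded rather than learned, and the text is supplied already segmented and tagged. The class and method names are not taken from the patent.

```python
class RuleGenerationModule:
    """Module 100: produces phrase extraction rules for each phrase type."""

    def generate(self):
        # Assumption: hand-written patterns stand in for learned rules.
        return {"modifier-head": [("a", "n")],
                "subject-predicate": [("n", "a")]}


class TextAcquisitionModule:
    """Module 200: supplies unstructured text, here already tagged."""

    def __init__(self, tagged_tokens):
        self._tokens = list(tagged_tokens)

    def acquire(self):
        return list(self._tokens)


class PhraseExtractionModule:
    """Module 300: applies the rules of every phrase type to the text."""

    def extract(self, tokens, rules):
        tags = [pos for _, pos in tokens]
        result = {}
        for phrase_type, patterns in rules.items():
            hits = []
            for start in range(len(tokens)):
                for pattern in patterns:
                    end = start + len(pattern)
                    if tuple(tags[start:end]) == pattern:
                        hits.append(
                            " ".join(w for w, _ in tokens[start:end]))
            result[phrase_type] = hits
        return result


rules = RuleGenerationModule().generate()
tokens = TextAcquisitionModule(
    [("beautiful", "a"), ("mountain", "n"),
     ("mood", "n"), ("joyful", "a")]).acquire()
extracted = PhraseExtractionModule().extract(tokens, rules)
```

The glosses keep the Chinese word order, so the subject-predicate match reads "mood joyful" (noun followed by adjective).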
Preferably, the rule generating module 100 includes:
a phrase library establishing unit 110, configured to establish a phrase type library, where the phrase type library includes a plurality of phrase types;
Specifically, the phrase types included in the phrase type library are the various phrase types described in the first embodiment, such as the modifier-head phrase, the complement phrase, the subject-predicate phrase, the coordinate phrase, and the verb-object phrase.
A sample set obtaining unit 120, configured to obtain a training sample set for each phrase type, where the training sample set includes a training text and extracted phrases;
Specifically, the training text is an unstructured text, such as a sentence taken from a book or a passage taken from an article.
All phrases of the type to be extracted are extracted from the training text. The training text together with the phrases extracted from it forms one training sample, and each training sample set comprises a plurality of training samples.
For example, the training text is "crossing the overlapped mountains, all things along the way are vigorous, trees are strong and green, green shade is cool, and all the way down south is surrounded by fragrance".
If the extraction rule for modifier-head phrases is to be obtained, the phrase extracted from the training text is "overlapped mountains". The training text and the extracted modifier-head phrase form a training sample for modifier-head phrases.
If the extraction rule for subject-predicate phrases is to be obtained, the phrases extracted from the training text are "all things are vigorous", "trees are strong and green", and "green shade is cool". The training text and the extracted subject-predicate phrases form a training sample for subject-predicate phrases.
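One way to hold the training sample sets described above is a mapping from phrase type to a list of samples, each pairing a training text with its extracted phrases. This is only an illustrative sketch: the class `TrainingSample` and the helper `count_phrases` are hypothetical names, and the English strings are the translated example from the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingSample:
    text: str           # the unstructured training text
    phrases: List[str]  # phrases of one type extracted from that text


TEXT = (
    "crossing the overlapped mountains, all things along the way are vigorous, "
    "trees are strong and green, green shade is cool"
)

# One training sample set per phrase type.
sample_sets: Dict[str, List[TrainingSample]] = {
    "modifier-head": [TrainingSample(TEXT, ["overlapped mountains"])],
    "subject-predicate": [
        TrainingSample(
            TEXT,
            ["all things are vigorous", "trees are strong and green", "green shade is cool"],
        )
    ],
}


def count_phrases(sets: Dict[str, List[TrainingSample]]) -> Dict[str, int]:
    """Count the labelled phrases available for each phrase type."""
    return {ptype: sum(len(s.phrases) for s in samples) for ptype, samples in sets.items()}


print(count_phrases(sample_sets))  # {'modifier-head': 1, 'subject-predicate': 3}
```

The same training text can appear in several sample sets, once for each phrase type labelled in it.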
And a rule generating unit 130, configured to generate a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type.
Specifically, after the training sample set of each phrase type is obtained, a pre-established phrase extraction model is trained with the training sample set. The pre-established phrase extraction model may be an open-source model algorithm that is available online. The phrase extraction model obtained after training on a large number of training samples is the generated phrase extraction rule.
The phrase extraction rule is generated by training the phrase extraction model through the training sample set, so that the phrases extracted according to the phrase extraction rule are more in line with the requirements of users, and the extraction precision is higher.
Preferably, the rule generating unit 130 includes:
a word segmentation subunit 131, configured to perform word segmentation on each training text in the training sample set to obtain each word, a part of speech of the word, and a position sequence of the word corresponding to each training text;
Specifically, an existing word segmentation tool is used to segment the training texts in the training sample set, yielding a word vector for each training text in which the part of speech of each word is marked. The word vector has the form [word n1, word n2, word n3, ..., word ni], i ∈ N, and the words n1, n2, n3, ..., ni are arranged in the order of their positions in the training text, which is the position sequence of the words.
For example, the training text is "crossing the overlapped mountains, all things along the way are vigorous, trees are strong and green, green shade is cool, and all the way down south is surrounded by fragrance".
Segmenting "crossing the overlapped mountains" yields the word vector [crossing/v, overlapped/a, (structural auxiliary)/u, mountains/n]. Segmenting "all things along the way are vigorous" yields [along the way/n, all things/n, vigorous/v]. Segmenting "trees are strong and green" yields [trees/n, strong and green/a]. Segmenting "green shade is cool" yields [green shade/n, cool/a].
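The word vectors above can be represented in code as ordered lists of tokens that carry the word, its part-of-speech tag, and its position. The sketch below assumes the segmentation tool has already produced "word/tag" strings; it does not perform segmentation itself. `Token` and `parse_tagged` are hypothetical names, and `AUX` is a placeholder for the structural auxiliary word.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str    # part-of-speech tag, e.g. v, a, u, n
    index: int  # position of the word in the training text


def parse_tagged(tagged: str) -> List[Token]:
    """Turn 'word/tag, word/tag, ...' into an ordered list of tokens."""
    tokens = []
    for index, item in enumerate(tagged.split(",")):
        # rpartition splits on the last slash, so the tag is always the final part.
        word, _, pos = item.strip().rpartition("/")
        tokens.append(Token(word, pos, index))
    return tokens


# "AUX" is a placeholder for the structural auxiliary word in the example.
vector = parse_tagged("crossing/v, overlapped/a, AUX/u, mountains/n")
print([t.pos for t in vector])  # ['v', 'a', 'u', 'n']
```

Storing the index explicitly keeps the position sequence available even after tokens are filtered or regrouped.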
A feature analysis subunit 132, configured to analyze, according to the extracted phrases of each training text, to obtain phrase extraction features corresponding to each phrase type, where the phrase extraction features include part-of-speech combination features and word position features;
Specifically, after word segmentation of the training text yields each word, the part of speech of each word, and the position sequence of the words, the phrase extraction features corresponding to each phrase type can be analyzed by combining this result with the phrases extracted from the training text. The phrase extraction rules of different phrase types may differ entirely or be partially the same, and the phrase extraction features may include part-of-speech combination features, word position features, connection features between words, and the like.
For example, for the training text "crossing the overlapped mountains", the word vector obtained by word segmentation is [crossing/v, overlapped/a, (structural auxiliary)/u, mountains/n], and the extracted modifier-head phrase is "overlapped mountains". According to the segmentation result, "overlapped" in the extracted phrase is an adjective and "mountains" is a noun. The extraction feature obtained from the training text and the extracted phrase is therefore an adjective + noun phrase combination in which the adjective precedes the noun and a structural auxiliary word stands between them.
As another example, the training text is "students carefully check homework" and the extracted modifier-head phrase is "carefully check". Segmenting the training text yields the word vector [students/n, careful/a, check/v, homework/n]. According to the segmentation result, "careful" in the extracted phrase is an adjective and "check" is a verb. The extraction feature obtained from the training text and the extracted phrase is therefore an adjective + verb phrase combination in which the adjective precedes the verb.
Similarly, other types of phrases are also analyzed according to the training text and the extracted phrases to obtain extracted features of the phrases.
In the same training sample set, different extraction features may be obtained according to different training texts, that is, the extraction features include a plurality of features.
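The feature analysis described above can be sketched as locating the extracted phrase inside the word vector and reading off the part-of-speech tags of the matching span. This is a simplified illustration, not the analysis method of the source: it assumes the phrase is given as the list of its segmented words and occurs contiguously. `pos_pattern` is a hypothetical name and `AUX` is a placeholder for the structural auxiliary word.

```python
from typing import List, Optional, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def pos_pattern(tokens: List[Token], phrase_words: Sequence[str]) -> Optional[Tuple[str, ...]]:
    """Return the tag sequence of the span matching the phrase, or None."""
    words = [word for word, _ in tokens]
    n = len(phrase_words)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == list(phrase_words):
            # Tag order encodes the position feature (e.g. adjective before noun).
            return tuple(pos for _, pos in tokens[start:start + n])
    return None


mountains = [("crossing", "v"), ("overlapped", "a"), ("AUX", "u"), ("mountains", "n")]
homework = [("students", "n"), ("careful", "a"), ("check", "v"), ("homework", "n")]

print(pos_pattern(mountains, ["overlapped", "AUX", "mountains"]))  # ('a', 'u', 'n')
print(pos_pattern(homework, ["careful", "check"]))                 # ('a', 'v')
```

The two results correspond to the two extraction features in the examples: adjective + structural auxiliary + noun, and adjective + verb.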
A rule generating subunit 133, configured to generate a phrase extraction rule corresponding to each phrase type according to the obtained extraction features by using a machine learning method.
Specifically, the phrase extraction rules corresponding to each phrase type can be generated by training a pre-established phrase extraction model according to the training sample set and the corresponding extraction features by using the existing machine learning method.
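The source does not name the machine learning method or the model. As a stand-in only, the sketch below aggregates observed part-of-speech patterns by frequency and keeps those seen at least `min_count` times for each phrase type. The function `build_rules`, the threshold, and the observation list (including the single noisy verb + noun observation) are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Pattern = Tuple[str, ...]


def build_rules(
    observations: Iterable[Tuple[str, Pattern]], min_count: int = 2
) -> Dict[str, Set[Pattern]]:
    """Keep, per phrase type, the tag patterns observed at least min_count times."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for phrase_type, pattern in observations:
        counts[phrase_type][pattern] += 1
    return {
        phrase_type: {p for p, c in counter.items() if c >= min_count}
        for phrase_type, counter in counts.items()
    }


# Illustrative observations: (phrase type, tag pattern seen in one sample).
observations = [
    ("modifier-head", ("a", "u", "n")),
    ("modifier-head", ("a", "u", "n")),
    ("modifier-head", ("a", "v")),
    ("modifier-head", ("a", "v")),
    ("modifier-head", ("v", "n")),  # seen once, treated as noise
    ("subject-predicate", ("n", "a")),
    ("subject-predicate", ("n", "a")),
]

rules = build_rules(observations)
print(sorted(rules["modifier-head"]))  # [('a', 'u', 'n'), ('a', 'v')]
```

A trained statistical model would replace the frequency threshold, but the output shape (a set of extraction features per phrase type) would stay the same.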
Phrases of the corresponding type are then extracted from the unstructured text according to the phrase extraction rule of each phrase type, which covers features such as part-of-speech combination features, word position features, and word connection features.
A plurality of extraction features can be obtained according to different training texts, and a plurality of phrases in the same phrase type can be extracted through the extraction features, so that the extracted phrases are richer, the requirements of a user can be met, and the use experience of the user is further improved.
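Applying several extraction features to a segmented text can be sketched as a sliding-window match over the part-of-speech tags. The sketch assumes each feature is a tuple of tags and that the text is already segmented and tagged. `extract` is a hypothetical name, and the joined English output is only a stand-in for the original phrases.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def extract(tokens: List[Token], patterns: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return every span whose tag sequence equals one of the patterns."""
    tags = [pos for _, pos in tokens]
    found = []
    for pattern in patterns:
        n = len(pattern)
        for start in range(len(tokens) - n + 1):
            if tuple(tags[start:start + n]) == pattern:
                found.append(" ".join(word for word, _ in tokens[start:start + n]))
    return found


tokens = [("trees", "n"), ("strong and green", "a"), ("green shade", "n"), ("cool", "a")]

# A noun + adjective feature, as in the subject-predicate examples.
print(extract(tokens, [("n", "a")]))  # ['trees strong and green', 'green shade cool']
```

Because every pattern is tried at every position, one phrase type with several features yields several kinds of phrases from the same text.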
Preferably, the device further comprises:
a word obtaining module 400, configured to obtain a basic word;
the phrase extraction module 300 is further configured to extract a phrase including the basic word from the unstructured text according to the phrase extraction rule and the basic word.
Specifically, when a user needs to search the unstructured text for a phrase containing a certain word, the word (i.e., the basic word) that the user wants to contain can be input, and then the phrase containing the basic word can be extracted from the unstructured text according to the generated phrase extraction rule of each phrase type and the input basic word.
For example, if the user needs to extract modifier-head phrases containing "mountain", then "mountain" is the basic word, and phrases containing "mountain" are extracted from the unstructured text according to the phrase extraction rule for modifier-head phrases, such as "beautiful mountain", "high mountain", "tall mountain", "deep mountain", and "great mountain".
Through the basic words and the generated phrase extraction rules, the user can be helped to search the phrases containing the basic words, and then the user is helped to complete answers to various question types, so that the learning of the user is assisted, and the use experience of the user is improved.
Preferably, the phrase extraction module 300 includes:
a basic word searching unit 310, configured to find the basic word in the unstructured text;
a target word searching unit 320, configured to find, based on the basic word, a target word that meets the extraction characteristics from the unstructured text according to the phrase extraction rule and the part of speech of the basic word;
a word combining unit 330, configured to combine the basic word and the target word to obtain a phrase including the basic word.
Specifically, when searching for a phrase containing a basic word in an unstructured text, the basic word is searched in the unstructured text, then a target word which meets the extraction characteristics is found in the unstructured text based on the basic word and according to a phrase extraction rule corresponding to the type of the phrase to be extracted, and then the basic word and the target word are combined to obtain the phrase containing the basic word.
For example, the user needs to extract a modifier-head phrase containing "mountain", which is a noun, and the extraction features of the modifier-head phrase include:
First: an adjective + verb phrase combination, with the adjective preceding and the verb following;
Second: an adjective + noun phrase combination, with the adjective preceding and the noun following.
Since "mountain" is a noun, phrases containing "mountain" should be extracted from the unstructured text according to the second extraction feature. According to the second extraction feature, the adjective in front of "mountain" is searched for starting from "mountain", and that adjective is the target word. Finally, the adjective and "mountain" are combined according to the word position feature, that is, adjective first and noun second, to obtain a phrase containing "mountain".
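The three units (basic word search, target word search, word combination) can be sketched as follows. This is a simplified illustration that assumes a pre-tagged text and tag-tuple extraction features. The function name `phrases_with_base` and the sample tokens are hypothetical.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def phrases_with_base(
    tokens: List[Token], patterns: Sequence[Tuple[str, ...]], base_word: str
) -> List[str]:
    """Find phrases that match a pattern and contain the basic word."""
    results = []
    for i, (word, pos) in enumerate(tokens):
        if word != base_word:  # basic word search
            continue
        for pattern in patterns:
            for offset, tag in enumerate(pattern):
                if tag != pos:  # the pattern slot must fit the basic word's tag
                    continue
                start, end = i - offset, i - offset + len(pattern)
                if start < 0 or end > len(tokens):
                    continue
                window = tokens[start:end]
                # Target word search: the neighbours must carry the required tags.
                if all(t[1] == p for t, p in zip(window, pattern)):
                    # Word combination, in the order given by the position feature.
                    results.append(" ".join(t[0] for t in window))
    return results


tokens = [
    ("beautiful", "a"), ("mountain", "n"), ("and", "c"),
    ("high", "a"), ("mountain", "n"), ("carefully", "a"), ("check", "v"),
]
features = [("a", "v"), ("a", "n")]  # the two modifier-head features above

print(phrases_with_base(tokens, features, "mountain"))  # ['beautiful mountain', 'high mountain']
```

The adjective + verb feature is skipped automatically because it has no noun slot for "mountain", which matches the reasoning in the example.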
Through the basic words and the generated phrase extraction rules, the user can be helped to search the phrases containing the basic words, and then the user is helped to complete answers to various question types, so that the learning of the user is assisted, and the use experience of the user is improved.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention. Those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A phrase extraction method for unstructured text, comprising:
generating phrase extraction rules for each phrase type;
acquiring an unstructured text;
and extracting phrases from the unstructured text according to the phrase extraction rules.
2. The method of claim 1, wherein the generating phrase extraction rules for each phrase type specifically comprises:
establishing a phrase type library, wherein the phrase type library comprises a plurality of phrase types;
acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
and generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type.
3. The method according to claim 2, wherein the generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type specifically comprises:
segmenting each training text in the training sample set to obtain each word corresponding to each training text, the part of speech of the word and the position sequence of the word;
analyzing and obtaining phrase extraction characteristics corresponding to each phrase type according to the extracted phrases of each training text, wherein the phrase extraction characteristics comprise part-of-speech combination characteristics and word position characteristics;
and generating a phrase extraction rule corresponding to each phrase type according to the obtained extraction features by using a machine learning method.
4. A phrase extraction method for unstructured text according to any of claims 1-3, characterized in that it further comprises:
acquiring basic words;
the extracting phrases from the unstructured text according to the phrase extraction rule specifically includes:
and extracting phrases containing the basic words from the unstructured text according to the phrase extraction rules and the basic words.
5. The method according to claim 4, wherein the extracting a phrase containing the basic word from the unstructured text according to the phrase extraction rule and the basic word specifically comprises:
finding the base word in the unstructured text;
on the basis of the basic words, finding out target words which accord with extraction characteristics from the unstructured text according to the phrase extraction rules and the parts of speech of the basic words;
and combining the basic words and the target words to obtain phrases containing the basic words.
6. A phrase extraction apparatus for unstructured text, comprising:
the rule generating module is used for generating phrase extraction rules of each phrase type;
the text acquisition module is used for acquiring the unstructured text;
and the phrase extraction module is used for extracting phrases from the unstructured text according to the phrase extraction rules.
7. The apparatus of claim 6, wherein the rule generating module comprises:
the phrase library establishing unit is used for establishing a phrase type library, and the phrase type library comprises a plurality of phrase types;
the sample set acquisition unit is used for acquiring a training sample set of each phrase type, wherein the training sample set comprises a training text and extracted phrases;
and the rule generating unit is used for generating a phrase extraction rule corresponding to each phrase type according to the training sample set of each phrase type.
8. The apparatus of claim 7, wherein the rule generating unit comprises:
the word segmentation subunit is used for performing word segmentation on each training text in the training sample set to obtain each word corresponding to each training text, the part of speech of the word and the position sequence of the word;
the feature analysis subunit is used for analyzing and obtaining phrase extraction features corresponding to each phrase type according to the extracted phrases of each training text, wherein the phrase extraction features comprise part-of-speech combination features and word position features;
and the rule generating subunit is used for generating a phrase extraction rule corresponding to each phrase type according to the obtained extraction features by using a machine learning method.
9. A phrase extraction apparatus for unstructured text according to any of claims 6-8, characterized by further comprising:
the word acquisition module is used for acquiring basic words;
the phrase extraction module is further configured to extract phrases containing the basic words from the unstructured text according to the phrase extraction rules and the basic words.
10. A phrase extraction apparatus for unstructured text as defined in claim 9, wherein the phrase extraction module comprises:
a basic word searching unit, configured to find the basic word in the unstructured text;
the target word searching unit is used for searching a target word which accords with the extraction characteristics from the unstructured text on the basis of the basic word according to the phrase extraction rule and the part of speech of the basic word;
and the word combination unit is used for combining the basic words and the target words to obtain phrases containing the basic words.
CN201910365420.2A 2019-04-30 2019-04-30 Phrase extraction method and device for unstructured text Pending CN111950271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910365420.2A CN111950271A (en) 2019-04-30 2019-04-30 Phrase extraction method and device for unstructured text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910365420.2A CN111950271A (en) 2019-04-30 2019-04-30 Phrase extraction method and device for unstructured text

Publications (1)

Publication Number Publication Date
CN111950271A true CN111950271A (en) 2020-11-17

Family

ID=73335898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910365420.2A Pending CN111950271A (en) 2019-04-30 2019-04-30 Phrase extraction method and device for unstructured text

Country Status (1)

Country Link
CN (1) CN111950271A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100145678A1 (en) * 2008-11-06 2010-06-10 University Of North Texas Method, System and Apparatus for Automatic Keyword Extraction
CN107463548A (en) * 2016-06-02 2017-12-12 阿里巴巴集团控股有限公司 Short phrase picking method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘荣: "面向教育领域的固定短语提取方法研究", 《中国博士学位论文全文数据库哲学与人文科技辑(月刊)》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201117