CN112580349B - Phrase extraction method and device and electronic equipment

Info

Publication number: CN112580349B
Application number: CN202011558253.2A
Authority: CN
Inventors: 李雪婷; 简仁贤; 吴文杰; 刘影
Original assignee: Emotibot Technologies Ltd
Current assignee: Emotibot Technologies Ltd
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2023-09-29
Anticipated expiration: 2040-12-24
Also published as: CN112580349A

Abstract

The application provides a phrase extraction method, a phrase extraction device and electronic equipment. The method comprises the following steps: acquiring a sentence to be processed; sequentially performing word segmentation, part-of-speech tagging and dependency syntax processing on the sentence to generate dependency relation labels among different words and a part-of-speech label for each word; judging, according to these labels, whether the core relation word carrying the core relation label is a verb; if the core relation word is a verb, searching for a target word forming a specified dependency relationship with the core relation word; and determining, according to the label information of the target word, whether to merge and output the core relation word and the target word. With this scheme, a computer can extract phrases automatically by applying rules based on the part of speech and dependency relationships of the segmented words, which improves the efficiency and accuracy of phrase extraction.

Description

Phrase extraction method and device and electronic equipment

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a phrase extraction method and apparatus, and an electronic device.

Background

The recognition and analysis of basic phrases is one of the important tasks of shallow parsing in natural language processing. The result of basic phrase analysis can simplify the structure of a sentence and reduce the complexity of syntactic analysis. As a highly deterministic partial analysis result, basic phrase analysis also resolves most local structural ambiguities, thus laying the foundation for deeper chunk analysis and complete syntactic analysis. For example, in the field of natural language processing, Chinese phrase extraction is of great help for coarse-grained word segmentation, keyword extraction, information extraction, and the like.

At present, existing Chinese phrase extraction methods mainly start from a training corpus. Such methods are labor-intensive, and their accuracy tends to reach a plateau beyond which it is difficult to improve.

Disclosure of Invention

The embodiment of the application provides a phrase extraction method which is used for reducing labor cost and improving extraction efficiency.

The embodiment of the application provides a phrase extraction method, which comprises the following steps:

acquiring sentences to be processed;

sequentially performing word segmentation, part-of-speech tagging and dependency syntax processing on the sentence to be processed to generate dependency relation labels among different words and part-of-speech labels of each word;

judging whether the core relation word with the core relation tag is a verb or not according to the dependency relation tag among different words and the part-of-speech tag of each word;

if the core relation word is a verb, searching a target word forming a specified dependency relationship with the core relation word;

and determining whether to combine and output the core relation word and the target word according to the label information of the target word.

In one embodiment, the searching for a target word that constitutes a specified dependency with the core relationship word includes:

searching, according to the dependency relation labels among different words, for a target word having an adverbial (ADV) relationship with the core relation word;

and the determining whether to combine and output the core relation word and the target word according to the label information of the target word includes:

if the part of speech of the target word is an adverb and the target word is adjacent to the core relation word, merging and outputting the core relation word and the target word.

In an embodiment, if the part of speech of the target word is an adverb and is adjacent to the core relation word, the core relation word and the target word are combined and output, including:

if the part of speech of the target word is an adverb and is adjacent to the core relation word, judging whether a child node exists in the target word or not;

if the target word has a child node, merging and outputting the core relation word, the target word and the words corresponding to the child node.

In an embodiment, the determining whether to merge and output the core relation word and the target word according to the label information of the target word includes:

if the part of speech of the target word is an adverb and is not adjacent to the core relation word, judging whether a child node exists in the target word or not;

if the target word has the child node, merging and outputting the target word and the vocabulary corresponding to the child node.

if the part of speech of the target word is a preposition, judging the part of speech of the object forming a preposition-object (POB) relationship with the target word; and if the object is a verb, taking the object as a core word, and determining whether to merge and output the core word and a word forming a specified dependency relationship with the core word according to the part of speech of that word.

searching, according to the dependency relation labels among different words, for a target word having a verb-complement (CMP) relationship with the core relation word;

and if the part of speech of the target word is an adjective, merging and outputting the core relation word and the target word.

In one embodiment, the merging and outputting the core relation word and the target word if the part of speech of the target word is an adjective includes:

if the part of speech of the target word is an adjective, judging whether the target word has a child node;

if the part of speech of the target word is a verb, taking the target word as a core word, and determining whether to merge and output the core word and a word forming a specified dependency relationship with the core word according to the part of speech of that word.

searching for a target word having a subject-verb (SBV), verb-object (VOB) or fronting-object (FOB) relationship with the core relation word;

if the part of speech of the target word is a preposition, judging the part of speech of the object forming a preposition-object (POB) relationship with the target word; and if the object is a verb, taking the object as a core word, and determining whether to merge and output the core word and a word forming a specified dependency relationship with the core word according to the part of speech of that word.

if the part of speech of the target word is a noun, pronoun or numeral, judging whether the target word has a child node;

searching for a target word having an indirect-object (IOB) or pivotal (DBL) relationship with the core relation word;

judging whether the target word has a child node according to the label information of the target word;

In one embodiment, the method further comprises: outputting the part of speech corresponding to the phrase while outputting the phrase.

The embodiment of the application also provides a phrase extraction device, which comprises:

the sentence acquisition module is used for acquiring sentences to be processed;

the tag generation module is used for sequentially carrying out word segmentation, part-of-speech tagging and dependency syntax processing on the sentence to be processed to generate dependency relationship tags among different words and part-of-speech tags of each word;

the part-of-speech judging module is used for judging whether the core relation word with the core relation tag is a verb or not according to the dependency relation tag among different words and the part-of-speech tag of each word;

the target word searching module is used for searching target words forming specified dependency relationship with the core relationship words when the core relationship words are verbs;

and the merging judgment module is used for determining whether to merge and output the core relation word and the target word according to the label information of the target word.

The embodiment of the application provides electronic equipment, which comprises:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the phrase extraction method described above.

Embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a processor to perform the phrase extraction method described above.

According to the technical solution provided by the embodiments of the present application, word segmentation, part-of-speech tagging and dependency syntax processing are performed on the sentence to be processed to generate dependency relation labels among different words and a part-of-speech label for each word, and it is then judged whether the core relation word carrying the core relation label is a verb. When the core relation word is a verb, a target word forming a specified dependency relationship with the core relation word is searched for, and whether to merge and output the core relation word and the target word is determined according to the label information of the target word. Phrase extraction therefore requires no manual intervention: the computer extracts phrases automatically by applying rules based on the part of speech and dependency relationships of the segmented words, which improves the efficiency and accuracy of phrase extraction.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

Fig. 2 is a flowchart illustrating a phrase extraction method according to an embodiment of the present application;

Fig. 3 is a detailed flowchart of a phrase extraction method according to an embodiment of the present application;

Fig. 4 is a block diagram of a phrase extraction apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.

Like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.

Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 100 may be used to perform the phrase extraction method provided by the embodiments of the present application. As shown in Fig. 1, the electronic device 100 includes one or more processors 102 and one or more memories 104 storing processor-executable instructions. The processor 102 is configured to perform the phrase extraction method provided by the embodiments of the application described below.

The processor 102 may be a gateway, an intelligent terminal, or a device comprising a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or another form of processing unit having data processing capabilities and/or instruction execution capabilities. It may process data from other components in the electronic device 100 and may control other components in the electronic device 100 to perform desired functions.

The memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 102 to implement the phrase extraction method described below. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.

In one embodiment, the electronic device 100 shown in FIG. 1 may also include an input device 106, an output device 108, and a data acquisition device 110, which are interconnected by a bus system 112 and/or other forms of connection mechanisms (not shown). It should be noted that the components and structures of the electronic device 100 shown in fig. 1 are exemplary only and not limiting, as the electronic device 100 may have other components and structures as desired.

The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, mouse, microphone, touch screen, and the like. The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like. The data acquisition device 110 may acquire images of the subject and store the acquired images in the memory 104 for use by other components. The data acquisition device 110 may be a camera, for example.

In an embodiment, the components in the example electronic apparatus 100 for implementing the phrase extraction method according to the embodiment of the present application may be integrally disposed, or may be separately disposed, such as integrally disposing the processor 102, the memory 104, the input device 106, and the output device 108, and separately disposing the data acquisition device 110.

In an embodiment, the example electronic device 100 for implementing the phrase extraction method of the embodiments of the present application may be implemented as a smart terminal such as a smart phone, tablet, desktop computer, or the like.

Fig. 2 is a schematic flow chart of a phrase extraction method according to an embodiment of the present application. The method may be performed by an electronic device such as a computer. As shown in fig. 2, the method includes the following steps S210 to S250.

Step S210: acquire a sentence to be processed.

A phrase, also called a word group, is a linguistic unit without sentence intonation, formed by combining linguistic units that can collocate at the three levels of syntax, semantics and pragmatics. It is a syntactic unit that is larger than a word but is not a sentence. With the scheme provided by the embodiments of the application, phrases can be extracted from the sentence to be processed.

For example, the sentence to be processed may be: "On the 25th, Vice Minister Wang of the Logistics Department summoned Ambassador Li of a certain country to China, lodged solemn representations and a strong protest regarding a certain organization's passing of the so-called 'Certain Region Economy and Trade Act', and urged a certain party to correct its error immediately." The extracted phrases may be "Vice Minister Wang of the Logistics Department", "Ambassador Li of a certain country to China", and so on.

Step S220: sequentially perform word segmentation, part-of-speech tagging and dependency syntax processing on the sentence to be processed to generate dependency relation labels among different words and a part-of-speech label for each word.

In one embodiment, the ICTCLAS word segmenter may be used to segment the sentence to be processed into individual words. The ICTCLAS word segmenter may then be used to tag the part of speech of each segmented word, that is, to mark each word with a corresponding part-of-speech label: the noun label is n, the time word label is nt, the numeral label is m, the measure word label is q, the pronoun label is r, the verb label is v, the adjective label is a, the adverb label is d, the preposition label is p, the person name label is nr, the place name label is ns, the organization name label is ni, the punctuation label is w, and so on.

For example, for the sentence to be processed listed above, the following can be obtained (each word is followed by its part-of-speech label): "Logistics department_ni / vice minister_n / Wang_nh / 25th_nt / summon_v / some country_ns / stationed in_n / China_ns / ambassador_n / Li_nh / ,_w / regarding_p / some organization_ni / pass_v / so-called_a / "_w / some region_ns / economy_n / and_c / trade_n / act_n / "_w / lodge_v / solemn_a / representations_n / and_c / strong_a / protest_n / ,_w / urge_v / some party_n / immediately_d / correct_v / error_n / ._w".

After the part-of-speech label of each word is obtained, the dependency relationships between different words can be analyzed with an existing dependency syntax processing tool (such as LTP), and each dependency relationship is marked with the corresponding label. As shown in Table 1 below, the dependency relationships between words include the subject-verb relationship, the verb-object relationship, the indirect-object relationship, and so on.

Table 1. Labels of dependency relationships

Relationship type	Label
Subject-verb relationship	SBV
Verb-object relationship	VOB
Indirect-object relationship	IOB
Fronting object	FOB
Pivotal (double) structure	DBL
Attributive relationship	ATT
Adverbial structure	ADV
Verb-complement structure	CMP
Coordinate relationship	COO
Preposition-object relationship	POB
Left adjunct relationship	LAD
Right adjunct relationship	RAD
Punctuation	WP
Core relationship	HED

For the sentence to be processed listed above, the part-of-speech labels and dependency labels shown in Table 2 below can be obtained.

Table 2. Part of speech and dependency label of each word in the sentence to be processed

Sequence number	Word	Part of speech	Parent node	Dependency relation
1	Logistics department	ni	2	ATT
2	vice minister	n	3	ATT
3	Wang	nh	5	SBV
4	25th	nt	5	ADV
5	summon	v	0	HED
6	some country	ns	9	ATT
7	stationed in	n	9	ATT
8	China	ns	7	VOB
9	ambassador	n	10	ATT
10	Li	nh	5	VOB
11	，	wp	5	WP
12	regarding	p	23	ADV
13	some organization	ni	14	SBV
14	pass	v	12	POB
15	so-called	a	21	ATT
16	“	wp	21	WP
17	some region	ns	21	ATT
18	economy	n	21	ATT
19	and	c	20	LAD
20	trade	n	18	COO
21	act	n	14	VOB
22	”	wp	21	WP
23	lodge	v	5	COO
24	solemn	a	25	ATT
25	representations	n	23	VOB
26	and	c	28	LAD
27	strong	a	28	ATT
28	protest	n	25	COO
29	，	wp	23	WP
30	urge	v	23	COO
31	some party	n	30	DBL
32	immediately	d	34	ADV
33	correct	v	30	VOB
34	error	n	33	VOB
35	。	wp	30	WP

As shown in Table 2 above, word No. 1 ("Logistics department") is an organization name, word No. 2 ("vice minister") is a noun, and there is an attributive (ATT) relationship between the two. The word carrying the core relationship (HED) is word No. 5 ("summon").
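As a rough illustration of how the output of step S220 might be held in memory, the sketch below stores the first ten rows of Table 2 as records and indexes the direct children of each node. The `Token` class, its field names, the `build_children` helper and the English glosses of the words are assumptions made for illustration; they are not part of the described method or of the ICTCLAS or LTP tools.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # sequence number (1-based), as in Table 2
    word: str   # surface form (an approximate English gloss here)
    pos: str    # part-of-speech label
    head: int   # sequence number of the parent node; 0 marks the root
    rel: str    # dependency label from Table 1


# First ten rows of Table 2 (glosses are illustrative).
ROWS = [
    (1, "Logistics department", "ni", 2, "ATT"),
    (2, "vice minister", "n", 3, "ATT"),
    (3, "Wang", "nh", 5, "SBV"),
    (4, "25th", "nt", 5, "ADV"),
    (5, "summon", "v", 0, "HED"),
    (6, "some country", "ns", 9, "ATT"),
    (7, "stationed in", "n", 9, "ATT"),
    (8, "China", "ns", 7, "VOB"),
    (9, "ambassador", "n", 10, "ATT"),
    (10, "Li", "nh", 5, "VOB"),
]
tokens = [Token(*row) for row in ROWS]


def build_children(tokens):
    """Map each parent sequence number to the list of its direct children."""
    children = defaultdict(list)
    for tok in tokens:
        children[tok.head].append(tok.idx)
    return children


children = build_children(tokens)
print(children[5])  # direct children of the HED node among these rows
```

Keeping the parent index of every word is enough to recover the whole dependency tree, which the later steps traverse.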

Step S230: judge whether the core relation word carrying the core relation label is a verb according to the dependency relation labels among different words and the part-of-speech label of each word.

For example, in Table 2 above, the core relation word carrying the core relation label (also referred to as the HED node) is word No. 5 ("summon"); the core relation word can be regarded as the core of the sentence to be processed. Whether "summon" is a verb can be judged from its part-of-speech label. If the HED node is not a verb, the sentence to be processed may be placed into an invalid sentence subset. The embodiments of the application are mainly directed at sentences whose sentence core is a verb.
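The check in step S230 can be sketched as follows. This is a minimal illustration, assuming each word is a `(sequence number, word, part of speech, parent, relation)` tuple and that verbs carry the label `v` as stated above; the function names `find_core` and `route_sentence` and both sample sentences are hypothetical.

```python
def find_core(rows):
    """Return (idx, word, pos) of the word labelled HED, or None if absent."""
    for idx, word, pos, _head, rel in rows:
        if rel == "HED":
            return idx, word, pos
    return None


def route_sentence(rows, valid, invalid):
    """Keep verb-headed sentences; put all others into the invalid subset."""
    core = find_core(rows)
    if core is not None and core[2] == "v":
        valid.append(rows)
        return core
    invalid.append(rows)
    return None


# Two tiny hypothetical parses: one headed by a verb, one by a noun.
verb_headed = [(1, "Wang", "nh", 2, "SBV"), (2, "summon", "v", 0, "HED")]
noun_headed = [(1, "strong", "a", 2, "ATT"), (2, "protest", "n", 0, "HED")]

valid, invalid = [], []
core_a = route_sentence(verb_headed, valid, invalid)
core_b = route_sentence(noun_headed, valid, invalid)
print(core_a, core_b, len(valid), len(invalid))
```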

Step S240: if the core relation word is a verb, search for a target word forming a specified dependency relationship with the core relation word.

In one embodiment, the target words forming a specified dependency relationship with the core relation word may be words that have an SBV, VOB, IOB, FOB, DBL, ADV, CMP, COO or POB relationship with the core relation word. These words may be regarded as the first-level child nodes of the HED node and may be referred to as the SBV node, VOB node, IOB node, FOB node, DBL node, ADV node, CMP node, COO node and POB node respectively. For the purpose of distinction, a word found to have any one of these relationships with the core relation word is called a target word.

In Table 2 above, for example, word No. 5 ("summon") has an SBV relationship with word No. 3 ("Wang"), an ADV relationship with word No. 4 ("25th"), a VOB relationship with word No. 10 ("Li"), and a COO relationship with word No. 23 ("lodge"). The target words are therefore "Wang" (the SBV node), "25th" (the ADV node), "Li" (the VOB node) and "lodge" (the COO node).
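The search in step S240 amounts to filtering the direct children of the core relation word by dependency label. The sketch below shows this on the rows of Table 2 whose parent is word No. 5; the tuple layout, the name `find_targets` and the English glosses are assumptions for illustration.

```python
# Dependency labels that qualify a child of the core word as a target word.
SPECIFIED = {"SBV", "VOB", "IOB", "FOB", "DBL", "ADV", "CMP", "COO", "POB"}

# Word No. 5 and its direct children from Table 2:
# (sequence number, word gloss, part of speech, parent node, relation).
ROWS = [
    (3, "Wang", "nh", 5, "SBV"),
    (4, "25th", "nt", 5, "ADV"),
    (5, "summon", "v", 0, "HED"),
    (10, "Li", "nh", 5, "VOB"),
    (11, "，", "wp", 5, "WP"),
    (23, "lodge", "v", 5, "COO"),
]


def find_targets(rows, core_idx):
    """Return the rows attached to core_idx by one of the specified labels."""
    return [r for r in rows if r[3] == core_idx and r[4] in SPECIFIED]


targets = find_targets(ROWS, 5)
print([(r[0], r[4]) for r in targets])
```

The punctuation mark (label WP) is also a direct child of word No. 5 but is not a target word, because WP is not among the specified relationships.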

Step S250: determine whether to merge and output the core relation word and the target word according to the label information of the target word.

The label information of the target word may include the part-of-speech label of the target word and its dependency labels with other words. Merging and outputting means combining the words and outputting them together as one phrase.

In one embodiment, the computer may search, according to the dependency labels among different words, for a target word having an adverbial (ADV) relationship with the core relation word; according to the label information of the target word, if the part of speech of the target word is an adverb and the target word is adjacent to the core relation word, the core relation word and the target word are merged and output. In an embodiment, it may further be judged whether the target word has a child node; if the target word has a child node, the core relation word, the target word and the words corresponding to the child node are merged and output.

When a first word modifies a second word, the first word may be regarded as a child node of the second word. Taking the sentence to be processed listed above as an example, it can be seen from Table 2 that word No. 5 ("summon") is the core relation node, and its child nodes are "Wang", "25th", "Li" and "lodge". The child node of "Wang" is "vice minister", and the child node of "vice minister" is "Logistics department". "Logistics department" may therefore also be regarded as a child node of "Wang".
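Because child nodes of child nodes also count, merging a word with its child nodes means collecting its whole subtree and emitting the words in sentence order. A minimal sketch on rows 1 to 5 of Table 2 follows; the helper names, the English glosses and the choice of a space as the joining character are assumptions (Chinese output would normally be concatenated without spaces).

```python
# Rows 1-5 of Table 2: (sequence number, gloss, part of speech, parent, relation).
ROWS = [
    (1, "Logistics department", "ni", 2, "ATT"),
    (2, "vice minister", "n", 3, "ATT"),
    (3, "Wang", "nh", 5, "SBV"),
    (4, "25th", "nt", 5, "ADV"),
    (5, "summon", "v", 0, "HED"),
]


def build_children(rows):
    """Map each parent sequence number to its direct children."""
    children = {}
    for idx, _word, _pos, head, _rel in rows:
        children.setdefault(head, []).append(idx)
    return children


def descendants(children, idx):
    """Collect every node below idx, at any depth, in ascending order."""
    found, stack = [], list(children.get(idx, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


def merge_with_descendants(rows, idx):
    """Join a word and all words below it in their original sentence order."""
    words = {r[0]: r[1] for r in rows}
    span = sorted([idx] + descendants(build_children(rows), idx))
    return " ".join(words[i] for i in span)


print(descendants(build_children(ROWS), 3))
print(merge_with_descendants(ROWS, 3))
```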

In one embodiment, the computer may search for a target word having an ADV relationship with the core relation word; if the part of speech of the target word is an adverb and the target word is immediately adjacent to the core relation word, the two are merged and output. If the target word has child nodes, the core relation word, the target word and all child nodes of the target word are merged and output. If the target word has no child node, only the core relation word and the target word are merged. If the part of speech of the target word is an adverb but the target word is not immediately adjacent to the core relation word, the two are not merged.
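The adverbial rule can be sketched as a single function. The example sentences below are hypothetical, the adverb label `d` follows the tag list given earlier, and "adjacent" is assumed here to mean that the sequence numbers of the two words differ by exactly one; the function name `merge_adverbial` is likewise an assumption.

```python
def merge_adverbial(rows, core_idx, target_idx):
    """Return the merged phrase for an adjacent adverb ADV node, else None."""
    by_idx = {row[0]: row for row in rows}
    _, _, pos, head, rel = by_idx[target_idx]
    if head != core_idx or rel != "ADV" or pos != "d":
        return None
    if abs(target_idx - core_idx) != 1:
        return None  # adverb not adjacent to the core word: no merge
    # Collect the core word, the target word and everything below the target.
    span = {core_idx, target_idx}
    frontier = [target_idx]
    while frontier:
        node = frontier.pop()
        for row in rows:
            if row[3] == node and row[0] not in span:
                span.add(row[0])
                frontier.append(row[0])
    return " ".join(by_idx[i][1] for i in sorted(span))


# Hypothetical parses: (sequence number, word, part of speech, parent, relation).
ADJACENT = [
    (1, "he", "r", 4, "SBV"),
    (2, "very", "d", 3, "ADV"),
    (3, "quickly", "d", 4, "ADV"),
    (4, "runs", "v", 0, "HED"),
]
SEPARATED = [
    (1, "often", "d", 3, "ADV"),
    (2, "he", "r", 3, "SBV"),
    (3, "runs", "v", 0, "HED"),
]

print(merge_adverbial(ADJACENT, 4, 3))   # very quickly runs
print(merge_adverbial(SEPARATED, 3, 1))  # None
```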

In one embodiment, the computer may search for a target word having an ADV relationship with the core relation word. If the part of speech of the target word is a preposition, the computer judges the part of speech of the object forming a preposition-object (POB) relationship with the target word; if the object is a verb, the object is taken as a core word, and whether to merge and output the core word and a word forming a specified dependency relationship with the core word is determined according to the part of speech of that word.

In this embodiment, the core word is the object of the target word, and this object is a verb. The object can serve as a sentence core in the same way as the core relation word, and the same method can be used to search for words forming a specified dependency relationship with it. These include words having an SBV, VOB, IOB, FOB, DBL, ADV, CMP, COO or POB relationship with the core word; they may be regarded as child nodes of the core word, namely the SBV node word, VOB node word, IOB node word, FOB node word, DBL node word, ADV node word, CMP node word, COO node word and POB node word.
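The promotion of a verb object to a new core word can be illustrated with words No. 12 to 14, 21 and 23 of Table 2, where the preposition No. 12 governs the verb No. 14 through a POB relationship. The sketch below is only an illustration: the function name `promote_prepositional_object`, the tuple layout and the English glosses are assumptions.

```python
SPECIFIED = {"SBV", "VOB", "IOB", "FOB", "DBL", "ADV", "CMP", "COO", "POB"}

# Selected rows of Table 2: (sequence number, gloss, part of speech, parent, relation).
ROWS = [
    (12, "regarding", "p", 23, "ADV"),
    (13, "some organization", "ni", 14, "SBV"),
    (14, "pass", "v", 12, "POB"),
    (21, "act", "n", 14, "VOB"),
    (23, "lodge", "v", 5, "COO"),
]


def promote_prepositional_object(rows, prep_idx):
    """If the POB object of the preposition is a verb, treat it as a new core.

    Returns (new_core_idx, target_indices), or None when the preposition has
    no verb object.
    """
    for idx, _word, pos, head, rel in rows:
        if head == prep_idx and rel == "POB" and pos == "v":
            targets = [r[0] for r in rows if r[3] == idx and r[4] in SPECIFIED]
            return idx, targets
    return None


print(promote_prepositional_object(ROWS, 12))
```

Here word No. 14 becomes the new core word, and its SBV and VOB children (No. 13 and No. 21) become the target words for the next round of rules.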

In addition to the adverbial relationship described above, the computer can also search, according to the dependency relation labels among different words, for a target word having a verb-complement (CMP) relationship with the core relation word. According to the label information (including the part-of-speech label) of the target word, if the part of speech of the target word is an adjective, the core relation word and the target word are merged and output. In that case it is further judged whether the target word has a child node; if the target word has a child node, the core relation word, the target word and the words corresponding to the child node are merged and output. If there is no child node, only the core relation word and the target word are merged and output.

In other embodiments, the computer searches for a target word having a verb-complement relationship with the core relation word; if the part of speech of the target word is a verb, the computer may take the target word as a core word and determine whether to merge and output the core word and a word forming a specified dependency relationship with the core word according to the part of speech of that word.
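The two verb-complement cases (adjective complement and verb complement) can be sketched as one function that either returns a merged phrase or names the complement as the new core word. Both example sentences are hypothetical, and the function name `handle_complement` and its return convention are assumptions made for illustration.

```python
def handle_complement(rows, core_idx, target_idx):
    """Apply the CMP rule: merge an adjective, or promote a verb to core."""
    by_idx = {row[0]: row for row in rows}
    _, _, pos, head, rel = by_idx[target_idx]
    if head != core_idx or rel != "CMP":
        return ("skip", None)
    if pos == "v":
        return ("new_core", target_idx)  # rules are re-applied below this verb
    if pos != "a":
        return ("skip", None)
    # Adjective complement: core word + target word + everything below target.
    span = {core_idx, target_idx}
    frontier = [target_idx]
    while frontier:
        node = frontier.pop()
        for row in rows:
            if row[3] == node and row[0] not in span:
                span.add(row[0])
                frontier.append(row[0])
    return ("merge", " ".join(by_idx[i][1] for i in sorted(span)))


# Hypothetical parses: (sequence number, word, part of speech, parent, relation).
ADJ_COMPLEMENT = [
    (1, "he", "r", 2, "SBV"),
    (2, "washes", "v", 0, "HED"),
    (3, "very", "d", 4, "ADV"),
    (4, "clean", "a", 2, "CMP"),
]
VERB_COMPLEMENT = [
    (1, "he", "r", 2, "SBV"),
    (2, "runs", "v", 0, "HED"),
    (3, "out", "v", 2, "CMP"),
]

print(handle_complement(ADJ_COMPLEMENT, 2, 4))
print(handle_complement(VERB_COMPLEMENT, 2, 3))
```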

In one embodiment, in addition to the adverbial and verb-complement relationships described above, the computer may also search for a target word having a subject-verb (SBV), verb-object (VOB) or fronting-object (FOB) relationship with the core relation word. If the part of speech of the target word is a verb, the target word is taken as a core word, and whether to merge and output the core word and a word forming a specified dependency relationship with the core word is determined according to the part of speech of that word.

If the part of speech of the target word is a noun, pronoun or numeral, the core relation word is not merged with the target word. It is then judged whether the target word has a child node; if the target word has a child node, the target word and the words corresponding to the child node are merged and output.

In an embodiment, the computer may further search for a target word having an indirect-object (IOB) or pivotal (DBL) relationship with the core relation word, and judge whether the target word has a child node according to the label information of the target word; if the target word has a child node, the target word and the words corresponding to the child node are merged and output.
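The rules for SBV, VOB, FOB, IOB and DBL target words can be combined into one recursive procedure. The sketch below is a simplified reading of these paragraphs, not a definitive implementation: it covers only these five relationships, treats every word class other than verb (`v`) and preposition (`p`) alike, outputs a target word without child nodes on its own, and joins words with spaces. The function name `extract`, the tuple layout and the English glosses of rows 1 to 10 of Table 2 are assumptions.

```python
from collections import defaultdict

# Rows 1-10 of Table 2: (sequence number, gloss, part of speech, parent, relation).
ROWS = [
    (1, "Logistics department", "ni", 2, "ATT"),
    (2, "vice minister", "n", 3, "ATT"),
    (3, "Wang", "nh", 5, "SBV"),
    (4, "25th", "nt", 5, "ADV"),
    (5, "summon", "v", 0, "HED"),
    (6, "some country", "ns", 9, "ATT"),
    (7, "stationed in", "n", 9, "ATT"),
    (8, "China", "ns", 7, "VOB"),
    (9, "ambassador", "n", 10, "ATT"),
    (10, "Li", "nh", 5, "VOB"),
]


def extract(rows, core_idx, out=None):
    """Collect (phrase, part of speech) pairs below the verb at core_idx."""
    out = [] if out is None else out
    by_idx = {r[0]: r for r in rows}
    children = defaultdict(list)
    for r in rows:
        children[r[3]].append(r[0])

    def subtree(idx):
        nodes = [idx]
        for child in children[idx]:
            nodes.extend(subtree(child))
        return nodes

    def emit_merged(idx):
        # Merge the node with all nodes below it; keep the node's own POS.
        phrase = " ".join(by_idx[i][1] for i in sorted(subtree(idx)))
        out.append((phrase, by_idx[idx][2]))

    for wanted in ("SBV", "VOB", "IOB", "FOB", "DBL"):
        for child in children[core_idx]:
            _, word, pos, _, rel = by_idx[child]
            if rel != wanted:
                continue
            if wanted in ("IOB", "DBL"):
                emit_merged(child)
            elif pos == "v":
                extract(rows, child, out)      # verb becomes the new core
            elif pos == "p":
                out.append((word, pos))        # output the preposition itself
                for obj in children[child]:
                    if by_idx[obj][4] != "POB":
                        continue
                    if by_idx[obj][2] == "v":
                        extract(rows, obj, out)
                    else:
                        emit_merged(obj)
            else:
                emit_merged(child)             # never merged with the core
    return out


phrases = extract(ROWS, 5)
for phrase, pos in phrases:
    print(phrase, pos)
```

On these rows the sketch yields the subject phrase and the object phrase of word No. 5, each paired with the part of speech of its head word, and the core word itself is not merged into either phrase.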

Fig. 3 is a detailed flowchart of a phrase extraction method according to an embodiment of the present application. As shown in fig. 3, the method comprises the steps of:

Step S301: input a sentence to be processed.

Step S302: perform word segmentation, part-of-speech tagging and dependency syntax processing.

Step S303: judge whether the word at the HED node (the sentence core) is a verb; if so, proceed to the following steps; otherwise, place the sentence into the invalid sentence subset.

Step S304: search the first-level child nodes of the HED node for an SBV node; if one exists, judge the part of speech of the SBV node (for convenience of description, a node is used hereinafter to denote its word directly); if not, proceed to step S305;

if the SBV node is a noun, pronoun or numeral, the SBV node is not merged with the HED node. If the SBV node has child nodes, merge the SBV node with the words of all child nodes under it, keep the part of speech of the SBV node, and output the result; if the SBV node has no child node, keep the part of speech of the SBV node and output the result;

if the SBV node is a preposition, the SBV node is not merged with the HED node. Output the preposition and its part of speech, then judge the part of speech of the POB node that is the child node of the preposition. If the POB node is a verb, take the POB node as the sentence core and re-execute steps S303 to S314 on the sentence in which the POB node is located. If the POB node is of another word class, merge the POB node with all of its child nodes, keep the part of speech of the POB node, and output the result; if the POB node has no child node, output the POB node and its part of speech;

if the SBV node is a verb, the SBV node is not merged with the HED node; take the SBV node as the sentence core and re-execute steps S303 to S314 on the sentence;

if the SBV node is of another word class, the SBV node is not merged with the HED node. If it has child nodes, merge the SBV node with all child nodes under it, keep the part of speech of the SBV node, and output the result; if it has no child node, keep the part of speech of the SBV node and output the result.

Step S305: search for a VOB node; if one exists, judge the part of speech of the VOB node; if not, proceed to step S306;

if the VOB node is a noun, pronoun or numeral, the VOB node is not merged with the HED node. If the VOB node has child nodes, merge the VOB node with all child nodes under it, keep the part of speech of the VOB node, and output the result; if the VOB node has no child node, keep the part of speech of the VOB node and output the result;

if the VOB node is a preposition, the VOB node is not merged with the HED node. Output the preposition and its part of speech, then judge the part of speech of the POB node that is the child node of the preposition. If the POB node is a verb, take the POB node as the sentence core and re-execute steps S303 to S314 on the sentence in which the POB node is located. If the POB node is of another word class, merge the POB node with all of its child nodes, keep the part of speech of the POB node, and output the result; if the POB node has no child node, output the POB node and its part of speech;

if the VOB node is a verb, the VOB node is not merged with the HED node; take the VOB node as the sentence core and re-execute steps S303 to S314 on the sentence;

if the VOB node is of another word class, the VOB node is not merged with the HED node. If it has child nodes, merge the VOB node with all child nodes under it, keep the part of speech of the VOB node, and output the result; if it has no child node, keep the part of speech of the VOB node and output the result.

Step S306: look for an IOB node. If one exists, the IOB node is not merged with the HED node; if not, go to step S307.

If the IOB node has child nodes, the IOB node is merged with all of its child nodes, the part of speech of the IOB node is kept, and the result is output.

If the IOB node has no child nodes, the part of speech of the IOB node is kept and the result is output.
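Merging a node with all child nodes under it, as step S306 does for an IOB node, amounts to collecting the node's subtree and joining the words in sentence order. The following is a minimal sketch; the `Tok` record, the helper names and the toy English tokens are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def subtree(tokens: List[Tok], idx: int) -> List[int]:
    """Indices of a node and of every node below it."""
    found = [idx]
    for t in tokens:
        if t.head == idx:
            found.extend(subtree(tokens, t.idx))
    return found


def merge_with_children(tokens: List[Tok], idx: int) -> str:
    """Join a node and all its descendants in sentence order,
    tagging the phrase with the part of speech of the node itself."""
    by_idx = {t.idx: t for t in tokens}
    words = [by_idx[i].word for i in sorted(subtree(tokens, idx))]
    return " ".join(words) + "_" + by_idx[idx].pos


# Toy parse (hypothetical): "give new students books"
tokens = [
    Tok(1, "give", "v", 0, "HED"),
    Tok(2, "new", "a", 3, "ATT"),
    Tok(3, "students", "n", 1, "IOB"),
    Tok(4, "books", "n", 1, "VOB"),
]
print(merge_with_children(tokens, 3))  # new students_n
print(merge_with_children(tokens, 4))  # books_n
```

A node without child nodes simply yields itself, so the "has child nodes" and "no child nodes" cases need no separate code path here.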

Step S307: look for an FOB node. If one exists, judge the part of speech of the FOB node; if not, go to step S308.

If the FOB node is a noun, pronoun or numeral, the FOB node is not merged with the HED node. If the FOB node has child nodes, the FOB node is merged with all child nodes under it, the part of speech of the FOB node is kept, and the result is output; if the FOB node has no child nodes, the part of speech of the FOB node is kept and the result is output.

If the FOB node is a preposition, the FOB node is not merged with the HED node. The preposition and its part of speech are output, and the part of speech of the preposition's child POB node is judged. If the POB node is a verb, the POB node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence in which the POB node is located. If the POB node is of any other word class, the POB node is merged with all of its child nodes, the part of speech of the POB node is kept, and the result is output; if the POB node has no child nodes, the POB node and its part of speech are output.

If the FOB node is a verb, the FOB node is not merged with the HED node. The FOB node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence.

If the FOB node is of any other word class, the FOB node is not merged with the HED node. If the FOB node has child nodes, the FOB node is merged with all child nodes under it, the part of speech of the FOB node is kept, and the result is output; if it has no child nodes, the part of speech of the FOB node is kept and the result is output.

Step S308: look for a DBL node. If one exists, the DBL node is not merged with the HED node; if not, go to step S309.

If the DBL node has child nodes, the DBL node is merged with all of its child nodes, the part of speech of the DBL node is kept, and the result is output.

If the DBL node has no child nodes, the part of speech of the DBL node is kept and the result is output.

Step S309: look for an ADV node. If one exists, judge the part of speech of the ADV node; if not, go to step S310.

If the ADV node is an adverb, it is determined whether it is next to the HED node. If the ADV node is next to the HED node, it is judged whether the ADV node has child nodes: if so, the ADV node is merged with its child nodes and then with the HED node, the verb part of speech of the HED node is kept, and the result is output; if not, the ADV node is merged with the HED node, the verb part of speech is kept, and the result is output. If the ADV node is not next to the HED node, the ADV node is not merged with the HED node, and it is judged whether the ADV node has child nodes: if so, the ADV node is merged with its child nodes, the part of speech of the ADV node is kept, and the result is output; if not, the part of speech of the ADV node is kept and the result is output.

If the ADV node is a preposition, the ADV node is not merged with the HED node. The preposition and its part of speech are output, and the part of speech of the preposition's child POB node is judged. If the POB node is a verb, the POB node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence in which the POB node is located. If the POB node is of any other word class, the POB node is merged with all of its child nodes, the part of speech of the POB node is kept, and the result is output; if the POB node has no child nodes, the POB node and its part of speech are output.

If the ADV node is of any other word class, the ADV node is not merged with the HED node. If the ADV node has child nodes, the ADV node is merged with all child nodes under it, the part of speech of the ADV node is kept, and the result is output; if the ADV node has no child nodes, the part of speech of the ADV node is kept and the result is output.
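The adverb branch of step S309 depends on whether the ADV node is next to the HED node. A minimal sketch follows, assuming that "next to" means the two token positions differ by exactly one; the `Tok` record, the function names and the toy tokens are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def subtree(tokens: List[Tok], idx: int) -> List[int]:
    """Indices of a node and of every node below it."""
    found = [idx]
    for t in tokens:
        if t.head == idx:
            found.extend(subtree(tokens, t.idx))
    return found


def handle_adverb(tokens: List[Tok], hed_idx: int, adv_idx: int) -> Tuple[str, bool]:
    """Apply the adverb branch to one ADV node.

    Returns (phrase, merged_with_hed). Adjacency is assumed to mean
    that the positions of the ADV node and the HED node differ by one.
    """
    by_idx = {t.idx: t for t in tokens}
    span = subtree(tokens, adv_idx)
    adjacent = abs(adv_idx - hed_idx) == 1
    if adjacent:
        span = span + [hed_idx]       # merge with HED, keep the verb tag
        tag = by_idx[hed_idx].pos
    else:
        tag = by_idx[adv_idx].pos     # stay separate, keep the adverb tag
    phrase = " ".join(by_idx[i].word for i in sorted(span))
    return phrase + "_" + tag, adjacent


# Toy parse (hypothetical)
tokens = [
    Tok(1, "also", "d", 4, "ADV"),
    Tok(2, "he", "r", 4, "SBV"),
    Tok(3, "immediately", "d", 4, "ADV"),
    Tok(4, "corrects", "v", 0, "HED"),
]
print(handle_adverb(tokens, 4, 3))  # ('immediately corrects_v', True)
print(handle_adverb(tokens, 4, 1))  # ('also_d', False)
```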

Step S310: look for a CMP node. If one exists, judge the part of speech of the CMP node; if not, go to step S311.

If the CMP node is a verb, the CMP node is not merged with the HED node; the CMP node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence.

If the CMP node is an adjective, it is judged whether the CMP node has child nodes. If so, the CMP node is merged with all child nodes under it and then with the HED node, the part of speech of the HED node is kept, and the result is output; if the CMP node has no child nodes, the CMP node is merged with the HED node, the part of speech of the HED node is kept, and the result is output.

If the CMP node is of any other word class, the CMP node is not merged with the HED node. If the CMP node has child nodes, the CMP node is merged with all child nodes under it, the part of speech of the CMP node is kept, and the result is output; if the CMP node has no child nodes, the part of speech of the CMP node is kept and the result is output.
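Step S310 is the second case in which a node is merged into the HED node. The sketch below illustrates the three branches; the `Tok` record, the function names, the action labels and the tag letters `v` and `a` (taken to mean verb and adjective) are assumptions.

```python
from typing import List, NamedTuple, Tuple, Union


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def subtree(tokens: List[Tok], idx: int) -> List[int]:
    """Indices of a node and of every node below it."""
    found = [idx]
    for t in tokens:
        if t.head == idx:
            found.extend(subtree(tokens, t.idx))
    return found


def handle_cmp(tokens: List[Tok], hed_idx: int, cmp_idx: int) -> Tuple[str, Union[int, str]]:
    """Return ("new_core", index) or ("output", phrase) for a CMP node."""
    by_idx = {t.idx: t for t in tokens}
    cmp_tok = by_idx[cmp_idx]
    if cmp_tok.pos == "v":
        # Verb complement: becomes a new sentence core.
        return ("new_core", cmp_idx)
    span = subtree(tokens, cmp_idx)
    if cmp_tok.pos == "a":
        # Adjective complement: merge with HED and keep the HED tag.
        span = span + [hed_idx]
        tag = by_idx[hed_idx].pos
    else:
        # Any other class: stay separate and keep the CMP tag.
        tag = cmp_tok.pos
    phrase = " ".join(by_idx[i].word for i in sorted(span))
    return ("output", phrase + "_" + tag)


# Toy parse (hypothetical)
tokens = [Tok(1, "do", "v", 0, "HED"), Tok(2, "good", "a", 1, "CMP")]
print(handle_cmp(tokens, 1, 2))  # ('output', 'do good_v')
```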

Step S311: look for a POB node. If one exists, judge the part of speech of the POB node; if not, go to step S312.

If the POB node is a verb, the POB node is not merged with the HED node; the POB node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence.

If the POB node is of any other word class, the POB node is not merged with the HED node. If the POB node has child nodes, the POB node is merged with all of its child nodes, the part of speech of the POB node is kept, and the result is output; if the POB node has no child nodes, the part of speech of the POB node is kept and the result is output.
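The POB handling, including the case where the POB node hangs under a preposition that was itself found as a VOB, FOB or ADV node, can be sketched as follows. The `Tok` record, the function names and the English glosses of the toy tokens are assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def subtree(tokens: List[Tok], idx: int) -> List[int]:
    """Indices of a node and of every node below it."""
    found = [idx]
    for t in tokens:
        if t.head == idx:
            found.extend(subtree(tokens, t.idx))
    return found


def handle_preposition(tokens: List[Tok], prep_idx: int) -> Tuple[List[str], List[int]]:
    """Output a preposition, then judge its POB child node.

    Returns (outputs, new_cores): the phrases to emit and the indices
    of verb POB nodes to be processed as new sentence cores.
    """
    by_idx = {t.idx: t for t in tokens}
    prep = by_idx[prep_idx]
    outputs = [prep.word + "_" + prep.pos]
    new_cores = []
    for t in tokens:
        if t.head == prep_idx and t.rel == "POB":
            if t.pos == "v":
                new_cores.append(t.idx)
            else:
                words = [by_idx[i].word for i in sorted(subtree(tokens, t.idx))]
                outputs.append(" ".join(words) + "_" + t.pos)
    return outputs, new_cores


# Toy parse (glosses are illustrative)
tokens = [
    Tok(1, "as for", "p", 7, "ADV"),
    Tok(2, "related", "n", 6, "ATT"),
    Tok(3, "reform", "n", 6, "ATT"),
    Tok(4, "and", "c", 5, "LAD"),
    Tok(5, "innovation", "n", 3, "COO"),
    Tok(6, "measure", "n", 1, "POB"),
    Tok(7, "monitor", "v", 0, "HED"),
]
print(handle_preposition(tokens, 1))
# (['as for_p', 'related reform and innovation measure_n'], [])
```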

Step S312: look for a COO node. If one exists, judge the part of speech of the COO node; if not, go to step S313.

If the COO node is a verb, the COO node is not merged with the HED node; the COO node is taken as a sentence core, and steps S303-S314 are re-executed on the sentence.

If the COO node is of any other word class, the sentence to be processed is put into the invalid sentence subset.
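Because a verb COO node becomes a sentence core in turn, step S312 effectively walks a chain of coordinated verbs. One way to sketch this is a simple work queue; the `Tok` record, the function name `collect_cores` and the tag letter `v` for verbs are assumptions, and only the COO relation is followed here.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def collect_cores(tokens: List[Tok], root_idx: int) -> Tuple[List[int], bool]:
    """Follow verb COO links from the HED node.

    Returns (cores, valid). `valid` is False as soon as a non-verb COO
    child of a core is met, in which case the sentence would be put
    into the invalid sentence subset.
    """
    cores, queue = [], [root_idx]
    while queue:
        core = queue.pop(0)
        cores.append(core)
        for t in tokens:
            if t.head == core and t.rel == "COO":
                if t.pos != "v":
                    return cores, False
                queue.append(t.idx)
    return cores, True


# Toy parse (hypothetical): three coordinated verbs
tokens = [
    Tok(1, "monitor", "v", 0, "HED"),
    Tok(2, "choose", "v", 1, "COO"),
    Tok(3, "do", "v", 2, "COO"),
]
print(collect_cores(tokens, 1))  # ([1, 2, 3], True)
```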

Step S313: find the secondary child nodes of the HED node, namely LAD (left adjunct relation) and RAD (right adjunct relation) nodes, and judge whether each node has already been merged as part of a phrase in the preceding steps. If so, no further processing is needed; if not, the node and its part of speech are output directly.

Step S314: if the HED node has not been merged with other components for output, the HED node and its part of speech are output.
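Steps S313 and S314 clean up whatever the earlier steps left unmerged. A minimal sketch, assuming the earlier steps record which token indices they have already absorbed and whether the HED node was merged; the `Tok` record, the function name, the toy tokens and the tag `u` are hypothetical.

```python
from typing import List, NamedTuple, Set


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


def leftover_outputs(tokens: List[Tok], hed_idx: int,
                     merged_idxs: Set[int], hed_merged: bool) -> List[str]:
    """Emit LAD/RAD children of HED that no phrase has absorbed,
    then the HED node itself if it was never merged."""
    outputs = []
    for t in tokens:
        is_secondary = t.head == hed_idx and t.rel in ("LAD", "RAD")
        if is_secondary and t.idx not in merged_idxs:
            outputs.append(t.word + "_" + t.pos)
    if not hed_merged:
        hed = next(t for t in tokens if t.idx == hed_idx)
        outputs.append(hed.word + "_" + hed.pos)
    return outputs


# Toy parse (hypothetical): a verb followed by a particle
tokens = [
    Tok(1, "he", "r", 2, "SBV"),
    Tok(2, "left", "v", 0, "HED"),
    Tok(3, "le", "u", 2, "RAD"),
]
print(leftover_outputs(tokens, 2, set(), False))  # ['le_u', 'left_v']
print(leftover_outputs(tokens, 2, {3}, True))     # []
```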

Still taking the sentence to be processed listed above as an example, the tag information shown in Table 2 can be obtained. The phrase extraction process is as follows:

1. Flow with "recall" as the sentence core;

1.1, find the primary child nodes of "recall", which include SBV/VOB/IOB/FOB/DBL/ADV/CMP/COO/POB;

SBV: Wang X;

VOB: Li X;

ADV: 25th;

COO: propose;

1.2, judge the part of speech of the SBV node;

Wang X_nh is a person name, so it is not merged with the HED node;

"Wang X" has child nodes, so all child nodes under it are merged: the child node "deputy head" and the child node of "deputy head", "logistics department";

Output: logistics department deputy head Wang X_nh;

1.3, judge the part of speech of the VOB node;

Li X_nh is a person name, so it is not merged with the HED node;

"Li X" has child nodes, so all child nodes under it are merged: the child node "ambassador", the child nodes of "ambassador", "a certain country" and "stationed in", and the child node of "stationed in", "China";

Output: a certain country's ambassador to China Li X_nh;

1.4, judge the part of speech of the ADV node;

25th_nt is of another word class, so it is not merged with the HED node;

"25th" has no child nodes, so output: 25th_nt;

1.5, judge the part of speech of the COO node;

propose_v is a verb, so enter the flow with "propose" as the sentence core;

1.6, find the secondary child nodes of "recall";

None;

2. Flow with "propose" as the sentence core;

2.1, find the primary child nodes of "propose";

VOB: negotiation_n;

ADV: just_p;

COO: urge_v;

2.2, judge the part of speech of the VOB node;

negotiation_n is a noun, so it is not merged with the HED node;

"negotiation" has child nodes, so all child nodes under it are merged: the child nodes "solemn" and "protest", and the child nodes of "protest", "and" and "strong";

Output: solemn negotiation and strong protest_n;

2.3, judge the part of speech of the ADV node;

just_p is a preposition, so it is not merged with the HED node;

Output: just_p;

judge the part of speech of the POB node: pass_v is a verb, so enter the flow with "pass" as the sentence core;

2.4, judge the part of speech of the COO node;

urge_v is a verb, so enter the flow with "urge" as the sentence core;

2.5, find the secondary child nodes of "propose";

None;

3. Flow with "pass" as the sentence core;

3.1, find the primary child nodes of "pass";

VOB: act_n;

3.2, judge the part of speech of the VOB node;

act_n is a noun, so it is not merged with the HED node;

"act" has child nodes, so all child nodes under it, including "economy" and "trade", are merged;

Output: so-called regional economy and trade act_n;

3.3, find the secondary child nodes of "pass";

None;

4. Flow with "urge" as the sentence core;

4.1, find the primary child nodes of "urge";

DBL: a certain party_n;

VOB: correct_v;

4.2, judge whether the DBL node has child nodes;

The DBL node is not merged with the HED node, and the DBL node has no child nodes.

Output: a certain party_n;

4.3, judge the part of speech of the VOB node;

correct_v is a verb, so enter the flow with "correct" as the sentence core;

4.4, find the secondary child nodes of "urge";

None;

5. Flow with "correct" as the sentence core;

5.1, find the primary child nodes of "correct";

VOB: error_n;

ADV: immediately_d;

5.2, judge the part of speech of the VOB node;

error_n is a noun, so it is not merged with the HED node.

It has no child nodes, so output: error_n;

5.3, judge the part of speech of the ADV node;

immediately_d is an adverb, is next to the HED node and has no child nodes, so it is merged with the HED node;

Output: immediately correct_v;

5.4, find the secondary child nodes of "correct";

None.

Through the above 5 sentence-core flows, the final output, after sorting, is as follows:

logistics department deputy head Wang X_nh;

25th_nt;

recall_v;

a certain country's ambassador to China Li X_nh;

just_p;

a certain organization_n;

pass_v;

so-called regional economy and trade act_n;

propose_v;

solemn negotiation and strong protest_n;

urge_v;

a certain party_n;

immediately correct_v;

error_n.

That is, each phrase can be output together with its part of speech.
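The sorted list above appears to follow the order of the words in the sentence. Under that assumption, the final output step can be sketched by remembering where each phrase starts and sorting on that position; the `Phrase` record and `format_output` are hypothetical names, and the `phrase_tag` suffix format mirrors the listing above.

```python
from typing import List, NamedTuple


class Phrase(NamedTuple):
    start: int  # position of the phrase's first word in the sentence
    text: str
    pos: str    # part of speech kept for the phrase


def format_output(phrases: List[Phrase]) -> List[str]:
    """Sort phrases by sentence position and append the tag as a suffix."""
    ordered = sorted(phrases, key=lambda p: p.start)
    return [p.text + "_" + p.pos for p in ordered]


# Phrases may be produced in any order by the sentence-core flows.
phrases = [
    Phrase(3, "error", "n"),
    Phrase(1, "immediately correct", "v"),
]
print(format_output(phrases))  # ['immediately correct_v', 'error_n']
```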

Take another sentence to be processed as an example: "As for the related reform and innovation measures, not only must the specific achievements be monitored in a timely manner, but the adjustment and promotion work must also be done well at a chosen time."

The word segmentation and part-of-speech tagging results are:

as for_p related_n reform_n and_c innovation_n measure_n ,_w not only_d must_v timely_d monitor_v specific_a achievement_n ,_w also_d must_v choose the time_v do_v good_a adjustment_n and_c promotion_n work_n ._w

The dependency syntax result is:

No.	Word	Part of speech	Parent node	Dependency relation
1	as for	p	11	ADV
2	related	n	6	ATT
3	reform	n	6	ATT
4	and	c	5	LAD
5	innovation	n	3	COO
6	measure	n	1	POB
7	，	wp	1	WP
8	not only	d	9	ADV
9	must	v	11	ADV
10	timely	d	11	ADV
11	monitor	v	0	HED
12	specific	a	13	ATT
13	achievement	n	11	VOB
14	，	wp	11	WP
15	also	d	16	ADV
16	must	v	17	ADV
17	choose the time	v	11	COO
18	do	v	17	COO
19	good	a	18	CMP
20	adjustment	n	23	ATT
21	and	c	22	LAD
22	promotion	n	20	COO
23	work	n	18	VOB
24	。	wp	18	WP
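A dependency table of this kind maps directly onto a list of records, from which the sentence core and its primary child nodes can be looked up. The sketch below encodes the rows above; the English glosses are simplified for readability, and the `Tok` record and helper names are assumptions rather than part of the application.

```python
from typing import List, NamedTuple, Tuple


class Tok(NamedTuple):
    idx: int   # sequence number
    word: str
    pos: str   # part-of-speech tag
    head: int  # parent node, 0 for the root
    rel: str   # dependency relation


ROWS = [
    (1, "as for", "p", 11, "ADV"), (2, "related", "n", 6, "ATT"),
    (3, "reform", "n", 6, "ATT"), (4, "and", "c", 5, "LAD"),
    (5, "innovation", "n", 3, "COO"), (6, "measure", "n", 1, "POB"),
    (7, ",", "wp", 1, "WP"), (8, "not only", "d", 9, "ADV"),
    (9, "must", "v", 11, "ADV"), (10, "timely", "d", 11, "ADV"),
    (11, "monitor", "v", 0, "HED"), (12, "specific", "a", 13, "ATT"),
    (13, "achievement", "n", 11, "VOB"), (14, ",", "wp", 11, "WP"),
    (15, "also", "d", 16, "ADV"), (16, "must", "v", 17, "ADV"),
    (17, "choose the time", "v", 11, "COO"), (18, "do", "v", 17, "COO"),
    (19, "good", "a", 18, "CMP"), (20, "adjustment", "n", 23, "ATT"),
    (21, "and", "c", 22, "LAD"), (22, "promotion", "n", 20, "COO"),
    (23, "work", "n", 18, "VOB"), (24, ".", "wp", 18, "WP"),
]
TOKENS = [Tok(*row) for row in ROWS]

# Relations treated as primary child nodes of a sentence core.
MAIN_RELS = {"SBV", "VOB", "IOB", "FOB", "DBL", "ADV", "CMP", "COO", "POB"}


def find_core(tokens: List[Tok]) -> Tok:
    """Return the node carrying the core relation tag HED."""
    return next(t for t in tokens if t.rel == "HED")


def primary_children(tokens: List[Tok], core_idx: int) -> List[Tuple[str, str]]:
    """List (relation, word) for the primary child nodes of a core."""
    return [(t.rel, t.word) for t in tokens
            if t.head == core_idx and t.rel in MAIN_RELS]


core = find_core(TOKENS)
print(core.word, core.pos)  # monitor v
for rel, word in primary_children(TOKENS, core.idx):
    print(rel, word)
```

Punctuation (relation WP) is attached to the core as well, but it is filtered out because it is not one of the primary relations.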

The phrase extraction process is as follows:

1. Flow with "monitor" as the sentence core;

1.1, find the primary child nodes of "monitor";

VOB: achievement_n;

ADV: timely_d;

ADV: must_v;

ADV: as for_p;

COO: choose the time_v;

1.2, judge the part of speech of the VOB node;

achievement_n is a noun, so it is not merged with the HED node. "achievement" has child nodes, so all child nodes under it are merged, that is, "specific" and "achievement" are merged;

Output: specific achievement_n;

1.3, judge the part of speech of the ADV nodes;

timely_d is an adverb, is next to the HED node and has no child nodes, so the ADV node is merged with the HED node;

Output: timely monitor_v;

must_v is a verb, so enter the flow with "must" as the sentence core;

as for_p is a preposition, so it is not merged with the HED node;

Output: as for_p;

judge the part of speech of the POB child node of the ADV node: measure_n is of another word class and the POB node has child nodes, so the POB node and its child nodes are merged;

Output: related reform and innovation measure_n;

1.4, judge the part of speech of the COO node;

choose the time_v is a verb, so enter the flow with "choose the time" as the sentence core.

1.5, find the secondary child nodes of "monitor";

None;

2. Flow with "must" as the sentence core;

2.1, find the primary child nodes of "must";

ADV: not only_d;

2.2, judge the part of speech of the ADV node;

not only_d is an adverb, is next to the HED node and has no child nodes, so it is merged with the HED node, the part of speech of the HED node is kept, and the result is output;

Output: not only must_v;

3. Flow with "choose the time" as the sentence core;

3.1, find the primary child nodes of "choose the time";

ADV: must_v;

COO: do_v;

3.2, judge the part of speech of the ADV node;

must_v is a verb, so enter the flow with "must" as the sentence core;

3.3, judge the part of speech of the COO node;

do_v is a verb, so enter the flow with "do" as the sentence core;

4. Flow with "must" as the sentence core;

4.1, find the primary child nodes of "must";

ADV: also_d;

also_d is an adverb, is next to the HED node and has no child nodes, so it is merged with the HED node, the part of speech of the HED node is kept, and the result is output;

Output: also must_v;

5. Flow with "do" as the sentence core;

5.1, find the primary child nodes of "do";

VOB: work_n;

CMP: good_a;

5.2, judge the part of speech of the VOB node;

work_n is a noun, so it is not merged with the HED node. "work" has child nodes, so all child nodes under it are merged: the child node "adjustment", the child node of "adjustment", "promotion", and the child node of "promotion", "and";

Output: adjustment and promotion work_n;

5.3, judge the part of speech of the CMP node;

good_a is an adjective and "good" has no child nodes, so it is merged with the HED node, the part of speech of the HED node is kept, and the result is output;

Output: do good_v;

Through the above 5 sentence-core flows, the final output, after sorting, is as follows:

as for_p;

related reform and innovation measure_n;

not only must_v;

timely monitor_v;

specific achievement_n;

also must_v;

choose the time_v;

do good_v;

adjustment and promotion work_n.

The present application provides a method for extracting phrases from sentences whose core is a verb. It uses three basic NLP modules (word segmentation, part-of-speech tagging and dependency parsing), analyzes the grammar of modern Chinese and extracts fixed usages, thereby forming the merging rules. No corpus needs to be trained, the rules can be combined flexibly, and the granularity of the phrases can be changed by adding or removing steps.

The following is an embodiment of the apparatus of the present application, which may be used to execute the phrase extraction method embodiment of the present application described above. For details not disclosed in the apparatus embodiments of the present application, please refer to the phrase extraction method embodiments of the present application.

Fig. 4 is a block diagram of a phrase extraction apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus includes: sentence acquisition module 410, tag generation module 420, part of speech determination module 430, target word lookup module 440, and merge determination module 450.

a sentence acquisition module 410, configured to acquire a sentence to be processed;

a tag generation module 420, configured to sequentially perform word segmentation, part-of-speech tagging and dependency syntax processing on the sentence to be processed, and to generate dependency relation tags between different words and a part-of-speech tag for each word;

a part-of-speech judging module 430, configured to judge, according to the dependency relation tags between different words and the part-of-speech tag of each word, whether the core relation word carrying the core relation tag is a verb;

a target word searching module 440, configured to search for a target word that forms a specified dependency relation with the core relation word when the core relation word is a verb;

and a merging judgment module 450, configured to determine, according to the tag information of the target word, whether to merge and output the core relation word and the target word.
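The division into modules 410-450 can be pictured as methods of one small class. This is a hedged sketch only: the class and method names are hypothetical, the parser is injected as a plain callable standing in for the segmentation, tagging and dependency parsing tools, and the merge decision is reduced to the two cases described above in which a node is merged with the HED node (an adjacent adverb ADV node and an adjective CMP node).

```python
from typing import Callable, List, NamedTuple, Optional


class Tok(NamedTuple):
    idx: int   # 1-based position in the sentence
    word: str
    pos: str   # part-of-speech tag
    head: int  # index of the parent node, 0 for the root
    rel: str   # dependency label to the parent


TARGET_RELS = {"SBV", "VOB", "IOB", "FOB", "DBL", "ADV", "CMP", "COO", "POB"}


class PhraseExtractor:
    def __init__(self, parser: Callable[[str], List[Tok]]):
        # Stand-in for segmentation + POS tagging + dependency parsing.
        self.parser = parser

    def acquire(self, sentence: str) -> str:                  # module 410
        return sentence.strip()

    def tag(self, sentence: str) -> List[Tok]:                # module 420
        return self.parser(sentence)

    def verb_core(self, tokens: List[Tok]) -> Optional[Tok]:  # module 430
        for t in tokens:
            if t.rel == "HED":
                return t if t.pos == "v" else None
        return None

    def targets(self, tokens: List[Tok], core: Tok) -> List[Tok]:  # module 440
        return [t for t in tokens
                if t.head == core.idx and t.rel in TARGET_RELS]

    def should_merge(self, core: Tok, target: Tok) -> bool:   # module 450
        if target.rel == "ADV" and target.pos == "d":
            return abs(target.idx - core.idx) == 1  # adjacent adverb
        return target.rel == "CMP" and target.pos == "a"


def fake_parser(sentence: str) -> List[Tok]:
    """Hypothetical parser returning a fixed toy parse."""
    return [
        Tok(1, "immediately", "d", 2, "ADV"),
        Tok(2, "correct", "v", 0, "HED"),
        Tok(3, "error", "n", 2, "VOB"),
    ]


extractor = PhraseExtractor(fake_parser)
tokens = extractor.tag(extractor.acquire(" immediately correct error "))
core = extractor.verb_core(tokens)
decisions = [(t.word, extractor.should_merge(core, t))
             for t in extractor.targets(tokens, core)]
print(decisions)  # [('immediately', True), ('error', False)]
```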

For details of how the functions and roles of each module in the above apparatus are implemented, refer to the implementation of the corresponding steps in the phrase extraction method described above; they are not repeated here.

In the several embodiments provided in the present application, the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Claims

1. A phrase extraction method, the method comprising:

Acquiring sentences to be processed;

determining whether to combine and output the core relation word and the target word according to the label information of the target word;

wherein the searching the target word forming the appointed dependency relationship with the core relationship word comprises the following steps:

if the part of speech of the target word is a verb, taking the target word as a core word, and determining whether to merge and output the core word and a vocabulary item according to the part of speech of the vocabulary item that forms the specified dependency relation with the core word;

if the part of speech of the target word is a preposition, judging the part of speech of the object that forms a preposition-object relation with the target word, and, if the part of speech of the object is a verb, taking the object as a core word and determining whether to merge and output the core word and a vocabulary item according to the part of speech of the vocabulary item that forms the specified dependency relation with the core word;

2. The method of claim 1, wherein the locating the target word that constitutes a specified dependency with the core relationship word comprises:

3. The method of claim 2, wherein if the part of speech of the target word is an adverb and is adjacent to the core relation word, combining and outputting the core relation word and the target word comprises:

4. The method according to claim 2, wherein determining whether to perform the merging output of the core relation word and the target word according to the tag information of the target word includes:

5. The method according to claim 2, wherein determining whether to perform the merging output of the core relation word and the target word according to the tag information of the target word includes:

6. The method of claim 1, wherein the locating the target word that constitutes a specified dependency with the core relationship word comprises:

7. The method of claim 6, wherein if the part of speech of the target word is an adjective, combining the core relationship word and the target word and outputting, comprises:

8. The method of claim 6, wherein determining whether to perform the merging output of the core relationship word and the target word according to the tag information of the target word comprises:

9. The method of claim 1, wherein the locating the target word that constitutes a specified dependency with the core relationship word comprises:

10. The method as recited in claim 1, further comprising:

and outputting the corresponding part of speech of the phrase while outputting the phrase.

11. A phrase extraction apparatus, the apparatus comprising:

a target word searching module, configured to search for a target word that forms a specified dependency relation with the core relation word when the core relation word is a verb, by searching for a target word that has a subject-verb relation, a verb-object relation or a fronted-object relation with the core relation word;

a merging judgment module, configured to determine, according to the tag information of the target word, whether to merge and output the core relation word and the target word; if the part of speech of the target word is a verb, to take the target word as a core word and determine whether to merge and output the core word and a vocabulary item according to the part of speech of the vocabulary item that forms the specified dependency relation with the core word; if the part of speech of the target word is a preposition, to judge the part of speech of the object that forms a preposition-object relation with the target word, and, if the part of speech of the object is a verb, to take the object as a core word and determine whether to merge and output the core word and a vocabulary item according to the part of speech of the vocabulary item that forms the specified dependency relation with the core word; if the part of speech of the target word is a noun, pronoun or numeral, to judge whether the target word has child nodes; and, if the target word has child nodes, to merge and output the target word and the vocabulary items corresponding to the child nodes.

12. An electronic device, the electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the phrase extraction method of any one of claims 1-10.