CN110134942B

CN110134942B - Text hotspot extraction method and device

Info

Publication number: CN110134942B
Application number: CN201910260924.8A
Authority: CN
Inventors: 王宇琪; 孔庆超; 黄秋曼; 方省; 曹家; 罗引; 王磊; 赵菲菲; 张西娜
Original assignee: Beijing Zhongke Wenge Technology Co ltd
Current assignee: Beijing Zhongke Wenge Technology Co ltd
Priority date: 2019-04-01
Filing date: 2019-04-01
Publication date: 2020-10-23
Anticipated expiration: 2039-04-01
Also published as: CN110134942A

Abstract

The embodiments of the invention relate to a text hotspot extraction method and device. The method comprises: segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data; generating corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm; vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors; determining the similarity between any two text vectors based on a similarity algorithm; and merging two text vectors whose similarity is greater than a similarity threshold. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain its core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.

Description

Text hotspot extraction method and device

Technical Field

The embodiment of the invention relates to the field of data processing, in particular to a text hotspot extracting method and device.

Background

Hotspot extraction extracts core summary short sentences from a known text to serve as classification categories, so that a user can quickly find event topics of interest on an application platform and acquire related information. To improve the accuracy of information extraction and the comprehensibility of the extraction result, this scheme proposes a method based on dependency syntax analysis to extract short texts that carry semantic meaning, and merges the extraction results based on a similarity technique.

At present, most related information extraction tasks are based on keyword extraction technology. A keyword may be a single word or a phrase formed from several words, and is the smallest unit expressing the subject of a document.

However, keyword extraction can only identify the most representative segment or vocabulary of a document for a certain event or topic, and cannot accurately reflect the whole content of the text.

Disclosure of Invention

In view of the above, in order to solve the technical problem or at least partially solve the technical problem, embodiments of the present invention provide a text hotspot extracting method and apparatus.

In a first aspect, an embodiment of the present invention provides a text hotspot extracting method, including:

segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data include second short text data and third short text data, and the character length of the second short text data is greater than a set character threshold;

generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;

vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;

determining a similarity between any two of the text vectors based on a similarity algorithm;

and merging the two text vectors with the similarity larger than a similarity threshold value.

In one possible embodiment, generating corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm includes:

performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;

determining the words in the second short text data that are labeled with semantic relations;

selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data;

and generating corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

In a possible embodiment, vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors includes:

performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;

and determining, from the vectorization result, the average word vector of each short text data as its text vector.

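In one possible embodiment, determining the similarity between any two text vectors based on a similarity algorithm includes:

calculating the similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;

wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.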

In a possible embodiment, merging the two text vectors whose similarity is greater than the similarity threshold includes:

when the similarity is greater than the similarity threshold, determining that the two text vectors belong to the same topic;

rearranging the short text data corresponding to the two text vectors, determining the most frequent word in the two short text data, and merging them to obtain merged fifth short text data.

In a second aspect, an embodiment of the present invention provides a text hotspot extracting device, including:

the segmentation module is configured to segment at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;

the generating module is used for generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;

the processing module is used for vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;

the calculation module is used for determining the similarity between any two text vectors based on a similarity algorithm;

and the merging module is used for merging the two text vectors with the similarity greater than a similarity threshold value.

In a possible embodiment, the generating module is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

In a possible embodiment, the processing module is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.

In a possible implementation manner, the calculating module is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;

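wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.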

In a possible embodiment, the merging module is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than a similarity threshold; rearrange the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.

According to the text hotspot extraction scheme provided by the embodiments of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency syntax analysis algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two of the text vectors is determined based on a similarity algorithm; and two text vectors whose similarity is greater than the similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain its core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.

Drawings

Fig. 1 is a schematic flowchart of a text hotspot extracting method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of generating fourth short text data according to an embodiment of the present invention;

Fig. 3 is a directed graph of component relationships according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.

Fig. 1 is a schematic flowchart of a text hotspot extracting method provided in an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:

S11, segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data.

In this embodiment, the at least one input text data may be internet text data acquired by a crawler. For each input text data, the document is segmented according to a specified format using a punctuation-based regular expression, and the plurality of first short text data obtained by the segmentation are returned as a short-sentence list. The first short text data include second short text data and third short text data: the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold.

It should be noted that the character threshold may be set according to actual requirements, such as 2, 4, 6, 8, 10, and the like, and this embodiment is not limited in particular.
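As an illustration of step S11, the following is a minimal sketch that splits a text on punctuation with a regular expression and partitions the result by character length. The delimiter set, the function name `segment`, and the threshold value of 6 are assumptions chosen for the example; the embodiment only states that the regular expression is based on punctuation marks and that the threshold is configurable.

```python
import re

# Assumed delimiter set: common Chinese and Western punctuation plus whitespace.
# The embodiment does not specify the exact pattern.
DELIMITERS = re.compile(r"[，。！？；、,.!?;\s]+")


def segment(text, char_threshold=6):
    """Split text into first short texts, then partition them by length.

    Returns (second, third):
      second - short texts longer than char_threshold (sent to dependency parsing)
      third  - short texts not longer than char_threshold (kept unchanged)
    """
    first = [s for s in DELIMITERS.split(text) if s]
    second = [s for s in first if len(s) > char_threshold]
    third = [s for s in first if len(s) <= char_threshold]
    return second, third


second, third = segment("今天天气很好，我们决定去公园散步。然后回家", char_threshold=6)
print(second)  # ['我们决定去公园散步']
print(third)   # ['今天天气很好', '然后回家']
```

A compiled pattern is used so the same rule can be applied to every crawled document without recompiling it.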

S12, generating corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm.

The specific steps of generating the fourth short text data may refer to fig. 2, and specifically include:

S21, performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm.

S22, determining the words in the second short text data that are labeled with semantic relations.

Based on the HanLP dependency syntax analysis algorithm, word segmentation and part-of-speech tagging are performed on the second short text data, a directed graph of syntactic component relations is generated, and the semantic relation of each word in the sentence is labeled. The generated directed graph of component relations is shown in Fig. 3.

S23, selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data.

S24, generating corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

The core relation words, subject-predicate components and verb-object components of a sentence can be extracted based on the dependency syntax analysis algorithm and used to form a new short sentence with semantic structure for the document summary extraction task. Dependency syntax analysis satisfies the following conditions: only one component in each sentence is independent; every other component depends on some component; no component may depend on two or more components; and the components on the left and right sides of the core relation word have no relation to each other.

Component analysis is performed on the sentence based on the HanLP dependency syntax analysis algorithm to obtain words labeled with semantic relation types. Words that directly describe the subject of the sentence, such as those in a subject-predicate relation, a verb-object relation or a core relation, are selected and combined to form the new fourth short text data.
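The selection in steps S23 and S24 can be sketched as a filter over parser output. The example below does not call HanLP; it uses a hand-written list of (word, relation) pairs standing in for a real parse, and the relation label strings and the function name `build_fourth_short_text` are assumptions that should be adjusted to whatever labels the parser actually emits.

```python
# Assumed relation labels for subject-predicate, verb-object and core (head)
# relations. These strings are illustrative, not taken from the embodiment.
KEPT_RELATIONS = {"主谓关系", "动宾关系", "核心关系"}


def build_fourth_short_text(parsed_tokens):
    """Keep only words whose dependency relation is in KEPT_RELATIONS.

    parsed_tokens: list of (word, relation) pairs in sentence order.
    Returns the kept words joined in their original order.
    """
    return "".join(word for word, rel in parsed_tokens if rel in KEPT_RELATIONS)


# Hypothetical parse of "公司昨天正式发布了新产品"; a real system would obtain
# these pairs from the dependency parser.
parsed = [
    ("公司", "主谓关系"),
    ("昨天", "状中结构"),
    ("正式", "状中结构"),
    ("发布", "核心关系"),
    ("了", "右附加关系"),
    ("新", "定中关系"),
    ("产品", "动宾关系"),
]
print(build_fourth_short_text(parsed))  # 公司发布产品
```

Keeping the words in sentence order is one simple way to preserve a readable subject, verb, object structure in the shortened text.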

S13, vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors.

Specifically, word segmentation is performed on the third short text data and the fourth short text data, and the word segmentation result is vectorized by using Word2vec; from the vectorization result, the average word vector of each short text data is determined as its text vector.

The short sentences obtained in the preceding steps are each segmented into words, each word is converted into a word vector using the Word2vec word-vector method, and the average word vector of each short sentence is calculated. The calculation formula is as follows:

$$P_v = \frac{1}{n}\sum_{i=1}^{n} V_i$$

where $V_i$ is the vector of the $i$-th word, $n$ is the number of words in the short sentence, and $P_v$ is the short-sentence vector.
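The averaging step can be illustrated with a small sketch. The embedding table below is a toy dictionary standing in for a trained Word2vec model, and skipping out-of-vocabulary words is an assumption; the embodiment does not say how such words are handled.

```python
def average_vector(words, embeddings):
    """Return the element-wise mean of the word vectors of `words`.

    embeddings: dict mapping word -> list of floats (all the same length).
    Words missing from `embeddings` are skipped (an assumption of this sketch).
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words to average")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy 3-dimensional vectors; real Word2vec vectors typically have far more dimensions.
toy_embeddings = {
    "公司": [1.0, 0.0, 2.0],
    "发布": [0.0, 2.0, 0.0],
    "产品": [2.0, 1.0, 1.0],
}
print(average_vector(["公司", "发布", "产品"], toy_embeddings))  # [1.0, 1.0, 1.0]
```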

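S14, determining the similarity between any two text vectors based on a similarity algorithm.

Specifically, the similarity between any two text vectors is calculated by using a first formula based on a cosine similarity algorithm;

wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.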
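The cosine similarity between two text vectors can be computed as in the following minimal sketch. Returning 0.0 when either vector has zero length is a convention chosen for this example, not something the embodiment specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1.0 (same direction)
```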

S15, merging the two text vectors whose similarity is greater than the similarity threshold.

Specifically, when the similarity is greater than the similarity threshold, the two text vectors are determined to belong to the same topic; the short text data corresponding to the two text vectors are rearranged, the most frequent word in the two short text data is determined, and the two are merged to obtain merged fifth short text data.

Further, the similarity threshold may be set according to actual requirements, such as 0.6, 0.8, and the like, which is not limited in this embodiment.

When the similarity is greater than the similarity threshold, the two short text data are determined to belong to the same topic and their hotspot information is merged, which avoids repeated topics when the extracted hotspot short sentences are integrated. When the similarity is not greater than the similarity threshold, the two short text data are determined to be unrelated.
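The threshold decision and merge can be sketched as follows. The embodiment does not spell out the rearrangement rule, so this example adopts one possible reading: the words of both short texts are pooled, duplicates are removed, and the words are ordered by descending frequency with ties broken by first occurrence. The function name `merge_words`, this ordering rule, and the threshold of 0.8 are all assumptions.

```python
from collections import Counter


def merge_words(words_a, words_b, similarity, threshold=0.8):
    """Merge two segmented short texts if they are similar enough.

    Returns None when similarity does not exceed the threshold (unrelated
    texts); otherwise returns the pooled unique words, most frequent first.
    """
    if similarity <= threshold:
        return None
    pooled = words_a + words_b
    counts = Counter(pooled)
    first_seen = {}
    for index, word in enumerate(pooled):
        first_seen.setdefault(word, index)
    # Order by frequency (descending), then by first position (ascending).
    return sorted(counts, key=lambda w: (-counts[w], first_seen[w]))


a = ["公司", "发布", "产品"]
b = ["公司", "推出", "产品"]
print(merge_words(a, b, similarity=0.9))  # ['公司', '产品', '发布', '推出']
print(merge_words(a, b, similarity=0.5))  # None
```

Under this reading, words shared by both short texts rise to the front of the merged result, so the merged text leads with the terms the two texts have in common.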

According to the text hotspot extraction method provided by the embodiments of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency syntax analysis algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two of the text vectors is determined based on a similarity algorithm; and two text vectors whose similarity is greater than the similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain its core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.

Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention, as shown in fig. 4, the device specifically includes:

a segmentation module 401, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;

a generating module 402, configured to generate, by using a dependency parsing algorithm, corresponding fourth short text data from the second short text data;

a processing module 403, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain multiple corresponding text vectors;

a calculating module 404, configured to determine a similarity between any two text vectors based on a similarity algorithm;

a merging module 405, configured to merge the two text vectors with the similarity greater than a similarity threshold.

Optionally, the generating module 402 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

Optionally, the processing module 403 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.

Optionally, the calculating module 404 is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;

wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.

Optionally, the merging module 405 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than a similarity threshold; rearrange the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.

The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 4, and may perform all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.

Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention, as shown in fig. 5, the device specifically includes:

processor 510, memory 520, transceiver 530.

The processor 510 may be a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 520 is used to store various applications, operating systems, and data. The memory 520 may transfer the stored data to the processor 510. The memory 520 may include volatile memory and non-volatile memory, for example non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), magnetoresistive random access memory (MRAM), at least one magnetic disk storage device, electrically erasable programmable read-only memory (EEPROM), flash memory devices such as NOR flash or NAND flash, and semiconductor devices such as a solid state disk (SSD). The memory 520 may also comprise a combination of memories of the kinds described above.

The transceiver 530 is used to transmit and/or receive data, and may be an antenna or the like.

The working process of each device is as follows:

a processor 510, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;

a processor 510, configured to generate corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm;

a processor 510, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;

a processor 510 for determining a similarity between any two of the text vectors based on a similarity algorithm;

and the processor 510 is configured to perform merging processing on the two text vectors with the similarity greater than a similarity threshold.

Optionally, the processor 510 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

Optionally, the processor 510 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.

Optionally, the processor 510 is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;

wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.

Optionally, the processor 510 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than a similarity threshold; rearrange the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.

The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 5, and may execute all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.

The embodiments of the invention also provide a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; and it may also comprise a combination of memories of the kinds described above.

The one or more programs in the storage medium are executable by one or more processors to implement the text hotspot extraction method executed on the text hotspot extraction device side.

The processor is configured to execute the text hotspot extracting program stored in the memory, so as to implement the following steps of the text hotspot extracting method executed on the text hotspot extracting device side:

Optionally, performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;

Optionally, performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;

Optionally, the determining the similarity between any two text vectors based on a similarity algorithm includes:

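wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.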

Optionally, when the similarity is greater than a similarity threshold, determining that the two text vectors belong to the same topic;

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A text hotspot extraction method is characterized by comprising the following steps:

segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data include second short text data and third short text data, the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold;

2. The method according to claim 1, wherein the generating the second short text data into corresponding fourth short text data by using a dependency parsing algorithm comprises:

3. The method of claim 2, wherein vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors comprises:

4. The method of claim 1, wherein determining the similarity between any two text vectors based on a similarity algorithm comprises:

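wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.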

5. The method according to claim 4, wherein the merging the two text vectors with the similarity greater than the similarity threshold comprises:

6. A text hotspot extracting device is characterized by comprising:

the segmentation module is configured to segment at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, the character length corresponding to the second short text data is greater than a set character threshold value, and the character length corresponding to the third short text data is not greater than the set character threshold value;

7. The apparatus according to claim 6, wherein the generating module is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.

8. The apparatus according to claim 7, wherein the processing module is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.

9. The apparatus according to claim 6, wherein the calculating module is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;

wherein the first formula is:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.

10. The apparatus according to claim 9, wherein the merging module is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than a similarity threshold; rearrange the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.