CN110134942B - Text hotspot extraction method and device - Google Patents
- Publication number
- CN110134942B (application number CN201910260924.8A)
- Authority
- CN
- China
- Prior art keywords
- text data
- short text
- similarity
- short
- vectors
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the invention relate to a text hotspot extraction method and device. The method comprises: segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data, whose character length is greater than a set character threshold, and third short text data; generating corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm; vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors; determining the similarity between any two text vectors based on a similarity algorithm; and merging two text vectors whose similarity is greater than a similarity threshold. Short sentences formed from relation words extracted by syntactic analysis improve the readability and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves the semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.
Description
Technical Field
The embodiment of the invention relates to the field of data processing, in particular to a text hotspot extracting method and device.
Background
Hotspot extraction extracts core summary short sentences from a known text to serve as classification categories, so that a user can quickly find event topics of interest on an application platform and obtain related information. To improve the accuracy of information extraction and the comprehensibility of the extraction results, this scheme proposes extracting semantically meaningful short texts by means of dependency syntax analysis, and merging the extraction results by means of a similarity technique.
At present, many related information extraction tasks rely on keyword extraction. A keyword can be a single word or a phrase formed by several words, and is the smallest unit that expresses the subject meaning of a document.
However, keyword extraction can only identify the most representative segment or vocabulary of a document for a certain event or topic, and cannot accurately reflect the whole content of the text.
Disclosure of Invention
In view of the above, in order to solve, or at least partially solve, the above technical problem, embodiments of the present invention provide a text hotspot extraction method and device.
In a first aspect, an embodiment of the present invention provides a text hotspot extracting method, including:
segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, and the character length of the second short text data is greater than a set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
In one possible embodiment, generating the corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm includes:
performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;
determining the words in the second short text data that are labeled with semantic relations;
selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data;
and generating the corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.
In a possible embodiment, the vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors includes:
performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, from the vectorization result, the average word vector of each short text data as its text vector.
In one possible embodiment, the determining the similarity between any two text vectors based on a similarity algorithm includes:
calculating the similarity between any two text vectors by adopting a first formula based on a cosine similarity algorithm;
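wherein the first formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.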
In a possible embodiment, the merging the two text vectors with the similarity greater than the similarity threshold includes:
when the similarity is larger than a similarity threshold value, determining that the two text vectors belong to the same topic;
reordering the short text data corresponding to the two text vectors, determining the most frequent word in the two short text data, and merging them to obtain merged fifth short text data.
In a second aspect, an embodiment of the present invention provides a text hotspot extracting device, including:
the segmentation module is configured to segment at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
the generating module is used for generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
the processing module is used for vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
the calculation module is used for determining the similarity between any two text vectors based on a similarity algorithm;
and the merging module is used for merging the two text vectors with the similarity greater than a similarity threshold value.
In a possible embodiment, the generating module is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate the corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.
In a possible embodiment, the processing module is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
In a possible implementation manner, the calculating module is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;
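wherein the first formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.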
In a possible embodiment, the merging module is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.
According to the text hotspot extraction scheme provided by the embodiments of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency syntax analysis algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two text vectors is determined based on a similarity algorithm; and two text vectors whose similarity is greater than a similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the readability and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves the semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.
Drawings
Fig. 1 is a schematic flowchart of a text hotspot extracting method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of generating the fourth short text data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a directed graph of component relations according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flowchart of a text hotspot extracting method provided in an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
S11: segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data.
In this embodiment, the at least one input text data may be internet text data acquired by a crawler. Taking a single text data as input, the document is split according to a specified format by a punctuation-based regular expression, and the resulting plurality of first short text data are returned as a short sentence list. The first short text data comprise second short text data and third short text data: the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold.
It should be noted that the character threshold may be set according to actual requirements, for example 2, 4, 6, 8 or 10; this embodiment does not specifically limit it.
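Step S11 can be illustrated with a minimal Python sketch. The delimiter set, the function name `split_short_texts` and the default threshold of 10 are assumptions made for illustration; the source only states that a punctuation-based regular expression and a configurable character threshold are used.

```python
import re

# Assumed delimiter set: common Chinese and Western sentence punctuation.
# Note that "." would also split decimals such as "3.5"; a real rule set
# would need to handle such cases.
_DELIMITERS = re.compile(r"[，。！？；,.!?;\n]+")


def split_short_texts(text, char_threshold=10):
    """Split text on punctuation and partition the pieces by character length.

    Returns (longer, shorter): pieces longer than the threshold (the
    "second" short texts) and pieces not longer than it (the "third").
    """
    pieces = [p.strip() for p in _DELIMITERS.split(text)]
    pieces = [p for p in pieces if p]  # drop empty fragments
    longer = [p for p in pieces if len(p) > char_threshold]
    shorter = [p for p in pieces if len(p) <= char_threshold]
    return longer, shorter


longer, shorter = split_short_texts(
    "Stock prices rose sharply today, analysts agree. Good news!",
    char_threshold=12,
)
```

In this sketch only the longer pieces would be passed on to dependency parsing, while the shorter pieces go directly to vectorization.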
S12: generating corresponding fourth short text data from the second short text data by using a dependency syntax analysis algorithm.
The specific steps of generating the fourth short text data are shown in Fig. 2 and include:
S21: performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm.
S22: determining the words in the second short text data that are labeled with semantic relations.
Based on the HanLP dependency syntax analysis algorithm, word segmentation and part-of-speech tagging are performed on the second short text data, a directed graph of syntactic component relations is generated, and the semantic relation of each word in the sentence is labeled. Fig. 3 shows an example of the generated directed graph of component relations.
S23: selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data.
S24: generating the corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.
Based on the dependency syntax analysis algorithm, the core relation words, subject-predicate components and verb-object components of a sentence can be extracted and used to form new short sentences with semantic structure for the document summary extraction task. Dependency syntax analysis satisfies the following conditions: exactly one component in each sentence is independent; every other component depends on some component; no component depends on two or more components; and the components on the left and right sides of the core relation word have no relation to each other.
Sentence component analysis based on the HanLP dependency syntax analysis algorithm yields words labeled with semantic relation types. The words that directly describe the subject matter of the sentence, such as those in a subject-predicate relation, a verb-object relation or the core relation, are selected and combined to form the new fourth short text data.
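The selection in steps S23 and S24 can be sketched as a filter over an already-parsed sentence. The sketch below does not call HanLP; it assumes the parser output has been converted into a list of (word, relation label) pairs, and the label names `SBV`, `VOB` and `HED` for the subject-predicate, verb-object and core relations are assumptions about the tag set. `ParsedWord` and `condense` are hypothetical names.

```python
from typing import List, NamedTuple


class ParsedWord(NamedTuple):
    text: str
    relation: str  # dependency label of this word relative to its head


# Assumed label names: subject-predicate, verb-object and core (head).
KEPT_RELATIONS = {"SBV", "VOB", "HED"}


def condense(parsed: List[ParsedWord], joiner: str = "") -> str:
    """Keep only words whose relation is in KEPT_RELATIONS, in sentence order."""
    return joiner.join(w.text for w in parsed if w.relation in KEPT_RELATIONS)


# Hand-labelled example parse (an assumption, not real parser output):
# "公司昨天发布了新产品" (The company released a new product yesterday).
sentence = [
    ParsedWord("公司", "SBV"),
    ParsedWord("昨天", "ADV"),
    ParsedWord("发布", "HED"),
    ParsedWord("了", "RAD"),
    ParsedWord("新", "ATT"),
    ParsedWord("产品", "VOB"),
]
short_text = condense(sentence)
```

Here the condensed phrase keeps the subject, the core verb and the object, and drops the adverbial and attributive modifiers.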
S13: vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors.
Specifically, word segmentation is performed on the third short text data and the fourth short text data, and the word segmentation result is vectorized by using Word2vec; the average word vector of each short text data is then determined from the vectorization result and used as its text vector.
The short sentences obtained in the preceding steps are each segmented into words, the words are converted into word-level vectors by the Word2vec word-vector method, and the average word vector of each short sentence is calculated. The calculation formula is as follows:
$$P_v = \frac{1}{n}\sum_{i=1}^{n} V_i$$
where $V_i$ is the word vector of the $i$-th word, $n$ is the number of words in the short sentence, and $P_v$ is the phrase vector.
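The averaging step can be sketched in plain Python. The embedding table below is a toy stand-in for a trained Word2vec model, and skipping words that are missing from the table is an assumption; the source does not say how out-of-vocabulary words are handled. `average_vector` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def average_vector(words: Sequence[str],
                   embeddings: Dict[str, List[float]]) -> List[float]:
    """Return the element-wise mean of the word vectors of the known words."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in phrase")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings (illustrative values only).
toy_embeddings = {"price": [1.0, 0.0], "rise": [0.0, 1.0]}
phrase_vector = average_vector(["price", "rise", "unknown"], toy_embeddings)
```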
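S14: determining the similarity between any two text vectors based on a similarity algorithm.
Specifically, the similarity between any two text vectors is calculated by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.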
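The cosine similarity of two text vectors can be computed as in the following sketch. Returning 0.0 when either vector has zero length is an assumption made to avoid division by zero; the source does not address that case.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)
```

Parallel vectors give a value close to 1, and orthogonal vectors give 0.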
S15: merging the two text vectors whose similarity is greater than the similarity threshold.
Specifically, when the similarity is greater than the similarity threshold, it is determined that the two text vectors belong to the same topic; the short text data corresponding to the two text vectors are reordered, the most frequent word in the two short text data is determined, and the two are merged to obtain merged fifth short text data.
Further, the similarity threshold may be set according to actual requirements, for example 0.6 or 0.8, which is not limited in this embodiment.
When the similarity is greater than the similarity threshold, the two short text data are determined to belong to the same topic and the hotspot information is merged, which avoids repeated topics when the extracted hotspot short sentences are integrated. When the similarity is not greater than the similarity threshold, the two short text data are determined to be unrelated.
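One possible reading of the merging step is sketched below. The source does not specify how the merged fifth short text is composed, so several choices here are assumptions: phrases are grouped transitively with a union-find structure, each group is reported together with its most frequent word, and every phrase is assumed to contain at least one token. `merge_similar` and `_cosine` are hypothetical names.

```python
import math
from collections import Counter


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_similar(phrases, vectors, threshold=0.8):
    """Group phrases whose vectors are similar above the threshold.

    phrases: list of token lists; vectors: matching list of text vectors.
    Returns a list of (most_frequent_word, member_phrases) per group.
    """
    parent = list(range(len(phrases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose similarity exceeds the threshold.
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if _cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, tokens in enumerate(phrases):
        groups.setdefault(find(i), []).append(tokens)

    merged = []
    for members in groups.values():
        counts = Counter(tok for tokens in members for tok in tokens)
        top_word = counts.most_common(1)[0][0]
        merged.append((top_word, members))
    return merged


phrases = [["price", "rise"], ["price", "surge"], ["team", "win"]]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
hotspots = merge_similar(phrases, vectors, threshold=0.8)
```

With these toy vectors the first two phrases fall into one group whose most frequent word is "price", and the third phrase stays on its own. The pairwise comparison is quadratic in the number of phrases, which is acceptable for a sketch but may need indexing for large inputs.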
According to the text hotspot extraction method provided by the embodiments of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency syntax analysis algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two text vectors is determined based on a similarity algorithm; and two text vectors whose similarity is greater than a similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the readability and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison preserves the semantic information among words, which supports the accuracy of deduplication and avoids redundant hotspot information as far as possible.
Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention, as shown in fig. 4, the device specifically includes:
a segmentation module 401, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
a generating module 402, configured to generate, by using a dependency parsing algorithm, corresponding fourth short text data from the second short text data;
a processing module 403, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain multiple corresponding text vectors;
a calculating module 404, configured to determine a similarity between any two text vectors based on a similarity algorithm;
a merging module 405, configured to merge the two text vectors with the similarity greater than a similarity threshold.
Optionally, the generating module 402 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate the corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.
Optionally, the processing module 403 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
Optionally, the calculating module 404 is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;
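wherein the first formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.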
Optionally, the merging module 405 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.
The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 4, and may perform all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.
Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention. As shown in Fig. 5, the device specifically includes a processor 510, a memory 520 and a transceiver 530.
The memory 520 is used to store various applications, operating systems and data, and may transfer the stored data to the processor 510. The memory 520 may include volatile memory, non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), magnetoresistive random access memory (MRAM) and the like; for example, at least one magnetic disk storage device, an electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash or NAND flash, or a semiconductor device such as a solid state disk (SSD). The memory 520 may also comprise a combination of the above kinds of memory.
The transceiver 530 is used for transmitting and/or receiving data; the transceiver 530 may be an antenna or the like.
The working process of each component is as follows:
a processor 510, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
a processor 510, configured to generate a fourth corresponding short text data from the second short text data by using a dependency parsing algorithm;
a processor 510, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
a processor 510 for determining a similarity between any two of the text vectors based on a similarity algorithm;
and the processor 510 is configured to perform merging processing on the two text vectors with the similarity greater than a similarity threshold.
Optionally, the processor 510 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine the words in the second short text data that are labeled with semantic relations; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words and core relation words that represent the second short text data; and generate the corresponding fourth short text data from the subject-predicate relation words, the verb-object relation words and the core relation words.
Optionally, the processor 510 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
Optionally, the processor 510 is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;
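wherein the first formula is:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the $i$-th components of text vector A and text vector B, respectively, and $n$ is the vector dimension.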
Optionally, the processor 510 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the most frequent word in the two short text data, and merge them to obtain merged fifth short text data.
The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 5, and may execute all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.
An embodiment of the invention also provides a storage medium (a computer-readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk or a solid state disk; and it may also comprise a combination of the above kinds of memory.
When the one or more programs in the storage medium are executed by one or more processors, the text hotspot extraction method performed on the text hotspot extracting device side is implemented.
The processor is configured to execute the text hotspot extracting program stored in the memory, so as to implement the following steps of the text hotspot extracting method executed on the text hotspot extracting device side:
segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, and the character length of the second short text data is greater than a set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
Optionally, performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;
determining words with labeled semantic relations in the second short text data;
selecting, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word that represent the second short text data;
and generating the corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
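A minimal sketch of how a long text might be shortened once a dependency parse is available. It does not call HanLP; the input is a hand-labeled list of (word, relation) pairs, and the relation tags `SBV`, `VOB` and `HED` are assumed here to stand for the subject-predicate, verb-object and core relations.

```python
from typing import List, Tuple

# Assumed relation tags: SBV = subject-predicate, VOB = verb-object,
# HED = core (head) relation. Any other tag is dropped.
KEPT_RELATIONS = {"SBV", "VOB", "HED"}


def compress(parsed: List[Tuple[str, str]], sep: str = "") -> str:
    """Keep only the words whose dependency relation is in KEPT_RELATIONS.

    `parsed` is a list of (word, relation) pairs in sentence order, as
    would be produced by some dependency parser (hypothetical input).
    """
    return sep.join(word for word, relation in parsed if relation in KEPT_RELATIONS)


# Hand-labeled example (illustrative, not real parser output).
example = [
    ("customers", "SBV"),
    ("strongly", "ADV"),
    ("dislike", "HED"),
    ("slow", "ATT"),
    ("delivery", "VOB"),
]
print(compress(example, sep=" "))
```

For Chinese text the default empty separator would join the kept words directly; the space separator is used above only because the example is in English.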
Optionally, performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, from the vectorization result, the average word vector of each short text data as its text vector.
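The averaging step can be sketched as follows. A small hand-written dictionary stands in for a trained Word2vec model, and the function name and the handling of unknown words are assumptions made for illustration.

```python
def average_vector(tokens, embeddings):
    """Return the element-wise mean of the word vectors of `tokens`.

    `embeddings` maps a word to a list of floats (a stand-in for a
    trained Word2vec model). Words without a vector are skipped;
    None is returned when no token has a vector.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional embeddings (hypothetical values).
toy_embeddings = {"slow": [1.0, 0.0], "delivery": [0.0, 1.0]}
text_vector = average_vector(["slow", "delivery", "unknown"], toy_embeddings)
```

Skipping out-of-vocabulary words is one possible design choice; the source does not say how such words are treated.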
Optionally, the determining the similarity between any two text vectors based on a similarity algorithm includes:
calculating the similarity between any two text vectors by adopting a first formula based on a cosine similarity algorithm;
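wherein the first formula is:
$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein A_i is the i-th component of text vector A, B_i is the i-th component of text vector B, and n is the vector dimension.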
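Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch in plain Python follows; returning 0.0 for a zero vector is an assumption made here to avoid division by zero, not something the source specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)
```

Because the value depends only on the angle between the vectors, two short texts with different numbers of words can still score close to 1 when their averaged vectors point in the same direction.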
Optionally, when the similarity is greater than a similarity threshold, determining that the two text vectors belong to the same topic;
rearranging the short text data corresponding to the two text vectors, determining the word with the highest frequency in the two short text data, and merging them to obtain merged fifth short text data.
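The merging step is described only briefly, so the following is one possible interpretation rather than the patented procedure: short texts are grouped greedily when their vectors exceed the similarity threshold, and each group is labeled with its most frequent word. The function names, the greedy strategy and the threshold value are all assumptions.

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar(items, threshold=0.8):
    """Greedily group (tokens, vector) pairs by cosine similarity.

    Each item joins the first existing group whose representative
    vector (the vector of the group's first member) is more similar
    than `threshold`; otherwise it starts a new group. Returns a list
    of (most_frequent_word, merged_tokens) pairs, one per group.
    """
    groups = []
    for tokens, vector in items:
        for group in groups:
            if _cosine(group["vector"], vector) > threshold:
                group["tokens"].extend(tokens)
                break
        else:
            groups.append({"vector": vector, "tokens": list(tokens)})
    result = []
    for group in groups:
        counts = Counter(group["tokens"])
        label = counts.most_common(1)[0][0] if counts else ""
        result.append((label, group["tokens"]))
    return result


# Toy input: two texts about delivery and one about price.
toy_items = [
    (["slow", "delivery"], [1.0, 0.0]),
    (["delivery", "late"], [0.9, 0.1]),
    (["price", "price", "great"], [0.0, 1.0]),
]
merged = merge_similar(toy_items, threshold=0.8)
```

In this toy run the first two texts fall into one group labeled "delivery", and the third forms its own group labeled "price".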
Those of skill in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A text hotspot extraction method is characterized by comprising the following steps:
segmenting at least one piece of input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
2. The method according to claim 1, wherein generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm comprises:
performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;
determining words with labeled semantic relations in the second short text data;
selecting, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word that represent the second short text data;
and generating the corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
3. The method of claim 2, wherein vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors comprises:
performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, from the vectorization result, the average word vector of each short text data as its text vector.
4. The method of claim 1, wherein determining the similarity between any two text vectors based on a similarity algorithm comprises:
calculating the similarity between any two text vectors by adopting a first formula based on a cosine similarity algorithm;
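wherein the first formula is:
$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein A_i is the i-th component of text vector A, B_i is the i-th component of text vector B, and n is the vector dimension.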
5. The method according to claim 4, wherein the merging the two text vectors with the similarity greater than the similarity threshold comprises:
when the similarity is larger than a similarity threshold value, determining that the two text vectors belong to the same topic;
rearranging the short text data corresponding to the two text vectors, determining the word with the highest frequency in the two short text data, and merging them to obtain merged fifth short text data.
6. A text hotspot extracting device is characterized by comprising:
the segmentation module is configured to segment at least one piece of input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold;
the generating module is used for generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
the processing module is used for vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
the calculation module is used for determining the similarity between any two text vectors based on a similarity algorithm;
and the merging module is used for merging the two text vectors with the similarity greater than a similarity threshold value.
7. The apparatus according to claim 6, wherein the generating module is specifically configured to: perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine words with labeled semantic relations in the second short text data; select, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word that represent the second short text data; and generate the corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
8. The apparatus according to claim 7, wherein the processing module is specifically configured to: perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
9. The apparatus according to claim 6, wherein the calculating module is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;
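wherein the first formula is:
$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein A_i is the i-th component of text vector A, B_i is the i-th component of text vector B, and n is the vector dimension.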
10. The apparatus according to claim 9, wherein the merging module is specifically configured to: determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and rearrange the short text data corresponding to the two text vectors, determine the word with the highest frequency in the two short text data, and merge them to obtain merged fifth short text data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910260924.8A CN110134942B (en) | 2019-04-01 | 2019-04-01 | Text hotspot extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910260924.8A CN110134942B (en) | 2019-04-01 | 2019-04-01 | Text hotspot extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134942A (en) | 2019-08-16 |
CN110134942B (en) | 2020-10-23 |
Family
ID=67569047
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910260924.8A Active CN110134942B (en) | 2019-04-01 | 2019-04-01 | Text hotspot extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134942B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110874531B (en) * | 2020-01-20 | 2020-07-10 | 湖南蚁坊软件股份有限公司 | Topic analysis method and device and storage medium |
CN112069785A (en) * | 2020-08-06 | 2020-12-11 | 北京明略昭辉科技有限公司 | Text sampling method and device for improving labeling efficiency |
CN112101008A (en) * | 2020-09-27 | 2020-12-18 | 北京百度网讯科技有限公司 | Text popularity determination method and device, electronic equipment and storage medium |
CN112183111B (en) * | 2020-09-28 | 2024-08-23 | 亚信科技(中国)有限公司 | Long text semantic similarity matching method, device, electronic equipment and storage medium |
CN112364641A (en) * | 2020-11-12 | 2021-02-12 | 北京中科闻歌科技股份有限公司 | Chinese countermeasure sample generation method and device for text audit |
CN113052487A (en) * | 2021-04-12 | 2021-06-29 | 平安国际智慧城市科技股份有限公司 | Evaluation text processing method and device and computer equipment |
CN113139375A (en) * | 2021-04-21 | 2021-07-20 | 洛阳墨潇网络科技有限公司 | Paper similarity detection method and device based on big data |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108319668A (en) * | 2018-01-23 | 2018-07-24 | 义语智能科技(上海)有限公司 | Generate the method and apparatus of text snippet |
CN109101489A (en) * | 2018-07-18 | 2018-12-28 | 武汉数博科技有限责任公司 | A kind of text automatic abstracting method, device and a kind of electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372208B (en) * | 2016-09-05 | 2019-07-12 | 东南大学 | A kind of topic viewpoint clustering method based on statement similarity |
Non-Patent Citations (2)
Title |
---|
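K Nearest Neighbor for Text Summarization using Feature Similarity; Taeho Jo; 《2017 International Conference on Communication, Control, Computing and Electronics Engineering》; 2017-01-16; pp. 1-5 * |
Microblog topic summary generation algorithm based on topological structure (基于拓扑结构的微博话题摘要生成算法); Zhao Bin et al.; 《数据采集与处理》 (Journal of Data Acquisition and Processing); 2014-09-30; Vol. 29 (No. 5); pp. 720-729 * |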
Also Published As
Publication number | Publication date |
---|---|
CN110134942A (en) | 2019-08-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110134942B (en) | Text hotspot extraction method and device | |
Bhargava et al. | Sentiment analysis for mixed script indic sentences | |
CN107085581B (en) | Short text classification method and device | |
US10360294B2 (en) | Methods and systems for efficient and accurate text extraction from unstructured documents | |
Sadat et al. | Automatic identification of arabic dialects in social media | |
CN107341143B (en) | Sentence continuity judgment method and device and electronic equipment | |
US10528664B2 (en) | Preserving and processing ambiguity in natural language | |
CN105095204A (en) | Method and device for obtaining synonym | |
JP2020126493A (en) | Paginal translation processing method and paginal translation processing program | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
CN112784009B (en) | Method and device for mining subject term, electronic equipment and storage medium | |
Abinaya et al. | Amrita_cen@ fire-2014: Named entity recognition for indian languages using rich features | |
CN108052509A (en) | A kind of Text similarity computing method, apparatus and server | |
Noshin Jahan et al. | Bangla real-word error detection and correction using bidirectional lstm and bigram hybrid model | |
CN113449084A (en) | Relationship extraction method based on graph convolution | |
Wong et al. | iSentenizer‐μ: Multilingual Sentence Boundary Detection Model | |
Nararatwong et al. | Improving Thai word and sentence segmentation using linguistic knowledge | |
Castro et al. | Discriminating between Brazilian and European Portuguese national varieties on Twitter texts | |
CN111160445B (en) | Bid file similarity calculation method and device | |
CN113330430B (en) | Sentence structure vectorization device, sentence structure vectorization method, and recording medium containing sentence structure vectorization program | |
Khomytska et al. | Automated Identification of Authorial Styles. | |
Gholami-Dastgerdi et al. | Part of speech tagging using part of speech sequence graph | |
van Heusden et al. | Wooir: A new open page stream segmentation dataset | |
Rajan et al. | Survey of nlp resources in low-resource languages nepali, sindhi and konkani | |
Shivakumar et al. | Comparative study of factored smt with baseline smt for english to kannada |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |