CN110134942B - Text hotspot extraction method and device - Google Patents

Text hotspot extraction method and device

Info

Publication number
CN110134942B
Authority
CN
China
Prior art keywords
text data
short text
similarity
short
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910260924.8A
Other languages
Chinese (zh)
Other versions
CN110134942A (en)
Inventor
王宇琪
孔庆超
黄秋曼
方省
曹家
罗引
王磊
赵菲菲
张西娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co ltd
Priority to CN201910260924.8A
Publication of CN110134942A
Application granted
Publication of CN110134942B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention relates to a text hotspot extraction method and a text hotspot extraction device. The method comprises: segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, the first short text data comprising second short text data and third short text data; generating corresponding fourth short text data from the second short text data by using a dependency parsing algorithm; vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors; determining the similarity between any two text vectors based on a similarity algorithm; and merging the two text vectors whose similarity is greater than a similarity threshold. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison retains the semantic information among words, thereby ensuring the accuracy of deduplication and avoiding redundant hotspot information as far as possible.

Description

Text hotspot extraction method and device
Technical Field
The embodiment of the invention relates to the field of data processing, in particular to a text hotspot extracting method and device.
Background
Hotspot extraction extracts core summary short sentences from a known text to serve as classification categories, so that a user can quickly find event topics of interest on an application platform and obtain related information. To improve the accuracy of information extraction and the comprehensibility of the extraction results, this scheme extracts semantically meaningful short texts by a method based on dependency parsing, and merges the extraction results based on a similarity technique.
At present, many related information extraction tasks are based on keyword extraction. A keyword may be a single word or a phrase formed by several words, and is the smallest unit expressing the topic of a document.
However, keyword extraction can only identify the most representative fragments or words of a document for a given event or topic, and cannot accurately reflect the overall content of the text.
Disclosure of Invention
In view of the above, in order to solve the technical problem or at least partially solve the technical problem, embodiments of the present invention provide a text hotspot extracting method and apparatus.
In a first aspect, an embodiment of the present invention provides a text hotspot extracting method, including:
segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, and the character length of the second short text data is greater than a set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
In one possible embodiment, generating the corresponding fourth short text data from the second short text data by using a dependency parsing algorithm includes:
performing sentence component analysis on the second short text data based on the HanLP dependency parsing algorithm;
determining the words labeled with semantic relations in the second short text data;
selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words, and core relation words that represent the second short text data;
and generating the corresponding fourth short text data according to the subject-predicate relation words, the verb-object relation words, and the core relation words.
In a possible embodiment, vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors includes:
performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, from the vectorization result, the average word vector of each short text data as its text vector.
In one possible embodiment, determining the similarity between any two text vectors based on a similarity algorithm includes:
calculating the similarity between any two text vectors by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the i-th components of text vector A and text vector B, respectively, and n is the vector dimension.
In a possible embodiment, merging the two text vectors whose similarity is greater than the similarity threshold includes:
when the similarity is greater than the similarity threshold, determining that the two text vectors belong to the same topic;
reordering the short text data corresponding to the two text vectors, determining the word with the highest frequency in the two short text data, and merging them to obtain merged fifth short text data.
In a second aspect, an embodiment of the present invention provides a text hotspot extracting device, including:
the segmentation module is configured to segment at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
the generating module is used for generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
the processing module is used for vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
the calculation module is used for determining the similarity between any two text vectors based on a similarity algorithm;
and the merging module is used for merging the two text vectors with the similarity greater than a similarity threshold value.
In a possible embodiment, the generating module is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency parsing algorithm; determine the words labeled with semantic relations in the second short text data; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words, and core relation words that represent the second short text data; and generate the corresponding fourth short text data according to the subject-predicate relation words, the verb-object relation words, and the core relation words.
In a possible implementation manner, the processing module is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
In a possible implementation manner, the calculating module is specifically configured to calculate the similarity between any two text vectors by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the i-th components of text vector A and text vector B, respectively, and n is the vector dimension.
In a possible embodiment, the merging module is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the word with the highest frequency in the two short text data, and merge them to obtain merged fifth short text data.
According to the text hotspot extraction scheme provided by the embodiment of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency parsing algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two of the text vectors is determined based on a similarity algorithm; and the two text vectors whose similarity is greater than the similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison retains the semantic information among words, thereby ensuring the accuracy of deduplication and avoiding redundant hotspot information as far as possible.
Drawings
Fig. 1 is a schematic flowchart of a text hotspot extracting method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of generating fourth short text data according to an embodiment of the present invention;
Fig. 3 is a directed graph of component relationships according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding of the embodiments of the present invention, the following description will be further explained with reference to specific embodiments, which are not to be construed as limiting the embodiments of the present invention.
Fig. 1 is a schematic flowchart of a text hotspot extracting method provided in an embodiment of the present invention, and as shown in fig. 1, the method specifically includes:
S11: segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data.
In this embodiment, the at least one input text data may be internet text data acquired by a crawler. For each input text data, the document is split according to a specified format by a punctuation-based regular expression, and the resulting first short text data are returned as a short sentence list. The first short text data include second short text data and third short text data: the character length of the second short text data is greater than a set character threshold, and the character length of the third short text data is not greater than the set character threshold.
It should be noted that the character threshold may be set according to actual requirements, such as 2, 4, 6, 8, 10, and the like, and this embodiment is not limited in particular.
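As an illustration of step S11, the following is a minimal Python sketch of punctuation-based splitting followed by partitioning on a character threshold. The punctuation set, the function name `split_short_texts`, and the default threshold of 10 are assumptions made for this example; the patent only states that the split uses a punctuation-based regular expression and a configurable threshold.

```python
import re

# Assumed punctuation set (Chinese and ASCII marks plus newlines); the patent
# does not list the exact characters used by its regular expression.
PUNCTUATION = r"[。！？；，、,.!?;\n]+"


def split_short_texts(text, char_threshold=10):
    """Split text on punctuation and partition the pieces by character length.

    Returns (second, third): pieces longer than the threshold, and pieces
    whose length is not greater than the threshold.
    """
    pieces = [p.strip() for p in re.split(PUNCTUATION, text)]
    first = [p for p in pieces if p]  # first short text data (non-empty pieces)
    second = [p for p in first if len(p) > char_threshold]
    third = [p for p in first if len(p) <= char_threshold]
    return second, third
```

In this sketch, the longer pieces would go on to dependency parsing while the shorter ones are kept unchanged. Splitting on the ASCII period also breaks decimal numbers, so a production rule would likely need a more careful pattern.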
S12: generating corresponding fourth short text data from the second short text data by using a dependency parsing algorithm.
The specific steps of generating the fourth short text data may refer to Fig. 2, and specifically include:
S21: performing sentence component analysis on the second short text data based on the HanLP dependency parsing algorithm.
S22: determining the words labeled with semantic relations in the second short text data.
Based on the HanLP dependency parsing algorithm, word segmentation and part-of-speech tagging are performed on the second short text data, a directed graph of syntactic component relations is generated, and the semantic relation of each word in the sentence is labeled. The generated directed graph of component relations may refer to Fig. 3.
S23: selecting, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words, and core relation words that represent the second short text data.
S24: generating the corresponding fourth short text data according to the subject-predicate relation words, the verb-object relation words, and the core relation words.
Based on the dependency parsing algorithm, the core relation words, subject-predicate components, and verb-object components of a sentence can be extracted and combined into new short sentences with semantic structure for the document summary extraction task. Dependency parsing satisfies the following conditions: only one component in each sentence is independent; every other component depends on some component; no component depends on two or more components; and the components on the left and right sides of the core relation word have no relation to each other.
Component analysis is performed on the sentence based on the HanLP dependency parsing algorithm to obtain words labeled with semantic relation types. The words that directly describe the subject matter of the sentence, such as those in the subject-predicate relation, the verb-object relation, and the core relation, are selected and combined to form the new fourth short text data.
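The selection in steps S23 and S24 can be sketched as a filter over already-parsed tokens. The sketch below does not call HanLP; it assumes the parser output has been converted into `(word, relation)` pairs, and the English relation labels, the function name `build_fourth_short_text`, and the hand-written example parse are all hypothetical. A real parser's label names may differ.

```python
# Hypothetical relation labels standing in for the parser's own label set.
KEEP_RELATIONS = {"subject-predicate", "verb-object", "core"}


def build_fourth_short_text(parsed_tokens, joiner=""):
    """Keep only words in the selected relations, preserving sentence order.

    parsed_tokens: iterable of (word, relation) pairs from a dependency parser.
    joiner: string placed between kept words ("" suits Chinese text).
    """
    kept = [word for word, relation in parsed_tokens if relation in KEEP_RELATIONS]
    return joiner.join(kept)


# Hand-written illustrative parse, not actual parser output.
example_parse = [
    ("公司", "subject-predicate"),
    ("昨天", "adverbial"),
    ("正式", "adverbial"),
    ("发布", "core"),
    ("了", "right-adjunct"),
    ("新", "attribute"),
    ("产品", "verb-object"),
]
fourth_short_text = build_fourth_short_text(example_parse)
```

Keeping the words in their original order is a design choice of this sketch; the patent does not specify how the selected words are ordered when they are combined.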
S13: vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors.
Specifically, word segmentation is performed on the third short text data and the fourth short text data, and the word segmentation result is vectorized by using Word2vec; the average word vector of each short text data is determined from the vectorization result as its text vector.
Each short sentence obtained in the preceding steps is segmented into words, each word is converted into a word vector by the Word2vec method, and the average word vector of each short sentence is calculated. The calculation formula is as follows:
$$P_v = \frac{1}{n}\sum_{i=1}^{n} V_i$$
where $V_i$ is the word vector of the i-th word, n is the number of words in the short sentence, and $P_v$ is the short sentence vector.
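The averaging step can be illustrated without a trained model. The sketch below assumes word vectors are available as a plain dictionary (`embeddings`, a hypothetical stand-in for a trained Word2vec lookup) and that words missing from the vocabulary are skipped; the patent does not say how out-of-vocabulary words are handled.

```python
def average_word_vector(words, embeddings):
    """Return the element-wise mean of the vectors of the given words.

    words: list of segmented words from one short sentence.
    embeddings: dict mapping word -> list of floats (stand-in for Word2vec).
    Returns None when no word has a vector.
    """
    # Assumption: out-of-vocabulary words are ignored rather than zero-filled.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy two-dimensional vectors, for illustration only.
toy_embeddings = {"公司": [1.0, 2.0], "产品": [3.0, 4.0]}
phrase_vector = average_word_vector(["公司", "发布", "产品"], toy_embeddings)
```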
S14: determining the similarity between any two text vectors based on a similarity algorithm.
Specifically, the similarity between any two text vectors is calculated by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the i-th components of text vector A and text vector B, respectively, and n is the vector dimension.
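For reference, cosine similarity between two equal-length vectors can be computed as in the sketch below. The function name and the choice to return 0.0 for a zero vector are assumptions of this example, not something the patent specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (lists of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat a zero vector as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)
```

Parallel vectors score close to 1 and orthogonal vectors score 0, which is why a threshold such as 0.6 or 0.8 separates near-duplicate short sentences from unrelated ones.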
S15: merging the two text vectors whose similarity is greater than the similarity threshold.
Specifically, when the similarity is greater than the similarity threshold, the two text vectors are determined to belong to the same topic; the short text data corresponding to the two text vectors are reordered, the word with the highest frequency in the two short text data is determined, and the data are merged to obtain merged fifth short text data.
Further, the similarity threshold may be set according to actual requirements, for example 0.6 or 0.8, which is not limited in this embodiment.
When the similarity is greater than the similarity threshold, the two short text data are determined to belong to the same topic and their hotspot information is merged, which avoids repeated topics when the extracted hotspot short sentences are integrated. When the similarity is not greater than the similarity threshold, the two short text data are determined to be unrelated.
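One possible reading of the merging step is sketched below: short sentences are grouped greedily whenever their vectors exceed the threshold, and the most frequent word across a group is then identified. The greedy grouping strategy and the names `merge_similar` and `most_frequent_word` are assumptions of this example; the patent does not describe how the reordering and merging are carried out in detail.

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_similar(phrases, vectors, threshold=0.8):
    """Group phrases whose vector is similar to a group's first (seed) vector."""
    groups = []  # list of (seed_vector, member_phrases)
    for phrase, vec in zip(phrases, vectors):
        for seed, members in groups:
            if _cosine(seed, vec) > threshold:
                members.append(phrase)  # same topic: merge into this group
                break
        else:
            groups.append((vec, [phrase]))  # no similar group: new topic
    return [members for _, members in groups]


def most_frequent_word(tokenized_phrases):
    """Return the word that occurs most often across the tokenized phrases."""
    counts = Counter(w for words in tokenized_phrases for w in words)
    return counts.most_common(1)[0][0]
```

Comparing each sentence only against a group's seed vector keeps the sketch simple; comparing against every member or against a group centroid would be equally valid choices under the patent's description.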
According to the text hotspot extraction method provided by the embodiment of the invention, at least one input text data is segmented according to a set rule by using a regular expression to obtain a plurality of first short text data; corresponding fourth short text data are generated from the second short text data by using a dependency parsing algorithm; the third short text data and the fourth short text data are vectorized to obtain a plurality of corresponding text vectors; the similarity between any two of the text vectors is determined based on a similarity algorithm; and the two text vectors whose similarity is greater than the similarity threshold are merged. Short sentences formed from relation words extracted by syntactic analysis improve the comprehensibility and accuracy of information extraction, so that a user can better understand the text content and obtain the core key information points. Vectorizing the short sentences with Word2vec for similarity comparison retains the semantic information among words, thereby ensuring the accuracy of deduplication and avoiding redundant hotspot information as far as possible.
Fig. 4 is a schematic structural diagram of a text hotspot extracting device according to an embodiment of the present invention, as shown in fig. 4, the device specifically includes:
a segmentation module 401, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
a generating module 402, configured to generate, by using a dependency parsing algorithm, corresponding fourth short text data from the second short text data;
a processing module 403, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain multiple corresponding text vectors;
a calculating module 404, configured to determine a similarity between any two text vectors based on a similarity algorithm;
a merging module 405, configured to merge the two text vectors with the similarity greater than a similarity threshold.
Optionally, the generating module 402 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency parsing algorithm; determine the words labeled with semantic relations in the second short text data; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words, and core relation words that represent the second short text data; and generate the corresponding fourth short text data according to the subject-predicate relation words, the verb-object relation words, and the core relation words.
Optionally, the processing module 403 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
Optionally, the calculating module 404 is specifically configured to calculate the similarity between any two text vectors by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the i-th components of text vector A and text vector B, respectively, and n is the vector dimension.
Optionally, the merging module 405 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the word with the highest frequency in the two short text data, and merge them to obtain merged fifth short text data.
The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 4, and may perform all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.
Fig. 5 is a schematic diagram of a hardware structure of a text hotspot extracting device according to an embodiment of the present invention, as shown in fig. 5, the device specifically includes:
processor 510, memory 520, transceiver 530.
Processor 510 may be a central processing unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
The memory 520 is used to store various applications, operating systems, and data. The memory 520 may transfer the stored data to the processor 510. The memory 520 may include volatile memory or non-volatile memory, such as non-volatile random access memory (NVRAM), phase-change random access memory (PRAM), or magnetoresistive random access memory (MRAM); it may also include at least one magnetic disk storage device, an electrically erasable programmable read-only memory (EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid state disk (SSD). The memory 520 may also comprise a combination of memories of the kinds described above.
A transceiver 530 for transmitting and/or receiving data, the transceiver 530 may be an antenna, etc.
The working process of each device is as follows:
a processor 510, configured to perform segmentation processing on at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, wherein the character length corresponding to the second short text data is greater than a set character threshold value;
a processor 510, configured to generate corresponding fourth short text data from the second short text data by using a dependency parsing algorithm;
a processor 510, configured to perform vectorization processing on the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
a processor 510 for determining a similarity between any two of the text vectors based on a similarity algorithm;
and the processor 510 is configured to perform merging processing on the two text vectors with the similarity greater than a similarity threshold.
Optionally, the processor 510 is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency parsing algorithm; determine the words labeled with semantic relations in the second short text data; select, from the words labeled with semantic relations, the subject-predicate relation words, verb-object relation words, and core relation words that represent the second short text data; and generate the corresponding fourth short text data according to the subject-predicate relation words, the verb-object relation words, and the core relation words.
Optionally, the processor 510 is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, from the vectorization result, the average word vector of each short text data as its text vector.
Optionally, the processor 510 is specifically configured to calculate the similarity between any two text vectors by using a first formula based on the cosine similarity algorithm;
wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ and $B_i$ are the i-th components of text vector A and text vector B, respectively, and n is the vector dimension.
Optionally, the processor 510 is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to reorder the short text data corresponding to the two text vectors, determine the word with the highest frequency in the two short text data, and merge them to obtain merged fifth short text data.
The text hotspot extracting device provided in this embodiment may be the text hotspot extracting device shown in fig. 5, and may execute all the steps of the text hotspot extracting method shown in fig. 1, so as to achieve the technical effect of the text hotspot extracting method shown in fig. 1.
The embodiment of the invention also provides a storage medium (computer readable storage medium). The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid state disk; it may also comprise a combination of memories of the kinds described above.
The one or more programs in the storage medium are executable by one or more processors to implement the text hotspot extraction method executed on the text hotspot extraction device side.
The processor is configured to execute the text hotspot extracting program stored in the memory, so as to implement the following steps of the text hotspot extracting method executed on the text hotspot extracting device side:
segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, and the character length of the second short text data is greater than a set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
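By way of non-limiting illustration, the segmentation step and the length-based split into second and third short text data might be sketched as follows. The delimiter pattern and the threshold value of 12 characters are assumptions made for the example; the embodiment only specifies "a set rule" and "a set character threshold".

```python
import re

# Assumed delimiter set and threshold; the source does not fix either value.
SPLIT_PATTERN = re.compile(r"[,.!?;，。！？；\n]+")
CHAR_THRESHOLD = 12


def segment(text, pattern=SPLIT_PATTERN):
    """Split input text into non-empty first short texts."""
    return [piece.strip() for piece in pattern.split(text) if piece.strip()]


def partition_by_length(short_texts, threshold=CHAR_THRESHOLD):
    """Return (second, third): texts longer than the threshold, and the rest."""
    second = [s for s in short_texts if len(s) > threshold]
    third = [s for s in short_texts if len(s) <= threshold]
    return second, third


first = segment(
    "The service is slow. Staff at the front desk were friendly and helpful! Good food"
)
second, third = partition_by_length(first)
```

In this sketch only the longer pieces in `second` would be passed on to dependency analysis, while the pieces in `third` would be vectorized as they are.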
Optionally, performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;
determining words with labeled semantic relations in the second short text data;
selecting, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word representing the second short text data;
and generating corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
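As a hedged illustration of the condensation step, the sketch below works on tokens that have already been labeled with dependency relations, so that it does not depend on any particular parser API. The tag names `SBV`, `VOB` and `HED` (subject-predicate, verb-object, core) are an assumed labeling scheme, the hand-written parse is hypothetical, and joining the kept words in their original order is one possible reading of "generating" the fourth short text data.

```python
# Assumed relation tags: SBV (subject-predicate), VOB (verb-object), HED (core).
KEPT_RELATIONS = {"SBV", "VOB", "HED"}


def condense(parsed_tokens, kept=KEPT_RELATIONS, joiner=" "):
    """Keep only words whose dependency relation is in `kept`, in original order."""
    return joiner.join(word for word, relation in parsed_tokens if relation in kept)


# Hand-written (hypothetical) parse of a long short text, for illustration only.
parsed = [
    ("We", "SBV"),
    ("really", "ADV"),
    ("like", "HED"),
    ("the", "ATT"),
    ("service", "VOB"),
]
fourth = condense(parsed)
```

For Chinese text, where words are not separated by spaces, `joiner=""` would be the natural choice.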
Optionally, performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, for each short text data, the average of its word vectors in the vectorization result as the text vector.
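The averaging step can be illustrated with a minimal sketch. The small embedding table below is a hypothetical stand-in for a trained Word2vec model, and skipping out-of-vocabulary words (and returning a zero vector when no word is known) is an assumption not stated in the embodiment.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 3-dimensional table standing in for a trained Word2vec model (hypothetical values).
toy_embeddings = {
    "service": [1.0, 0.0, 2.0],
    "slow": [3.0, 2.0, 0.0],
}
text_vector = average_vector(["service", "is", "slow"], toy_embeddings, dim=3)
```

Here "is" has no entry in the toy table, so the text vector is the mean of the two known word vectors.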
Optionally, the determining the similarity between any two text vectors based on a similarity algorithm includes:
calculating the similarity between any two text vectors by adopting a first formula based on a cosine similarity algorithm;
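wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ is the $i$-th component of text vector A, $B_i$ is the $i$-th component of text vector B, and $n$ is the dimension of the text vectors.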
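As a non-limiting illustration, cosine similarity between two text vectors of equal length might be computed as sketched below. Returning 0.0 when either vector is all zeros is an assumption made for the example to avoid division by zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, so a threshold close to 1 admits only near-identical topics.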
Optionally, when the similarity is greater than a similarity threshold, determining that the two text vectors belong to the same topic;
rearranging the short text data corresponding to the two text vectors, determining the word with the highest frequency in the two short text data, and merging to obtain merged fifth short text data.
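The merging step is described only briefly, so the following sketch shows one possible interpretation rather than the claimed procedure: when the similarity exceeds the threshold, the words of both short texts are pooled, counted, and rearranged so that the most frequent words come first, with duplicates removed. The function name, the threshold value of 0.8 and the tie-breaking rule are all assumptions.

```python
from collections import Counter


def merge_if_similar(tokens_a, tokens_b, similarity, threshold=0.8):
    """Return merged tokens when similarity exceeds the threshold, else None."""
    if similarity <= threshold:
        return None
    combined = tokens_a + tokens_b
    counts = Counter(combined)
    first_seen = {}
    for index, token in enumerate(combined):
        first_seen.setdefault(token, index)
    # Most frequent words first; ties keep their first-seen order.
    return sorted(counts, key=lambda t: (-counts[t], first_seen[t]))


merged = merge_if_similar(
    ["service", "slow"], ["slow", "service", "staff"], similarity=0.9
)
```

The first element of the returned list is the word with the highest frequency across the two short texts, which could serve as the label of the merged topic.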
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A text hotspot extraction method is characterized by comprising the following steps:
segmenting at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, wherein the first short text data comprise second short text data and third short text data, the character length corresponding to the second short text data is greater than a set character threshold, and the character length corresponding to the third short text data is not greater than the set character threshold;
generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
determining a similarity between any two of the text vectors based on a similarity algorithm;
and merging the two text vectors with the similarity larger than a similarity threshold value.
2. The method according to claim 1, wherein generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm comprises:
performing sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm;
determining words with labeled semantic relations in the second short text data;
selecting, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word representing the second short text data;
and generating corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
3. The method of claim 2, wherein vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors comprises:
performing word segmentation on the third short text data and the fourth short text data, and vectorizing the word segmentation result by using Word2vec;
and determining, for each short text data, the average of its word vectors in the vectorization result as the text vector.
4. The method of claim 1, wherein determining the similarity between any two text vectors based on a similarity algorithm comprises:
calculating the similarity between any two text vectors by adopting a first formula based on a cosine similarity algorithm;
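wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ is the $i$-th component of text vector A, $B_i$ is the $i$-th component of text vector B, and $n$ is the dimension of the text vectors.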
5. The method according to claim 4, wherein the merging the two text vectors with the similarity greater than the similarity threshold comprises:
when the similarity is larger than a similarity threshold value, determining that the two text vectors belong to the same topic;
rearranging the short text data corresponding to the two text vectors, determining the word with the highest frequency in the two short text data, and merging to obtain merged fifth short text data.
6. A text hotspot extracting device is characterized by comprising:
the segmentation module is configured to segment at least one input text data according to a set rule by using a regular expression to obtain a plurality of first short text data, where the first short text data includes: the second short text data and the third short text data, the character length corresponding to the second short text data is greater than a set character threshold value, and the character length corresponding to the third short text data is not greater than the set character threshold value;
the generating module is used for generating corresponding fourth short text data from the second short text data by adopting a dependency syntax analysis algorithm;
the processing module is used for vectorizing the third short text data and the fourth short text data to obtain a plurality of corresponding text vectors;
the calculation module is used for determining the similarity between any two text vectors based on a similarity algorithm;
and the merging module is used for merging the two text vectors with the similarity greater than a similarity threshold value.
7. The apparatus according to claim 6, wherein the generating module is specifically configured to perform sentence component analysis on the second short text data based on the HanLP dependency syntax analysis algorithm; determine words with labeled semantic relations in the second short text data; select, from the words with labeled semantic relations, a subject-predicate relation word, a verb-object relation word and a core relation word representing the second short text data; and generate corresponding fourth short text data according to the subject-predicate relation word, the verb-object relation word and the core relation word.
8. The apparatus according to claim 7, wherein the processing module is specifically configured to perform word segmentation on the third short text data and the fourth short text data, and vectorize the word segmentation result by using Word2vec; and determine, for each short text data, the average of its word vectors in the vectorization result as the text vector.
9. The apparatus according to claim 6, wherein the calculating module is specifically configured to calculate a similarity between any two text vectors by using a first formula based on a cosine similarity algorithm;
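wherein the first formula is:
$$\text{similarity}(A, B) = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
wherein $A_i$ is the $i$-th component of text vector A, $B_i$ is the $i$-th component of text vector B, and $n$ is the dimension of the text vectors.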
10. The apparatus according to claim 9, wherein the merging module is specifically configured to determine that the two text vectors belong to the same topic when the similarity is greater than the similarity threshold; and to rearrange the short text data corresponding to the two text vectors, determine the word with the highest frequency in the two short text data, and merge them to obtain merged fifth short text data.
CN201910260924.8A 2019-04-01 2019-04-01 Text hotspot extraction method and device Active CN110134942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910260924.8A CN110134942B (en) 2019-04-01 2019-04-01 Text hotspot extraction method and device

Publications (2)

Publication Number Publication Date
CN110134942A CN110134942A (en) 2019-08-16
CN110134942B true CN110134942B (en) 2020-10-23

Family

ID=67569047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910260924.8A Active CN110134942B (en) 2019-04-01 2019-04-01 Text hotspot extraction method and device

Country Status (1)

Country Link
CN (1) CN110134942B (en)

Also Published As

Publication number Publication date
CN110134942A (en) 2019-08-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant