WO2021051599A1

WO2021051599A1 - Method and apparatus for extracting locally optimized keywords, device and storage medium

Info

Publication number: WO2021051599A1
Application number: PCT/CN2019/118273
Authority: WO
Inventors: 陈婷婷
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-19
Filing date: 2019-11-14
Publication date: 2021-03-25
Also published as: CN110765767B; CN110765767A

Abstract

The present application relates to the technical field of big data. Disclosed is a method for extracting locally optimized keywords, comprising: receiving a text to be processed, and identifying characters in the title, the first paragraph and the last paragraph of the text to be processed; acquiring, on the basis of a preset Chinese word segmentation system, target segmented words in the title, the first paragraph and the last paragraph, and updating the part of speech of the target segmented words to the keyword part of speech; recording, in a preset hash table, weight parameters corresponding to the target segmented words by means of a part-of-speech score comparison table in the Chinese word segmentation system; traversing the text to be processed, so as to acquire associated segmented words of the target segmented words and the part of speech of the associated segmented words, and recording, in the hash table, weight parameters of the associated segmented words; and extracting the target segmented words and/or associated segmented words whose total scores rank in the top five as the keywords of the text to be processed. Further disclosed are an apparatus for extracting locally optimized keywords, a server and a storage medium. Because scoring is anchored on the target segmented words that carry the central idea of the text, errors are reduced and the accuracy of text keywords is improved.

Description

Method, apparatus, device and storage medium for extracting locally optimized keywords

This application claims priority to the Chinese patent application No. 201910884825.7, filed with the Chinese Patent Office on September 19, 2019 and entitled "Methods, Devices, Equipment, and Storage Media for Partially Optimizing Keywords", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of big data technology, and in particular to a method, device, server, and computer-readable storage medium for extracting locally optimized keywords.

Background

In natural language processing research, keywords represent the central idea of a text and play an important role in text retrieval and text classification, so keyword extraction has attracted the attention of many researchers. Traditional keyword extraction methods based on statistical features pay too much attention to the attributes of individual word segments, such as part of speech, word frequency, and position, and ignore the overall central idea of the article. At present, most keyword extraction algorithms add features such as the association relationships between word segments to the traditional statistical-feature algorithm in order to obtain the final keywords. Many researchers in China and abroad filter the large number of word segments that appear in a corpus using TF-IDF weighted word frequency, but this approach depends heavily on the size of the corpus, which can cause the estimated importance of a word segment to deviate from its normal value. The inventor realized that although keyword extraction methods based on complex networks consider the degree of each word segment, they focus too much on "small world" characteristics and ignore the influence of the "big world" and of the central idea at the level of the text content, resulting in low keyword extraction accuracy.

Summary of the invention

The main purpose of this application is to provide a method for extracting locally optimized keywords, which aims to solve the technical problem that prior-art keyword methods based only on statistical features pay too much attention to the attributes of word segments, such as part of speech, word frequency, and position, and ignore the overall central idea of the article, leading to inaccurate keywords.

To achieve the above objective, the present application provides a method for extracting locally optimized keywords. The method for extracting locally optimized keywords includes:

Receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;

Segmenting the characters in the title, the first paragraph and the last paragraph based on a preset Chinese word segmentation system to obtain a word segment set for the title, the first paragraph and the last paragraph, and updating the part of speech of each target word segment in the word segment set to the keyword part of speech;

Recording the weight parameters corresponding to each target word segment in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;

Traversing the text to be processed, obtaining the associated word segments of the target word segments and the parts of speech of the associated word segments, and recording the weight parameters of the associated word segments in the hash table;

Extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the part of speech of each associated word segment, the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed.

In addition, in order to achieve the above objective, the present application also provides a device for extracting locally optimized keywords, the device for extracting locally optimized keywords includes:

The recognition unit is used to receive the text to be processed, and to recognize the characters in the title, the first paragraph and the last paragraph of the text to be processed;

The update unit is used to segment the characters in the title, the first paragraph and the last paragraph based on a preset Chinese word segmentation system, obtain a word segment set for the title, the first paragraph and the last paragraph, and update the part of speech of each target word segment in the word segment set to the keyword part of speech;

The first recording unit is configured to record the weight parameters corresponding to each target word segment in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;

The second recording unit is used to traverse the text to be processed, obtain the associated word segments of the target word segments and the parts of speech of the associated word segments, and record the weight parameters of the associated word segments in the hash table;

The extraction unit is used to extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the part of speech of each associated word segment, the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed.

In addition, in order to achieve the above objective, the present application also provides a server. The server includes a memory, a processor, and a program for extracting locally optimized keywords that is stored in the memory and runs on the processor. When the program for extracting locally optimized keywords is executed by the processor, the steps of the method for extracting locally optimized keywords described above are implemented.

In addition, in order to achieve the above objective, the present application also provides a computer-readable storage medium in which computer instructions are stored. When the computer instructions are run on a computer, the computer executes the above-mentioned method for extracting locally optimized keywords.

The method, device, server, and computer-readable storage medium for extracting locally optimized keywords proposed in the embodiments of the present application receive the text to be processed and recognize the characters in its title, first paragraph, and last paragraph; segment those characters based on a preset Chinese word segmentation system to obtain a word segment set for the title, first paragraph and last paragraph, and update the part of speech of each target word segment in the set to the keyword part of speech; record the weight parameters corresponding to each target word segment in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency; traverse the text to be processed, obtain the associated word segments of the target word segments and their parts of speech, and record the weight parameters of the associated word segments in the hash table; and, according to the weight parameters recorded in the hash table, extract the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed. In this way, the keywords are chosen from the highest-scoring target or associated word segments, based on the part-of-speech scores and word frequencies of the target word segments that carry the central idea and of their associated word segments, which reduces errors and improves the accuracy of the text keywords.

Description of the drawings

FIG. 1 is a schematic diagram of a server structure of a hardware operating environment involved in a solution of an embodiment of the application;

FIG. 2 is a schematic flowchart of a first embodiment of a method for extracting locally optimized keywords according to this application;

FIG. 3 is a schematic diagram of the detailed flow of step S10 in FIG. 2;

FIG. 4 is a schematic diagram of the detailed flow of step S20 in FIG. 2;

FIG. 5 is a detailed flowchart of step S30 in FIG. 2;

FIG. 6 is a schematic flowchart of a second embodiment of a method for extracting locally optimized keywords according to this application;

FIG. 7 is a detailed flowchart of step S50 in FIG. 2.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the application, and not used to limit the application.

The main solution of the embodiments of this application is to: receive the text to be processed and identify the characters in its title, first paragraph and last paragraph; segment the characters in the title, first paragraph and last paragraph based on a preset Chinese word segmentation system to obtain a word segment set, and update the part of speech of each target word segment in the set to the keyword part of speech; record the weight parameters corresponding to each target word segment in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency; traverse the text to be processed, obtain the associated word segments of the target word segments and their parts of speech, and record the weight parameters of the associated word segments in the hash table; and, according to the weight parameters recorded in the hash table, extract the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed.

Prior-art keyword methods based on statistical features pay too much attention to the attributes of word segments, such as part of speech, word frequency, and position, and ignore the overall central idea of the article, which leads to the technical problem of inaccurate keywords.

This application provides a solution: using the part-of-speech scores and word frequencies of the target word segments that carry the central idea, together with those of their associated word segments, the target or associated word segments with the highest total scores are taken as keywords, which reduces errors and improves the accuracy of text keywords.

As shown in FIG. 1, FIG. 1 is a schematic diagram of the server structure of the hardware operating environment involved in the solution of the embodiment of the application.

The terminal in the embodiment of this application is a server.

As shown in FIG. 1, the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art can understand that the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown in the figure, combine some components, or use a different arrangement of components.

As shown in FIG. 1, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a program for extracting locally optimized keywords.

In the terminal shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and communicate with the back-end server; the user interface 1003 is mainly used to connect to a client (user side) to communicate with the client; and the processor 1001 can be used to call the extraction program of locally optimized keywords stored in the memory 1005, and perform the following operations:

Receive the text to be processed, and identify the characters in the title, first paragraph and last paragraph of the text to be processed;

Segment the characters in the title, the first paragraph and the last paragraph based on the preset Chinese word segmentation system to obtain a word segment set for the title, the first paragraph and the last paragraph, and update the part of speech of each target word segment in the word segment set to the keyword part of speech;

Record the weight parameters corresponding to each target word segment in the preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;

Traverse the text to be processed, obtain the associated word segments of the target word segments and the parts of speech of the associated word segments, and record the weight parameters of the associated word segments in the hash table;

According to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the part of speech of each associated word segment, extract the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed.

Further, the processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:

Receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;

Use the characters between the first space character position and the second space character position as the title of the text to be processed, use the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and use the characters between the N-(N-1) space character position and the N space character position as the last paragraph of the text to be processed;

Call the preset character recognition program to recognize the characters in the title, first paragraph and last paragraph.

When the characters in the title, the first paragraph and the last paragraph are recognized, activate the preset Chinese word segmentation system to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words;

Obtain, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose parts of speech are nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words, and determine the characters with a part-of-speech score greater than 0 as the target word segments;

Collect the target word segments into a word segment set, and identify the part of speech of each target word segment in the word segment set as the keyword part of speech.

Retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

The target word segmentation is used as the search condition, and the word frequency of each target word segmentation in the title, the first paragraph and the end is indexed, and the score value and word frequency of each target word segmentation are recorded in a hash table.

Traverse the text to be processed with a preset character recognition program to recognize the characters in the text to be processed, and use the preset Chinese word segmentation system to divide those characters into multiple word segments;

Extract the first word segment in the text to be processed, and judge whether the first word segment is a target word segment in the word segment set;

When the first word segment is a target word segment in the word segment set, determine that the second word segment preceding the first word segment and the third word segment following the first word segment are associated word segments of the target word segment, and obtain the part of speech and word frequency of the associated word segments;

Look up the part-of-speech score comparison table in the Chinese word segmentation system to obtain the part-of-speech score corresponding to each associated word segment, and record the part-of-speech score and word frequency of the associated word segment in the hash table.

When the first word segment is not a target word segment in the word segment set, judge whether the first word segment is an associated word segment of a target word segment;

When it is determined that the first word segment is an associated word segment of a target word segment, record the part of speech and the word frequency of the first word segment in the hash table.

Obtain the preset calculation rule and calculate the total score of each target word segment and associated word segment in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;

Sort the total scores in the hash table from largest to smallest or from smallest to largest, extract the target word segments and/or associated word segments whose total scores rank in the top five, and take the extracted word segments as the keywords of the text to be processed.

Referring to FIG. 2, which shows a first embodiment of the method for extracting locally optimized keywords of this application, the method for extracting locally optimized keywords includes:

Step S10, receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;

When the server receives the text to be processed from the terminal, it determines the positions of the title, the first paragraph and the last paragraph of the text. Specifically, the title is generally located in the middle of the first line of the text to be processed; it may also be on the line above a certain paragraph, and the title characters are generally in bold. The first paragraph generally starts on the second line of the text to be processed and generally begins with a space character (an indent of two character widths); the text from the first space character to the second space character, starting on the second line, is regarded as the first paragraph of the text to be processed. The last paragraph lies between the second-to-last space character and the last character of the text. The server obtains the positions of the space characters that precede the characters in the text to be processed in order to determine the positions of the first paragraph and the last paragraph. It then calls character recognition software to scan the text to be processed and obtain the characters in the title, first paragraph, and last paragraph of the text to be processed.

Step S20, based on the preset Chinese word segmentation system, segment the characters in the title, the first paragraph and the last paragraph to obtain a word segment set for the title, the first paragraph and the last paragraph, and update the part of speech of each target word segment in the word segment set to the keyword part of speech;

Chinese word segmentation refers to dividing a sequence of Chinese characters into individual words, and it is the basis of text mining. For a piece of Chinese input, successful word segmentation makes it possible to identify the meaning of the sentence automatically. The Chinese word segmentation system stores all known words, scans the text to be processed, finds all possible words, and then decides which words to output. For example, for the text to be processed "I am a student" (我是学生), the output words are: I / am / student (我/是/学生). The server calls the preset Chinese word segmentation system and uses it to segment the characters in the title, first paragraph and last paragraph of the text to be processed, then reads the resulting word segments and collects them into the word segment set of the title, first paragraph and last paragraph. The word segments in the word segment set are used as the target word segments, and the part of speech of each target word segment is identified as the keyword part of speech.
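The dictionary lookup described above can be illustrated with a short sketch. The application does not name a matching strategy, so the code below assumes forward maximum matching; `segment`, `lexicon`, and `max_len` are hypothetical names introduced only for illustration, and the three-word lexicon stands in for the stored vocabulary of a real segmentation system.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed strategy, not stated in the source).

    At each position the longest lexicon entry is taken; a single character
    is emitted as a fallback when nothing in the lexicon matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon for the "I am a student" example.
lexicon = {"我", "是", "学生"}
print(segment("我是学生", lexicon))
```

A production system would also attach a part-of-speech tag to every word it emits; the sketch returns bare words to keep the matching step visible.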

Step S30, through the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters corresponding to each target word segmentation are recorded in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;

When the server obtains the word segment set, it retrieves the part-of-speech score comparison table in the Chinese word segmentation system, obtains the part of speech of each target word segment in the word segment set, looks up the score value corresponding to each target word segment in the part-of-speech score comparison table, uses that score value as a weight parameter of the target word segment, and records it in the hash table.

Step S40, traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table.

The server starts to traverse the text to be processed. Specifically, the server calls character recognition software to traverse the text to be processed, recognizes all the characters in it, and segments the recognized characters based on the preset Chinese word segmentation system. Each word segment obtained is matched against the target word segments in the word segment set. When the word segment is a target word segment, the server records its word frequency, takes the word segments before and after the target word segment as associated word segments, records the word frequency of the associated word segments, and proceeds to step S30. When the word segment is not a target word segment, the server matches the next word segment, and continues until all word segments in the text to be processed have been matched;

Step S50, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the part of speech of each associated word segment, extract the target word segments and/or associated word segments whose total scores rank in the top five as the keywords of the text to be processed.

After the server has matched all the word segments in the text, it sorts the weight parameters recorded in the hash table for the target word segments and associated word segments from largest to smallest, extracts the word segments corresponding to the top five weight parameters, and takes them as the target keywords of the text to be processed.

In this embodiment, the title, first paragraph, and last paragraph of the text are taken to express its central idea. They are segmented to obtain the word frequency and part of speech of multiple target word segments, and the part of speech and word frequency of the associated word segments of those target word segments are then obtained from the text to be processed, which yields a total score for each target word segment and associated word segment. Using the part-of-speech scores and word frequencies of the target word segments in the central idea and of their associated word segments, the target or associated word segments with the highest total scores are taken as keywords, which reduces errors and improves the accuracy of text keywords.
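A minimal sketch of the traversal in step S40 follows. It rests on assumptions the source leaves open: the associated word segments are taken to be the immediate left and right neighbours of each target occurrence, and their word frequency is counted over the whole text. `record_associated` and the abbreviated `POS_SCORES` mapping are hypothetical names, and the input is assumed to be a list of `(word, part-of-speech tag)` pairs already produced by a segmentation system.

```python
from collections import Counter

# Abbreviated score mapping used only for this illustration.
POS_SCORES = {"n": 3.0, "v": 2.0, "a": 1.0, "p": 0.0}


def record_associated(tagged_words, targets, pos_scores):
    """Return {associated word: {"score": ..., "freq": ...}}.

    tagged_words: list of (word, pos) pairs for the full text.
    targets: collection of target word segments found in step S20.
    """
    words = [word for word, _pos in tagged_words]
    # If a word carries several tags, the last one seen is kept (simplification).
    pos_of = dict(tagged_words)
    associated = set()
    for i, word in enumerate(words):
        if word not in targets:
            continue
        # Assumed reading: only the immediate neighbours are associated.
        for j in (i - 1, i + 1):
            if 0 <= j < len(words) and words[j] not in targets:
                associated.add(words[j])
    counts = Counter(words)
    return {
        w: {"score": pos_scores.get(pos_of[w], 0.0), "freq": counts[w]}
        for w in associated
    }


tagged = [("本文", "n"), ("提取", "v"), ("关键词", "n"),
          ("在", "p"), ("方法", "n"), ("关键词", "n")]
print(record_associated(tagged, {"关键词"}, POS_SCORES))
```

Counting frequency over the whole text rather than only next to a target is one possible reading; the alternative would increment the count once per neighbouring occurrence.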
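The ranking in step S50 can be sketched as follows, using the rule stated earlier in the description that the total score is the word frequency multiplied by the part-of-speech score. The hash table is modelled as a plain Python dict; `top_keywords` and the sample entries are hypothetical and carry no data from the application.

```python
def top_keywords(table, k=5):
    """Rank hash-table entries by total score = score * freq and keep the top k."""
    totals = {word: entry["score"] * entry["freq"] for word, entry in table.items()}
    # Sort from largest to smallest total score; ties keep insertion order.
    return sorted(totals, key=lambda word: totals[word], reverse=True)[:k]


# Illustrative entries only.
table = {
    "a": {"score": 4.0, "freq": 3},
    "b": {"score": 4.0, "freq": 1},
    "c": {"score": 3.0, "freq": 2},
    "d": {"score": 2.0, "freq": 1},
    "e": {"score": 1.0, "freq": 1},
    "f": {"score": 0.0, "freq": 9},
}
print(top_keywords(table))
```

Because a part-of-speech score of 0 zeroes the product, frequent prepositions, punctuation and quantifiers cannot outrank scored words under this rule, however often they occur.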

Further, referring to FIG. 3, FIG. 3 is a second embodiment of the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S10 includes:

Step S11, receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;

Step S12: Use the characters between the first space character position and the second space character position as the title of the text to be processed, use the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and use the characters between the N-(N-1) space character position and the N space character position as the last paragraph of the text to be processed;

In step S13, a preset character recognition program is called to recognize the characters in the title, the first paragraph and the last paragraph.

After receiving the text to be processed sent by the terminal, the server obtains the positions of the space characters and the number N of space characters in the text to be processed. Specifically, the server receives the text to be processed, scans it, finds the blank space on each line, and records the positions and the number N of the blanks. The characters between the first blank position and the second blank position are taken as the title of the text to be processed; the title is generally located on the first line of the text and is generally preceded by two blank characters on that line. The characters between the second blank position and the third blank position are taken as the first paragraph of the text to be processed. The characters between the N-(N-1)-th blank position and the N-th blank position are taken as the last paragraph of the text to be processed. If the final character of the text to be processed is not a blank character but a special symbol such as ".", "!" or "?", that symbol is treated as a blank character. The server then invokes preset character recognition software to recognize the title, first paragraph, and last paragraph of the text to be processed, and obtains all characters in them.

In this embodiment, the number and positions of the space characters in the text to be processed are used to locate its title, first paragraph, and last paragraph, and the characters in those three parts are then obtained through a character recognition program. The space characters make it possible to divide the text to be processed into the title, the first paragraph and the last paragraph quickly.
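The blank-position rule of steps S11 to S13 can be sketched as below. Several points are assumptions rather than statements from the application: each block is taken to open with an indent of two full-width spaces, the end of the text is counted as the final blank position (mirroring the treatment of terminal punctuation), and the last paragraph is read as the span between the second-to-last and last positions. `split_key_sections` and `INDENT` are hypothetical names.

```python
import re

# Assumed delimiter: two full-width spaces opening the title and each paragraph.
INDENT = "\u3000\u3000"


def split_key_sections(text):
    """Return (title, first_paragraph, last_paragraph) from blank positions."""
    positions = [m.start() for m in re.finditer(INDENT, text)]
    # The end of the text is counted as the last blank position.
    positions.append(len(text))
    n = len(positions)
    if n <= 3:
        raise ValueError("expected more than 3 blank positions")

    def between(a, b):
        # Characters after the a-th delimiter and before the b-th position.
        return text[positions[a] + len(INDENT):positions[b]].strip()

    title = between(0, 1)
    first_paragraph = between(1, 2)
    last_paragraph = between(n - 2, n - 1)
    return title, first_paragraph, last_paragraph


sample = "\u3000\u3000标题\n\u3000\u3000首段。\n\u3000\u3000中段。\n\u3000\u3000末段。"
print(split_key_sections(sample))
```

Real documents rarely delimit blocks this cleanly, so a practical version would likely fall back to line breaks or markup when the indent convention is absent.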

Referring to Fig. 4, Fig. 4 is a third embodiment provided by the method for extracting locally optimized keywords in this application. Based on the embodiment shown in Fig. 2 above, step S20 includes:

Step S21: When the characters in the title, the first paragraph and the last paragraph are recognized, the preset Chinese word segmentation system is activated to classify the characters in the title, the first paragraph and the last paragraph according to nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words. The part of speech is divided;

Step S22: Obtain the part-of-speech scores in the part-of-speech score comparison table of the characters whose parts of speech are nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and neologisms in the Chinese word segmentation system, and determine the characters with the part-of-speech score greater than 0 as the target word segmentation;

In step S23, the target word segmentation is performed into a word segmentation set, and the part of speech of the target word segmentation in the word segmentation set is identified as the keyword part of speech.

Once all characters in the title, first paragraph and last paragraph of the text to be processed have been recognized, the server starts the preset Chinese word segmentation system, which segments the recognized characters automatically. Specifically, the Chinese word segmentation system records nouns, verbs, adjectives, prepositions, punctuation, quantifiers and new words, and matches the acquired characters against these recorded entries. For example, it first takes one character and matches it against the recorded entries; when the match fails, it takes two characters and matches again, and so on until a match succeeds. The server obtains the nouns, verbs, adjectives, prepositions, punctuation, quantifiers and new words that the Chinese word segmentation system has segmented from the title, first paragraph and last paragraph, looks up their part-of-speech scores in the part-of-speech score comparison table of the Chinese word segmentation system, and determines those whose part-of-speech score is greater than 0 as target segmented words. The target segmented words are then collected into a word segmentation set with duplicates removed (for example, when the same noun occurs twice, only one copy is kept), and the part of speech of each target segmented word in the set, whether it was originally a noun, verb, adjective, preposition, punctuation, quantifier or new word, is updated to the keyword part of speech.
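As an illustration only, the incremental matching described in this embodiment can be sketched in Python. The lexicon entries, the `MAX_WORD_LEN` bound, the fallback tag `x` and the function name `segment` are assumptions made for this sketch; the application does not specify the internals of the Chinese word segmentation system.

```python
# Hypothetical lexicon: word -> part-of-speech tag (illustrative entries only).
LEXICON = {"关键词": "n", "提取": "v", "的": "p", "，": "w"}
MAX_WORD_LEN = 4  # assumed upper bound on how far the window may grow


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Grow a window one character at a time until it matches a lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(1, max_len + 1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                tokens.append((candidate, lexicon[candidate]))
                i += len(candidate)
                break
        else:
            # No entry matched within max_len: emit one character as unknown.
            tokens.append((text[i], "x"))
            i += 1
    return tokens
```

Because the window grows from one character upward, this sketch returns the shortest matching entry at each position, which follows the order of matching described in this embodiment.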

In this embodiment, the title, first paragraph and last paragraph are segmented by the preset Chinese word segmentation system to obtain individual segmented words. The part-of-speech score of each segmented word is then obtained from the part-of-speech score comparison table, segmented words with a score greater than 0 are determined as target segmented words, and their part of speech is set to the keyword part of speech. In this way, the target segmented words in the title, first paragraph and last paragraph can be extracted quickly and accurately.
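A minimal Python sketch of the target selection in step S23 follows. The scores are those listed in the part-of-speech score comparison table of this application; the tag strings, the function name `select_target_words` and the use of a dictionary to hold the word segmentation set are assumptions made for illustration.

```python
# Scores from the part-of-speech score comparison table of this application.
POS_SCORES = {"n": 3.0, "v": 2.0, "a": 1.0, "p": 0.0,
              "w": 0.0, "m": 0.0, "kw": 4.0, "nw": 3.0}


def select_target_words(tagged_words, pos_scores=POS_SCORES):
    """Keep words whose part-of-speech score is > 0, deduplicate, relabel as 'kw'.

    tagged_words: (word, pos) pairs from the title, first and last paragraph.
    Returns a dict mapping each target word to the keyword part of speech.
    """
    targets = {}
    for word, pos in tagged_words:
        if pos_scores.get(pos, 0.0) > 0 and word not in targets:
            targets[word] = "kw"  # part of speech updated to keyword
    return targets
```

Prepositions, punctuation and quantifiers score 0.0 in the table, so they never enter the set under this reading.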

Referring to FIG. 5, FIG. 5 is a fourth embodiment provided by the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S30 includes:

Step S31, retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

Step S32: using the target segmented words as search conditions, look up the word frequency of each target segmented word in the title, first paragraph and last paragraph, and record the score value and word frequency of each target segmented word in a hash table.

The server retrieves the part-of-speech score comparison table in the preset Chinese word segmentation system. The table records the part-of-speech scores of nouns, verbs, adjectives, prepositions, punctuation, quantifiers, keywords and new words, as follows:

Part of speech	Score
Noun (n)	3.0
Verb (v)	2.0
Adjective (a)	1.0
Preposition (p)	0.0
Punctuation (w)	0.0
Quantifier (m)	0.0
Keyword (kw)	4.0
New word (nw)	3.0

By looking up the part-of-speech score comparison table, the server obtains the score value corresponding to the keyword part of speech, 3.0, counts the word frequency of each target segmented word of the word segmentation set in the title, first paragraph and last paragraph, and records the word frequency and the corresponding keyword score value of each target segmented word in the hash table.

In this embodiment, the part-of-speech score of each target segmented word is obtained from the part-of-speech score comparison table, and its word frequency in the title, first paragraph and last paragraph is obtained by index lookup. Both values are recorded in the hash table, so that the word frequency and part-of-speech score of each target segmented word in the title, first paragraph and last paragraph can be retrieved quickly.
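The recording in steps S31 and S32 can be sketched as follows, assuming a Python dictionary stands in for the preset hash table. The function name `record_targets`, the entry layout with `pos`, `score` and `freq` fields, and the choice to pass the keyword score in as a parameter rather than hard-code it are all assumptions of this sketch.

```python
from collections import Counter


def record_targets(target_words, local_tokens, keyword_score):
    """Record each target word's keyword score and local word frequency.

    target_words: iterable of target segmented words (the word segmentation set).
    local_tokens: segmented words of the title, first and last paragraph, in order.
    keyword_score: score of the keyword part of speech from the comparison table.
    """
    counts = Counter(local_tokens)  # word frequency within the three regions
    table = {}
    for word in target_words:
        table[word] = {"pos": "kw", "score": keyword_score, "freq": counts[word]}
    return table
```

Keying the dictionary by the segmented word gives constant-time lookup on average, which is the property the later traversal relies on when it checks whether a word has already been recorded.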

Referring to FIG. 6, FIG. 6 is a fifth embodiment of the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S40 includes:

Step S41: traverse the text to be processed with a preset character recognition program to recognize the characters in the text, the preset Chinese word segmentation system dividing those characters into multiple segmented words;

Step S42: extract a first segmented word from the text to be processed, and judge whether the first segmented word is a target segmented word in the word segmentation set;

Step S43: when the first segmented word is a target segmented word in the word segmentation set, determine that the second segmented word preceding the first segmented word and the third segmented word following it are associated segmented words of the target segmented word, and obtain the part of speech and word frequency of the associated segmented words;

Step S44: look up the part-of-speech score comparison table in the Chinese word segmentation system to obtain the part-of-speech score of each associated segmented word, and record the part-of-speech score and word frequency of each associated segmented word in the hash table.

The server starts the preset character recognition software to traverse the text to be processed and recognize its characters, and the preset Chinese word segmentation system divides those characters into multiple segmented words. The server then extracts a first segmented word from the text and determines whether it is a target segmented word in the word segmentation set. When it is, the server reads the second segmented word preceding it and the third segmented word following it. Specifically, the server obtains the positions of the segmented words produced by the Chinese word segmentation system; when the first segmented word is a target segmented word, it reads the part of speech and word frequency of the second and third segmented words, which are the associated segmented words, looks up their parts of speech in the part-of-speech score comparison table of the Chinese word segmentation system to obtain the corresponding part-of-speech scores, and records the part-of-speech score and word frequency of each associated segmented word in the hash table. When the second segmented word before the first segmented word, or the third segmented word after it, is a blank character or a special symbol, that segmented word is not read, and the server moves on to the next segmented word.

When the server determines that the first segmented word is not a target segmented word in the word segmentation set, it determines whether the first segmented word is an associated segmented word of a target segmented word. Specifically, once the characters of the first segmented word have been recognized, they are compared with the characters of the target segmented words. When they differ, they are compared with the characters of the associated segmented words of the target segmented words to determine whether the first segmented word is an associated segmented word. When the characters of the first segmented word match those of an associated segmented word, the part of speech and word frequency of the first segmented word are recorded in the hash table, with the word frequency counted once.

In this embodiment, the preset character recognition software is started to traverse the text to be processed and recognize its characters, and the preset Chinese word segmentation system divides those characters into multiple segmented words. A first segmented word is extracted from the text, and it is determined whether the first segmented word is a target segmented word in the word segmentation set. When it is, the second segmented word before it and the third segmented word after it are read, so that the associated segmented words of the target segmented words in the text to be processed are obtained quickly.
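The traversal of steps S41 to S44 can be sketched in Python under some simplifying assumptions, all of which are this sketch's own: the full text has already been segmented into `(word, pos)` pairs, a neighbour that is itself a target segmented word is left untouched, blank characters and special symbols are detected with `str.isalnum`, and the word frequency of an associated segmented word is taken as its number of occurrences in the whole text, which stands in for the occurrence-by-occurrence counting described above.

```python
from collections import Counter


def is_blank_or_symbol(word):
    """Assumed test: no letter, digit or CJK character in the word."""
    return not any(ch.isalnum() for ch in word)


def record_associated(tokens, target_words, pos_scores, table=None):
    """Record the neighbours of each target word as associated words.

    tokens: (word, pos) pairs for the full text, in order.
    target_words: the word segmentation set built from title/first/last paragraph.
    pos_scores: part-of-speech tag -> score, from the comparison table.
    """
    table = {} if table is None else table
    counts = Counter(word for word, _ in tokens)
    for i, (word, _) in enumerate(tokens):
        if word not in target_words:
            continue
        for j in (i - 1, i + 1):  # word before and word after the target
            if not 0 <= j < len(tokens):
                continue
            neighbour, pos = tokens[j]
            if neighbour in target_words or is_blank_or_symbol(neighbour):
                continue  # assumption: targets keep their own entry
            table[neighbour] = {"pos": pos,
                                "score": pos_scores.get(pos, 0.0),
                                "freq": counts[neighbour]}
    return table
```

Passing the hash table produced for the target segmented words as `table` lets both kinds of entry share one structure, as the embodiment describes.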

Referring to FIG. 7, FIG. 7 is a seventh embodiment of the method for extracting locally optimized keywords according to this application. Based on the embodiment shown in FIG. 2, after step S50, the method further includes:

Step S51: Obtain preset calculation rules, and calculate the total score of each target word segmentation and associated word segmentation in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;

Step S52: sort the total scores in the hash table from largest to smallest or from smallest to largest, and extract the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.

The server obtains the preset calculation rule and uses it to calculate the total score of each target segmented word and associated segmented word in the hash table. Specifically, for any target segmented word, the server obtains its word frequency, that is, the number of times the target segmented word occurs in the text to be processed, together with its part-of-speech score, and multiplies the word frequency by the part-of-speech score to obtain the total score of that target segmented word. After the total scores of all target segmented words and associated segmented words in the hash table have been calculated, they are sorted from largest to smallest or from smallest to largest, and the five target segmented words or associated segmented words with the largest total scores are extracted as the keywords of the text to be processed.

In this embodiment, the server obtains the preset calculation rule, calculates the total score of each target segmented word and associated segmented word in the hash table according to that rule, sorts the total scores from largest to smallest or from smallest to largest, and extracts the five target segmented words or associated segmented words with the largest total scores as the keywords of the text to be processed, thereby reducing errors and improving the accuracy of the extracted text keywords.
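The scoring rule of steps S51 and S52 (total score equals word frequency multiplied by part-of-speech score, followed by selection of the top five) can be sketched as below. The entry layout with `freq` and `score` fields and the function name `top_keywords` are assumptions of this sketch; ties are broken by insertion order only because Python's sort is stable, which the application does not prescribe.

```python
def top_keywords(table, k=5):
    """Return the k words with the highest total score (freq * score).

    table: word -> {"freq": int, "score": float, ...}, holding both target
    and associated segmented words.
    """
    totals = {word: entry["freq"] * entry["score"] for word, entry in table.items()}
    # Sort from largest to smallest total score.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```

Sorting in descending order and taking the first five entries is equivalent to sorting in ascending order and taking the last five, which is why the embodiment allows either direction.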

In addition, an embodiment of the present application also proposes a device for extracting locally optimized keywords. The device for extracting locally optimized keywords includes:

The recognition unit is used to receive the text to be processed, and to recognize the characters in the title, first paragraph and the last paragraph of the text to be processed;

The update unit is used to segment the characters in the title, the first paragraph and the last paragraph based on the preset Chinese word segmentation system, obtain the word segmentation set of the title, the first paragraph and the last paragraph, and update the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech;

The first recording unit is used to record the weight parameters corresponding to each target word segmentation in a preset hash table through the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;

The second recording unit is used to traverse the text to be processed, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameters of the related word segmentation in the hash table;

The extraction unit is used to extract the top five target word segmentation and/or related word segmentation as the keywords of the text to be processed according to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each associated word segmentation in the hash table.

Further, the above-mentioned recognition unit is specifically configured to: receive the text to be processed, and obtain the position of the space character in the text to be processed and the number N of space characters, where the number of space characters N is greater than 3;

Further, the above-mentioned update unit is specifically used for: when the characters in the title, the first paragraph and the last paragraph are recognized, activating the preset Chinese word segmentation system to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers and new words;

Further, the above-mentioned first recording unit is specifically used to: retrieve the part-of-speech score comparison table in the preset Chinese word segmentation program, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

The target word segmentation is used as the search condition to index the word frequency of each target word segmentation in the title, first paragraph and last paragraph, and the score value and word frequency of each target word segmentation are recorded in the hash table.

Further, the second recording unit includes: a recognition subunit for traversing the text to be processed through preset character recognition software and recognizing characters in the text to be processed, the preset Chinese word segmentation system dividing the characters in the text to be processed into multiple word segmentations;

The first judgment subunit is used for extracting the first word segmentation in the text to be processed, and judging whether the first word segmentation is the target word segmentation in the word segmentation set;

The first determination subunit is used to determine, when the first participle is the target participle in the participle set, that the second participle in front of the first participle and the third participle after the first participle are related participles of the target participle, and to obtain the part of speech and word frequency of the related participles;

The acquiring subunit is used to obtain the part-of-speech score corresponding to the related word segmentation by comparing the part-of-speech score comparison table in the Chinese word segmentation system, and record the part-of-speech score and word frequency of the related word segmentation in the hash table.

Further, the above-mentioned device for extracting locally optimized keywords further includes:

The second judgment subunit is used for judging whether the first participle is the related participle of the target participle when the first participle is not the target participle in the word segmentation set;

The second determination subunit is used to record the part of speech and word frequency of the first participle in the hash table when determining that the first participle is the related participle of the target participle.

Further, the above extraction unit is specifically used for:

The implementation of the functions of each unit in the device for extracting locally optimized keywords corresponds to the steps in the embodiment of the method for extracting locally optimized keywords, and the functions and implementation processes will not be repeated here.

In addition, the embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:

Traverse the text to be processed, obtain the relevant participle of the target word segmentation and the part of speech of the relevant participle, and record the weight parameters of the relevant participle in the hash table;

It should be noted that, in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article or system. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article or system that includes the element.

The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.

Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disk) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of this application.

The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for extracting locally optimized keywords. The method for extracting locally optimized keywords includes:

Receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;

Based on a preset Chinese word segmentation system, segmenting the characters in the title, the first paragraph and the last paragraph to obtain a word segmentation set of the title, the first paragraph and the last paragraph, and updating the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech;

Using the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameter corresponding to each target word segmentation is recorded in a preset hash table, where the weight parameters are part-of-speech score and word frequency;

Traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table;

According to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each related word segmentation in the hash table, extracting the five target word segmentations and/or related word segmentations with the highest total scores as the keywords of the text to be processed.
The method for extracting locally optimized keywords according to claim 1, wherein said receiving the text to be processed and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed includes:

Receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;

Using the characters between the first space character position and the second space character position as the title of the text to be processed, and using the characters between the second space character position and the third space character position as the first paragraph of the text to be processed;

Using the characters between the N-(N-1)th space character position and the Nth space character position as the last paragraph of the text to be processed;

Call the preset character recognition program to recognize the characters in the title, first paragraph and last paragraph.
The method for extracting locally optimized keywords according to claim 2, wherein said segmenting the characters in the title, the first paragraph and the last paragraph based on the preset Chinese word segmentation system to obtain the word segmentation set of the title, the first paragraph and the last paragraph, and updating the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech, includes:

When the characters in the title, the first paragraph and the last paragraph are recognized, activating the preset Chinese word segmentation system to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers and new words;

Obtaining, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, quantifier or new word, and determining the characters with a part-of-speech score greater than 0 as the target word segmentation;

Grouping the target word segmentation into a word segmentation set, and identifying the part of speech of the target word segmentation in the word segmentation set as the keyword part of speech.
The method for extracting locally optimized keywords according to claim 3, wherein said recording the weight parameters corresponding to each target word segmentation in a preset hash table through the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters being part-of-speech score and word frequency, includes:

Retrieving the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtaining the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

Using the target word segmentation as the search condition, indexing the word frequency of each target word segmentation in the title, the first paragraph and the last paragraph, and recording the score value and word frequency of each target word segmentation in the hash table.
The method for extracting locally optimized keywords according to claim 4, wherein said traversing the text to be processed, obtaining the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and recording the weight parameters of the related word segmentation in the hash table, includes:

Traversing the text to be processed by the preset character recognition program to recognize characters in the text to be processed, and the preset Chinese word segmentation system divides the characters in the text to be processed into multiple word segmentation;

Extracting the first word segmentation in the to-be-processed text, and judging whether the first word segmentation is the target word segmentation in the word segmentation set;

When the first participle is the target participle in the word segmentation set, determining that the second participle in front of the first participle and the third participle after the first participle are related participles of the target participle, and obtaining the part of speech and word frequency of the related participles;

By comparing the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech score corresponding to the related word segmentation is obtained, and the part-of-speech score and word frequency of the related word segmentation are recorded in the hash table.
The method for extracting locally optimized keywords according to claim 4, wherein after extracting the first word segmentation in the text to be processed and judging whether the first word segmentation is the target word segmentation in the word segmentation set, the method further comprises:

When the first participle is not the target participle in the word segmentation set, judging whether the first participle is an associated participle of the target participle;

When it is determined that the first participle is an associated participle of the target participle, the part of speech and word frequency of the first participle are recorded in the hash table.
The method for extracting locally optimized keywords according to any one of claims 1 to 6, wherein said extracting, according to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each related word segmentation in the hash table, the five target word segmentations and/or related word segmentations with the highest total scores as the keywords of the text to be processed, includes:

Obtaining a preset calculation rule, and calculating a total score of each of the target word segmentation and the associated word segmentation in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;

Sorting the total scores in the hash table from largest to smallest or from smallest to largest, and extracting the five target word segmentations and/or related word segmentations with the highest total scores as the keywords of the text to be processed.
A device for extracting locally optimized keywords. The device for extracting locally optimized keywords includes:

The recognition unit is used to receive the text to be processed, and to recognize the characters in the title, the first paragraph and the last paragraph of the text to be processed;

The update unit is used to segment the characters in the title, the first paragraph and the last paragraph based on a preset Chinese word segmentation system, obtain the word segmentation set of the title, the first paragraph and the last paragraph, and update the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech;

The first recording unit is configured to record the weight parameter corresponding to each target word segmentation in a preset hash table through the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are part-of-speech score and word frequency;

The second recording unit is used to traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table;

The extraction unit is used to extract, according to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each related word segmentation in the hash table, the five target word segmentations and/or related word segmentations with the highest total scores as the keywords of the text to be processed.
The device for extracting locally optimized keywords according to claim 8, wherein the recognition unit is specifically configured to:

Receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;

Use the characters between the first space character position and the second space character position as the title of the text to be processed, use the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and use the characters between the N-(N-1)th space character position and the Nth space character position as the last paragraph of the text to be processed;

Call the preset character recognition program to recognize the characters in the title, first paragraph and last paragraph.
The device for extracting locally optimized keywords according to claim 9, wherein the updating unit is specifically configured to:

When the characters in the title, the first paragraph and the last paragraph are recognized, activate the preset Chinese word segmentation system to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers and new words;

Obtain, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, quantifier or new word, and determine the characters with a part-of-speech score greater than 0 as the target word segmentation;

Group the target word segmentation into a word segmentation set, and identify the part of speech of the target word segmentation in the word segmentation set as the keyword part of speech.
The device for extracting locally optimized keywords according to claim 10, wherein the first recording unit is specifically configured to:

Retrieve the part-of-speech score comparison table in the preset Chinese word segmentation program, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

Use the target word segmentation as the search condition, index the word frequency of each target word segmentation in the title, the first paragraph and the last paragraph, and record the score value and word frequency of each target word segmentation in the hash table.
The device for extracting locally optimized keywords according to claim 11, wherein the second recording unit comprises:

The recognition subunit is used to traverse the text to be processed through the preset character recognition software and recognize characters in the text to be processed, the preset Chinese word segmentation system dividing the characters in the text to be processed into multiple word segmentations;

The first judgment subunit is used to extract the first word segmentation in the text to be processed, and judge whether the first word segmentation is the target word segmentation in the word segmentation set;

The first determination subunit is used to determine, when the first word segmentation is the target word segmentation in the word segmentation set, that the second word segmentation in front of the first word segmentation and the third word segmentation after the first word segmentation are related word segmentations of the target word segmentation, and to obtain the part of speech and word frequency of the related word segmentations;

The acquiring subunit is used to obtain the part-of-speech score corresponding to the related word segmentation by comparing the part-of-speech score comparison table in the Chinese word segmentation system, and record the part-of-speech score and word frequency of the related word segmentation in the hash table.
The device for extracting locally optimized keywords according to claim 11, the device for extracting locally optimized keywords further comprises:

The second judgment subunit is used for judging whether the first participle is the related participle of the target participle when the first participle is not the target participle in the word segmentation set;

The second determination subunit is used to record the part of speech and word frequency of the first participle in the hash table when determining that the first participle is the related participle of the target participle.
The device for extracting locally optimized keywords according to any one of claims 8-13, wherein the extracting unit is specifically configured to:

Obtain a preset calculation rule, and calculate a total score of each of the target word segmentation and the associated word segmentation in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;

Sort the total scores in the hash table from largest to smallest or from smallest to largest, and extract the five target word segmentations and/or related word segmentations with the highest total scores as the keywords of the text to be processed.
A device for extracting locally optimized keywords includes a memory, a processor, and a computer program stored on the memory and running on the processor, and the processor implements the following steps when the processor executes the computer program:

Receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;

Based on the preset Chinese word segmentation system, segment the characters in the title, the first paragraph and the last paragraph to obtain the word segmentation set of the title, the first paragraph and the last paragraph, and update the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech;

Using the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameter corresponding to each target word segmentation is recorded in a preset hash table, where the weight parameters are part-of-speech score and word frequency;

Traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table;

According to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each related participle in the hash table, the target participles and/or related participles whose total scores rank in the top five are extracted as the keywords of the text to be processed.
The device for extracting locally optimized keywords according to claim 15, wherein when the processor executes the computer program to implement the receiving of the text to be processed and the recognizing of the characters in the title, the first paragraph and the last paragraph of the text to be processed, the following steps are included:

Receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;

Use the characters between the first space character position and the second space character position as the title of the text to be processed, use the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and use the characters between the N-(N-1) space character position and the N space character position as the last paragraph of the text to be processed;

Call the preset character recognition program to recognize the characters in the title, first paragraph and last paragraph.
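The space-based splitting above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_title_first_last` is hypothetical, the space character is taken to be the ASCII space, and the last paragraph is assumed to lie between the (N-1)-th and N-th space characters, which is one reading of the boundary given above rather than something the text states outright.

```python
from typing import List, Tuple


def split_title_first_last(text: str) -> Tuple[str, str, str]:
    """Split text into (title, first paragraph, last paragraph) by spaces."""
    # Positions of every space character in the text.
    positions: List[int] = [i for i, ch in enumerate(text) if ch == " "]
    n = len(positions)
    if n <= 3:
        raise ValueError("expected more than 3 space characters")
    # Title: characters between the 1st and 2nd space characters.
    title = text[positions[0] + 1:positions[1]]
    # First paragraph: characters between the 2nd and 3rd space characters.
    first_paragraph = text[positions[1] + 1:positions[2]]
    # Assumption: last paragraph lies between the (N-1)-th and N-th spaces.
    last_paragraph = text[positions[n - 2] + 1:positions[n - 1]]
    return title, first_paragraph, last_paragraph
```

For example, under these assumptions a text laid out as `" T P1 P2 P3 "` would yield `T` as the title, `P1` as the first paragraph and `P3` as the last paragraph.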
The device for extracting locally optimized keywords according to claim 16, wherein when the processor executes the computer program to implement the segmenting, based on the preset Chinese word segmentation system, of the characters in the title, the first paragraph and the last paragraph, and the updating of the part of speech of the target participle in the word segmentation set to the keyword part of speech, the following steps are included:

When the characters in the title, the first paragraph and the last paragraph are recognized, the preset Chinese word segmentation system is started to segment the characters in the title, the first paragraph and the last paragraph according to the parts of speech noun, verb, adjective, preposition, punctuation, quantifier and new word;

Obtain, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, quantifier or new word, and determine the characters with a part-of-speech score greater than 0 as the target word segmentation;

The target word segmentations are collected into a word segmentation set, and the part of speech of each target word segmentation in the word segmentation set is marked as the keyword part of speech.
The device for extracting locally optimized keywords according to claim 17, wherein when the processor executes the computer program to implement the recording, in a preset hash table, of the weight parameter corresponding to each target word segmentation by means of the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters being part-of-speech score and word frequency, the following steps are included:

Retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;

Use each target word segmentation as the search condition, look up the word frequency of each target word segmentation in the title, the first paragraph and the last paragraph, and record the score value and word frequency of each target word segmentation in the hash table.
The device for extracting locally optimized keywords according to claim 18, wherein when the processor executes the computer program to implement the traversing of the text to be processed, the obtaining of the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and the recording of the weight parameter of the related word segmentation in the hash table, the following steps are included:
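Selecting target words and recording their weight parameters might look like the sketch below. It assumes the title, first paragraph and last paragraph have already been segmented into (word, part of speech) pairs by some segmentation system. The score values in `POS_SCORES` and the function name `record_targets` are hypothetical; the text does not give the contents of the part-of-speech score comparison table.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical score table; real values come from the segmentation system.
POS_SCORES: Dict[str, float] = {
    "noun": 3.0,
    "verb": 2.0,
    "adjective": 1.0,
    "preposition": 0.0,
    "punctuation": 0.0,
    "quantifier": 0.0,
    "new word": 3.0,
}


def record_targets(tokens: List[Tuple[str, str]]) -> Dict[str, Tuple[float, int]]:
    """Map each target word to (part-of-speech score, word frequency)."""
    # Word frequency within the title, first paragraph and last paragraph.
    freq = Counter(word for word, _ in tokens)
    table: Dict[str, Tuple[float, int]] = {}
    for word, pos in tokens:
        score = POS_SCORES.get(pos, 0.0)
        # Only words whose part-of-speech score is greater than 0 are targets.
        if score > 0:
            table[word] = (score, freq[word])
    return table
```

A plain dictionary plays the role of the hash table here, since Python dictionaries are hash-based.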

Traverse the text to be processed, recognize the characters in the text to be processed by means of the preset character recognition program, and divide the characters in the text to be processed into multiple word segmentations by means of the preset Chinese word segmentation system;

Extracting the first word segmentation in the text to be processed, and judging whether the first word segmentation is the target word segmentation in the word segmentation set;

When the first participle is the target participle in the word segmentation set, determine that the second participle in front of the first participle and the third participle after the first participle are related participles of the target participle, and obtain the part of speech and word frequency of the related participles;

By comparing the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech score corresponding to the related word segmentation is obtained, and the part-of-speech score and word frequency of the related word segmentation are recorded in the hash table.
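The traversal described above, in which the participles immediately before and after a target participle are treated as related, can be sketched as follows. The function name `record_related`, the score values in `POS_SCORES`, and the choice to leave existing target entries untouched when a neighbour is itself a target are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical score table; real values come from the segmentation system.
POS_SCORES: Dict[str, float] = {"noun": 3.0, "verb": 2.0, "adjective": 1.0}


def record_related(
    tokens: List[Tuple[str, str]],
    targets: Set[str],
    table: Dict[str, Tuple[float, int]],
) -> Dict[str, Tuple[float, int]]:
    """Record neighbours of target words as related words in the table."""
    # Word frequency over the full text to be processed.
    freq = Counter(word for word, _ in tokens)
    for i, (word, _) in enumerate(tokens):
        if word not in targets:
            continue
        # The participle in front of and the participle after the target.
        for j in (i - 1, i + 1):
            if 0 <= j < len(tokens):
                neighbour, pos = tokens[j]
                # Assumption: a neighbour that is itself a target keeps its entry.
                if neighbour not in targets:
                    table[neighbour] = (POS_SCORES.get(pos, 0.0), freq[neighbour])
    return table
```

The table passed in is expected to already hold the target entries, so that related words and target words end up in the same hash table for the later ranking step.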
A computer-readable storage medium storing computer instructions, wherein when the computer instructions are executed on a computer, the computer is caused to execute the following steps:

Receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;

Based on the preset Chinese word segmentation system, segment the characters in the title, the first paragraph and the last paragraph to obtain the word segmentation set of the title, the first paragraph and the last paragraph, and update the part of speech of the target word segmentation in the word segmentation set to the keyword part of speech;

Using the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameter corresponding to each target word segmentation is recorded in a preset hash table, where the weight parameters are part-of-speech score and word frequency;

Traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table;

According to the keyword part of speech of the target word segmentation and the weight parameters of the part of speech of each related participle in the hash table, the target participles and/or related participles whose total scores rank in the top five are extracted as the keywords of the text to be processed.