WO2021051599A1 - Method and apparatus for extracting locally optimized keywords, device and storage medium - Google Patents

Method and apparatus for extracting locally optimized keywords, device and storage medium

Publication number
WO2021051599A1
Authority
WO
WIPO (PCT)
Prior art keywords
word segmentation
speech
text
processed
target
Prior art date
Application number
PCT/CN2019/118273
Other languages
French (fr)
Chinese (zh)
Inventor
陈婷婷
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051599A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of big data technology, and in particular to a method, device, server, and computer-readable storage medium for extracting locally optimized keywords.
  • Keywords represent the central idea of a text and play an important role in text retrieval and text classification, so keyword extraction technology is valued by a large number of scholars. Traditional keyword methods based on statistical features pay too much attention to attributes of the word segments, such as part of speech, word frequency, and position, and ignore the overall central idea of the article. At present, most keyword extraction algorithms add features such as the association relationships between word segments to the traditional statistical feature algorithm in order to obtain the final keywords.
  • The main purpose of this application is to provide a method for extracting locally optimized keywords. It aims to solve the technical problem that prior-art keyword methods, being based only on statistical features, pay too much attention to attributes of the word segments, such as part of speech, word frequency, and position, and ignore the overall central idea of the article, which leads to inaccurate keywords.
  • the present application provides a method for extracting locally optimized keywords.
  • the method for extracting locally optimized keywords includes:
  • the weight parameters corresponding to each target word segment are recorded in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
  • the five target word segments and/or related word segments with the highest total scores are extracted as the keywords of the text to be processed.
  • the present application also provides a device for extracting locally optimized keywords, the device for extracting locally optimized keywords includes:
  • the recognition unit is used to receive the text to be processed, and to recognize the characters in the title, the first paragraph and the last paragraph of the text to be processed;
  • the update unit is used to segment the characters in the title, the first paragraph and the end based on a preset Chinese word segmentation system, obtain the word segmentation set of the title, the first paragraph and the end, and update the part of speech of the target word segments in the word segmentation set to the keyword part of speech;
  • the first recording unit is configured to record the weight parameter corresponding to each target word segmentation in a preset hash table through the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are part-of-speech score and word frequency;
  • the second recording unit is used to traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table;
  • the extraction unit is used to extract, according to the weight parameters in the hash table for the keyword part of speech of the target word segments and the part of speech of each related word segment, the five target word segments and/or related word segments with the highest total scores as the keywords of the text to be processed.
  • The present application also provides a server. The server includes a memory, a processor, and a program for extracting locally optimized keywords that is stored on the memory and runs on the processor. When the program is executed by the processor, the steps of the method for extracting locally optimized keywords described above are implemented.
  • The present application also provides a computer-readable storage medium in which computer instructions are stored. When the computer instructions are run on a computer, the computer can execute the above-mentioned method for extracting locally optimized keywords.
  • In the method, device, server, and computer-readable storage medium for extracting locally optimized keywords proposed in the embodiments of the present application, the text to be processed is received and the characters in its title, first paragraph, and last paragraph are recognized; the characters in the title, first paragraph, and last paragraph are segmented based on the preset Chinese word segmentation system to obtain the word segmentation set of the title, first paragraph, and last paragraph, and the part of speech of the target word segments in the set is updated to the keyword part of speech; through the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment, namely the part-of-speech score and the word frequency, are recorded in a preset hash table; the text to be processed is traversed to obtain the related word segments of the target word segments and their parts of speech, and the weight parameters of the related word segments are recorded in the hash table; and, according to the weight parameters in the hash table for the keyword part of speech of the target word segments and the part of speech of each related word segment, the five target word segments and/or related word segments with the highest total scores are extracted as the keywords of the text to be processed.
  • FIG. 1 is a schematic diagram of a server structure of a hardware operating environment involved in a solution of an embodiment of the application
  • FIG. 2 is a schematic flowchart of a first embodiment of a method for extracting locally optimized keywords according to this application;
  • FIG. 3 is a schematic diagram of the detailed flow of step S10 in FIG. 2;
  • FIG. 4 is a schematic diagram of the detailed flow of step S20 in FIG. 2;
  • FIG. 5 is a detailed flowchart of step S30 in FIG. 2;
  • FIG. 6 is a schematic flowchart of a second embodiment of a method for extracting locally optimized keywords according to this application;
  • FIG. 7 is a detailed flowchart of step S50 in FIG. 2.
  • The main solution of the embodiment of this application is as follows: receive the text to be processed, and recognize the characters in its title, first paragraph, and last paragraph; based on the preset Chinese word segmentation system, segment the characters in the title, first paragraph, and last paragraph, obtain the word segmentation set of the title, first paragraph, and last paragraph, and update the part of speech of the target word segments in the set to the keyword part of speech; through the part-of-speech score comparison table in the Chinese word segmentation system, record the weight parameters corresponding to each target word segment in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency; traverse the text to be processed, obtain the related word segments of the target word segments and their parts of speech, and record the weight parameters of the related word segments in the hash table; and, according to the weight parameters in the hash table for the keyword part of speech of the target word segments and the part of speech of each related word segment, extract the five target word segments and/or related word segments with the highest total scores as the keywords of the text to be processed.
  • This application provides a solution.
  • The target word segments or related word segments with the highest total scores are obtained as keywords, which reduces errors and improves the accuracy of text keywords.
  • FIG. 1 is a schematic diagram of the server structure of the hardware operating environment involved in the solution of the embodiment of the application.
  • the terminal in the embodiment of this application is a server.
  • the terminal may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as a magnetic disk memory.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • The terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
  • the memory 1005 which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a program for extracting partially optimized keywords.
  • the network interface 1004 is mainly used to connect to a back-end server and communicate with the back-end server;
  • the user interface 1003 is mainly used to connect to a client (user side) to communicate with the client;
  • the processor 1001 can be used to call the extraction program of locally optimized keywords stored in the memory 1005, and perform the following operations:
  • the weight parameters corresponding to each target word segment are recorded in the preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
  • the five target word segments and/or related word segments with the highest total scores are extracted as the keywords of the text to be processed.
  • processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:
  • the preset Chinese word segmentation system is activated to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words;
  • the part-of-speech scores of the characters whose parts of speech are nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words are obtained from the part-of-speech score comparison table in the Chinese word segmentation system, and the characters with a part-of-speech score greater than 0 are determined as target word segments;
  • the target word segments are collected into a word segmentation set, and the part of speech of the target word segments in the set is identified as the keyword part of speech.
  • processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:
  • the target word segmentation is used as the search condition, and the word frequency of each target word segmentation in the title, the first paragraph and the end is indexed, and the score value and word frequency of each target word segmentation are recorded in a hash table.
  • processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:
  • when the first word segment is a target word segment in the word segmentation set, determine that the second word segment immediately before the first word segment and the third word segment immediately after it are related word segments of the target word segment, and obtain the part of speech and word frequency of the related word segments;
  • by looking up the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech score corresponding to each related word segment is obtained, and the part-of-speech score and word frequency of the related word segment are recorded in the hash table.
  • processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:
  • the part of speech and the word frequency of the first participle are recorded in the hash table.
  • processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005, and also perform the following operations:
  • this application is a first embodiment of a method for extracting locally optimized keywords.
  • the method for extracting locally optimized keywords includes:
  • Step S10 receiving the text to be processed, and recognizing the characters in the title, the first paragraph and the last paragraph of the text to be processed;
  • The server determines the positions of the title, the first paragraph, and the end of the text. Specifically, when the server obtains the text to be processed, the title is generally located in the middle of the first line of the text, or possibly on the line above a certain paragraph, and the title characters are generally in bold.
  • The first paragraph generally starts on the second line of the text to be processed and generally begins with the first space character (an indent of two character widths); the text from the first space character to the second space character, starting on the second line, is regarded as the first paragraph of the text to be processed.
  • the end is located between the last character and the second space on the second line.
  • The server obtains the positions of the spaces that precede characters in the text to be processed, so as to determine the positions of the first paragraph and the end.
  • The server then calls character recognition software to scan the text to be processed and obtain the characters in its title, first paragraph, and end.
  • Step S20 based on the preset Chinese word segmentation system, segment the characters in the title, the first paragraph and the end, obtain the word segmentation set of the title, the first paragraph and the end, and update the part of speech of the target word segments in the word segmentation set to the keyword part of speech;
  • Chinese Word Segmentation refers to the segmentation of a sequence of Chinese characters into individual words.
  • Chinese word segmentation is the basis of text mining. For a piece of Chinese input, successfully performing Chinese word segmentation can achieve the effect of automatically identifying the meaning of the sentence.
  • All the words are stored in the Chinese word segmentation system; the system scans the text to be processed, finds all possible words, and then determines which words to output.
  • For example, for the text to be processed "I am a student" (我是学生), the output words are: I / am / student (我/是/学生).
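  • The dictionary lookup described above can be sketched as follows. This is a minimal illustration that assumes forward maximum matching as the rule for choosing which word to output; the source does not say which rule the preset Chinese word segmentation system uses, and the function name `segment`, the `max_len` parameter, and the three-word vocabulary are hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Split text into words by forward maximum matching (an assumed rule).

    At each position the longest dictionary word is taken; a single
    character is emitted when no dictionary word matches.
    """
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical dictionary covering the example sentence "I am a student".
vocabulary = {"我", "是", "学生"}
print("/".join(segment("我是学生", vocabulary)))  # 我/是/学生
```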
  • The server calls the preset Chinese word segmentation system.
  • The server uses the Chinese word segmentation system to segment the characters in the title, first paragraph and end of the text to be processed, and reads the resulting word segments.
  • The word segments that were read are collected to obtain the word segmentation set of the title, first paragraph and end of the text to be processed.
  • The word segments in the word segmentation set are used as the target word segments, and the part of speech of the target word segments is identified as the keyword part of speech.
  • Step S30 through the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters corresponding to each target word segmentation are recorded in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
  • The server retrieves the part-of-speech score comparison table in the Chinese word segmentation system and, based on the Chinese word segmentation system, obtains the part of speech of each target word segment in the word segmentation set. Through the part-of-speech score comparison table it obtains the score corresponding to each target word segment, uses that score as the weight parameter of the target word segment, and records it in the hash table.
  • Step S40 traverse the to-be-processed text, obtain the related word segmentation of the target word segmentation and the part of speech of the related word segmentation, and record the weight parameter of the related word segmentation in a hash table.
  • The server then traverses the text to be processed. Specifically, the server calls the character recognition software to traverse the text, recognizes all the characters in it, and splits the recognized characters into word segments based on the preset Chinese word segmentation system. Each word segment obtained is matched against the target word segments in the word segmentation set. When the word segment is a target word segment, its word frequency is recorded, the word segments immediately before and after it are taken as related word segments, their word frequencies are recorded, and the flow returns to step S30; when the word segment is not a target word segment, the next word segment is matched, until all the word segments in the text to be processed have been matched;
  • Step S50 according to the keyword part of speech of the target word segmentation and the weight parameter of the part of speech of each related participle in the hash table, extract the top five target word segmentation and/or related word segmentation as the keywords of the to-be-processed text.
  • After the server has matched all the word segments in the text to be processed, it sorts the weight parameters recorded in the hash table for the keywords and the related word segments from largest to smallest, extracts the keywords corresponding to the top five weight parameters, and determines them as the target keywords of the text to be processed.
  • In this embodiment, the title, first paragraph, and end of the text are analyzed and segmented to obtain the word frequency and part of speech of multiple target word segments. Together with the part of speech and word frequency of the related word segments of the target word segments in the text to be processed, the total score of each target word segment and related word segment is obtained. Based on the part-of-speech score and word frequency of the target word segments in the central idea and the part-of-speech score and word frequency of the related word segments, the target word segments or related word segments with the highest total scores are obtained as keywords, which reduces errors and improves the accuracy of text keywords.
  • FIG. 3 is a second embodiment of the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S10 includes:
  • Step S11 receiving the text to be processed, and obtaining the position of the space character in the text to be processed and the number N of space characters, where the number N of space characters is greater than 3;
  • Step S12 Use the characters between the first space character position and the second space character position as the title of the text to be processed, use the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and use the characters between the (N-1)-th space character position and the N-th space character position as the end of the text to be processed;
  • step S13 a preset character recognition program is called to recognize the characters in the title, the first paragraph and the last paragraph.
  • After receiving the text to be processed sent by the terminal, the server obtains the positions of the space characters and the number N of space characters in the text to be processed.
  • Specifically, the server receives the text to be processed, scans it, finds the blank spaces in each line, and records their positions and their number N.
  • The title is generally located on the first line of the text and is generally preceded by two blank characters in that line.
  • The characters between the second blank position and the third blank position are taken as the first paragraph of the text to be processed.
  • The characters between the (N-1)-th blank position and the N-th blank position are taken as the end of the text to be processed.
  • If the final character of the end of the text to be processed is not a blank character but a special symbol such as ".", "!", or "?", it is treated as a blank character.
  • The server invokes the preset character recognition software, recognizes the title, first paragraph, and end paragraph of the text to be processed, and obtains all the characters in them.
  • In this embodiment, the number and positions of the space characters in the text to be processed are obtained in order to locate its title, first paragraph, and last paragraph, and the characters in the title, first paragraph, and last paragraph are then obtained through a character recognition program. In this way the text to be processed can be divided quickly into the title, the first paragraph, and the last paragraph by means of the space characters.
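  • Steps S11 and S12 can be sketched as follows. This is only an illustration under stated assumptions: a single ideographic space (U+3000) stands in for the "space character", the end of the text is read as the span between the last two space characters, and the function name `split_sections` is hypothetical. Character recognition (step S13) is outside the sketch.

```python
def split_sections(text, marker="\u3000"):
    """Locate the title, first paragraph and end by space-character positions.

    Assumes one delimiter character per section boundary and more than
    three delimiters in total, as required for N in step S11.
    """
    positions = [i for i, ch in enumerate(text) if ch == marker]
    n = len(positions)
    if n <= 3:
        raise ValueError("expected more than 3 space characters")

    def between(a, b):
        # Characters strictly between the a-th and b-th delimiters (0-based).
        return text[positions[a] + 1:positions[b]]

    return {
        "title": between(0, 1),
        "first_paragraph": between(1, 2),
        "end": between(n - 2, n - 1),
    }


sample = "\u3000Title\u3000First paragraph.\u3000Middle.\u3000Last paragraph.\u3000"
print(split_sections(sample))
```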
  • Fig. 4 is a third embodiment provided by the method for extracting locally optimized keywords in this application. Based on the embodiment shown in Fig. 2 above, step S20 includes:
  • Step S21 When the characters in the title, the first paragraph and the last paragraph are recognized, the preset Chinese word segmentation system is activated to divide the characters in the title, the first paragraph and the last paragraph by part of speech into nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words;
  • Step S22 Obtain the part-of-speech scores in the part-of-speech score comparison table of the characters whose parts of speech are nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and neologisms in the Chinese word segmentation system, and determine the characters with the part-of-speech score greater than 0 as the target word segmentation;
  • Step S23 The target word segments are collected into a word segmentation set, and the part of speech of the target word segments in the set is identified as the keyword part of speech.
  • The server activates the preset Chinese word segmentation system once all the characters in the title, first paragraph, and end of the text to be processed have been recognized, and the Chinese word segmentation system automatically segments the recognized characters.
  • Specifically, nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words are recorded in the Chinese word segmentation system, and the system matches the acquired characters against them. For example, it first obtains one character and matches it against the recorded nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words.
  • The server obtains the nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words into which the Chinese word segmentation system has segmented the title, first paragraph, and last paragraph, looks up their part-of-speech scores in the part-of-speech score comparison table of the Chinese word segmentation system, and determines those whose part-of-speech score is greater than 0 as target word segments.
  • These target word segments are collected into a word segmentation set without duplicates, that is, when two identical nouns occur only one is kept, and the part of speech of each target word segment in the set is updated to the keyword part of speech.
  • The original part of speech of a target word segment is noun, verb, adjective, preposition, punctuation, quantifier, or new word; in the word segmentation set, these parts of speech are all identified as the keyword part of speech.
  • In this embodiment, the title, first paragraph, and last paragraph are segmented through the preset Chinese word segmentation system to obtain different characters, the part-of-speech score of each character is then obtained from the part-of-speech score comparison table, the characters with a score greater than 0 are determined as target word segments, and their part of speech is set to the keyword part of speech, so that the target word segments in the title, first paragraph, and last paragraph can be extracted quickly and accurately.
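  • Steps S22 and S23 can be sketched as a filter over tagged word segments. The score values in `pos_scores` below are hypothetical placeholders, not the values of the application's comparison table, and the function name `select_targets` and the tag spellings are likewise assumptions made for the sketch.

```python
def select_targets(tagged_segments, pos_scores):
    """Keep segments whose part-of-speech score is greater than 0.

    Returns a dict mapping each distinct target segment to the keyword
    part of speech, so duplicates are stored only once.
    """
    targets = {}
    for word, pos in tagged_segments:
        if pos_scores.get(pos, 0) > 0 and word not in targets:
            # The original tag is replaced by the keyword part of speech.
            targets[word] = "keyword"
    return targets


# Hypothetical scores; the real comparison table is not reproduced here.
pos_scores = {"noun": 3, "verb": 2, "adjective": 1,
              "preposition": 0, "punctuation": 0}
tagged = [("patent", "noun"), ("extracts", "verb"), ("of", "preposition"),
          ("patent", "noun"), (",", "punctuation")]
print(select_targets(tagged, pos_scores))
```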
  • FIG. 5 is a fourth embodiment provided by the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S30 includes:
  • Step S31 retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the corresponding score value of the keyword part of speech in the part-of-speech score comparison table;
  • step S32 the target word segmentation is used as a search condition, and the word frequency of each target word segmentation in the title, first paragraph and end is indexed, and the score value and word frequency of each target word segmentation are recorded in a hash table.
  • the server retrieves the part-of-speech score comparison table in the preset Chinese word segmentation system.
  • the part-of-speech score comparison table records the part-of-speech scores of nouns, verbs, adjectives, prepositions, punctuation, quantifiers, keywords, and new words.
  • the specific table is as follows :
  • By looking up the part-of-speech score comparison table, the part-of-speech score of each target word segment is obtained; through the index, the word frequency of each target word segment in the title, the first paragraph and the last paragraph is obtained; and the obtained word frequency and part-of-speech score are recorded in the hash table.
  • In this embodiment, the word frequency and part-of-speech score of each target word segment in the title, first paragraph and last paragraph can be obtained quickly.
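  • Steps S31 and S32 can be sketched with a Python dict serving as the hash table. The score assigned to the keyword part of speech (`keyword_score`) is a hypothetical value, and `record_weights` and the field names `pos_score` and `frequency` are illustrative names, not terms from the application.

```python
from collections import Counter


def record_weights(target_words, section_segments, keyword_score):
    """Record the part-of-speech score and word frequency of each target.

    section_segments is the list of all word segments found in the title,
    first paragraph and end; the returned dict plays the role of the
    preset hash table.
    """
    frequency = Counter(section_segments)
    return {
        word: {"pos_score": keyword_score, "frequency": frequency[word]}
        for word in target_words
    }


segments = ["keyword", "extraction", "keyword", "method"]
table = record_weights({"keyword", "method"}, segments, keyword_score=5)
print(table["keyword"])  # {'pos_score': 5, 'frequency': 2}
```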
  • FIG. 6 is a fifth embodiment of the method for extracting locally optimized keywords in this application. Based on the embodiment shown in FIG. 2, step S40 includes:
  • Step S41 traverse the text to be processed through a preset character recognition program, recognize characters in the text to be processed, and a preset Chinese word segmentation system divides the characters in the text to be processed into multiple word segmentation;
  • Step S42 extract the first word segmentation in the text to be processed, and judge whether the first word segmentation is the target word segmentation in the word segmentation set;
  • Step S43 when the first participle is the target participle in the participle set, determine that the second participle in front of the first participle and the third participle after the first participle are related participles of the target participle, and the part of speech and word frequency of the related participle are obtained;
  • Step S44 By comparing the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech score corresponding to the related word segmentation is obtained, and the part-of-speech score and word frequency of the related word segmentation are recorded in the hash table.
  • In this embodiment, the preset Chinese word segmentation system divides the characters in the text to be processed into multiple word segments. The first word segment in the text to be processed is extracted, and it is determined whether it is a target word segment in the word segmentation set; when it is, the second and third word segments immediately before and after it are read. Specifically, the server obtains the word segment positions produced by the Chinese word segmentation system and extracts the first word segment in the text to be processed; when the first word segment is a target word segment, it reads the part of speech and word frequency of the second and third word segments, looks up the part of speech of these related word segments in the part-of-speech score comparison table of the Chinese word segmentation system to obtain the corresponding part-of-speech scores, and records the part-of-speech score and word frequency of the related word segments in the hash table.
  • In this embodiment, the preset character recognition software is started to traverse the text to be processed and recognize its characters, and the preset Chinese word segmentation system divides those characters into multiple word segments; the first word segment in the text to be processed is extracted, and it is determined whether it is a target word segment in the word segmentation set; when it is, the second and third word segments immediately before and after it are read, so that the related word segments of the target word segments in the text to be processed are obtained quickly.
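  • Steps S42 to S44 can be sketched as a single pass over the tagged word segments of the whole text. The sketch assumes that a neighbor which is itself a target word segment keeps its existing entry, which the source does not specify; `record_related`, the field names, and the score values are hypothetical.

```python
from collections import Counter


def record_related(tagged_segments, targets, pos_scores, table):
    """Add the neighbors of every target segment to the hash table.

    tagged_segments is a list of (word, part_of_speech) pairs for the
    full text; table is the dict used as the hash table and is updated
    in place.
    """
    frequency = Counter(word for word, _ in tagged_segments)
    for index, (word, _) in enumerate(tagged_segments):
        if word not in targets:
            continue
        # The segments directly before and after a target are "related".
        for neighbor in (index - 1, index + 1):
            if 0 <= neighbor < len(tagged_segments):
                n_word, n_pos = tagged_segments[neighbor]
                if n_word in targets:
                    continue  # assumption: targets keep their own entry
                table[n_word] = {"pos_score": pos_scores.get(n_pos, 0),
                                 "frequency": frequency[n_word]}
    return table


tagged = [("fast", "adjective"), ("keyword", "noun"),
          ("extraction", "noun"), ("works", "verb")]
scores = {"noun": 3, "adjective": 1, "verb": 2}  # hypothetical values
print(record_related(tagged, {"keyword"}, scores, {}))
```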
  • FIG. 7 is a seventh embodiment of the method for extracting locally optimized keywords according to this application. Based on the embodiment shown in FIG. 2, after step S50, the method further includes:
  • Step S51: obtain the preset calculation rule and calculate the total score of each target segmented word and associated segmented word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
  • Step S52: sort the total scores in the hash table from largest to smallest or from smallest to largest, and extract the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
  • The server obtains the preset calculation rule and uses it to calculate the total score of each target segmented word and associated segmented word in the hash table. Specifically, for any target segmented word, the server obtains its word frequency, that is, the number of times the target segmented word occurs in the text to be processed, together with its corresponding part-of-speech score, and multiplies the word frequency by the part-of-speech score to obtain the total score of that target segmented word. Once the total scores of all target segmented words and associated segmented words in the hash table have been calculated, they are sorted from largest to smallest or from smallest to largest, and the five target segmented words or associated segmented words with the largest total scores are extracted as the keywords of the text to be processed.
  • In this way, the server obtains the preset calculation rule, calculates the total score of each target segmented word and associated segmented word in the hash table, sorts the total scores, and extracts the five target segmented words or associated segmented words with the largest total scores as the keywords of the text to be processed, thereby reducing errors and improving the accuracy of the text keywords.
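Steps S51 and S52 can be sketched as below. The layout of the hash table (word mapped to a `(part_of_speech_score, word_frequency)` pair), the function name `top_keywords`, and the sample numbers are assumptions made for illustration; the source only states that the total score is the word frequency multiplied by the part-of-speech score and that the top five are kept.

```python
def top_keywords(weight_table, k=5):
    """Rank words by total score = word frequency * part-of-speech score.

    weight_table -- dict: word -> (part_of_speech_score, word_frequency)
                    (hypothetical layout of the preset hash table)
    k            -- number of keywords to keep; the source uses five
    """
    totals = {
        word: score * frequency
        for word, (score, frequency) in weight_table.items()
    }
    # Sort from largest to smallest total score and keep the first k words.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Illustrative values only.
table = {
    "a": (3, 2),   # total 6
    "b": (1, 10),  # total 10
    "c": (2, 2),   # total 4
    "d": (5, 1),   # total 5
    "e": (1, 1),   # total 1
    "f": (4, 3),   # total 12
}
print(top_keywords(table))  # ['f', 'b', 'a', 'd', 'c']
```

The source does not say how ties are broken; with Python's stable sort, tied words keep their insertion order in the table.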
  • an embodiment of the present application also proposes a device for extracting locally optimized keywords.
  • the device for extracting locally optimized keywords includes:
  • the recognition unit is used to receive the text to be processed and recognize the characters in the title, first paragraph and last paragraph of the text to be processed;
  • the update unit is used to segment the characters in the title, first paragraph and last paragraph based on the preset Chinese word segmentation system, obtain the segmented word set of the title, first paragraph and last paragraph, and update the part of speech of the target segmented words in the segmented word set to the keyword part of speech;
  • the first recording unit is used to record the weight parameters corresponding to each target segmented word in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
  • the second recording unit is used to traverse the text to be processed, obtain the associated segmented words of the target segmented words and the parts of speech of the associated segmented words, and record the weight parameters of the associated segmented words in the hash table;
  • the extraction unit is used to extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target segmented words and for the part of speech of each associated segmented word, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
  • the above-mentioned recognition unit is specifically configured to: receive the text to be processed, and obtain the position of the space character in the text to be processed and the number N of space characters, where the number of space characters N is greater than 3;
  • the above-mentioned update unit is specifically used for: when the characters in the title, first paragraph and last paragraph are recognized, starting the preset Chinese word segmentation system to classify those characters by part of speech as nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words;
  • obtaining, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose parts of speech are noun, verb, adjective, preposition, punctuation, quantifier, and new word, and determining the characters with a part-of-speech score greater than 0 as target segmented words;
  • collecting the target segmented words into a segmented word set, and marking the part of speech of the target segmented words in the set as the keyword part of speech.
  • the above-mentioned first recording unit is specifically used to: retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the score value corresponding to the keyword part of speech in the part-of-speech score comparison table;
  • use each target segmented word as a search condition to look up its word frequency in the title, first paragraph and last paragraph, and record the score value and word frequency of each target segmented word in the hash table.
  • the second recording unit includes: a recognition subunit for traversing the text to be processed with preset character recognition software and recognizing the characters in the text to be processed, the preset Chinese word segmentation system dividing those characters into multiple segmented words;
  • the first judgment subunit is used for extracting the first segmented word in the text to be processed and judging whether the first segmented word is a target segmented word in the segmented word set;
  • the first determination subunit is used to determine, when the first segmented word is a target segmented word in the segmented word set, that the second segmented word before it and the third segmented word after it are associated segmented words of the target segmented word, and to obtain the part of speech and word frequency of the associated segmented words;
  • the acquiring subunit is used to obtain the part-of-speech scores corresponding to the associated segmented words by looking them up in the part-of-speech score comparison table in the Chinese word segmentation system, and to record the part-of-speech scores and word frequencies of the associated segmented words in the hash table.
  • the above-mentioned device for extracting locally optimized keywords further includes:
  • the second judgment subunit is used for judging, when the first segmented word is not a target segmented word in the segmented word set, whether the first segmented word is an associated segmented word of a target segmented word;
  • the second determination subunit is used to record the part of speech and word frequency of the first segmented word in the hash table when it is determined that the first segmented word is an associated segmented word of a target segmented word.
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:
  • the weight parameters corresponding to each target word segmentation are recorded in the preset hash table, where the weight parameters are the part-of-speech score and word frequency;
  • the five target segmented words and/or associated segmented words with the highest total scores are extracted as the keywords of the text to be processed.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disks, or optical disks) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of big data. Disclosed is a method for extracting locally optimized keywords, comprising: receiving a text to be processed, and identifying characters in the title, the first paragraph and the last paragraph of the text to be processed; acquiring, on the basis of a preset Chinese word segmentation system, target segmented words in the title, the first paragraph and the last paragraph, and updating the part of speech of the target segmented words to the keyword part of speech; recording, in a preset hash table, weight parameters corresponding to the target segmented words by means of a part-of-speech score comparison table in the Chinese word segmentation system; traversing the text to be processed, so as to acquire associated segmented words of the target segmented words and the part of speech of the associated segmented words, and recording, in the hash table, weight parameters of the associated segmented words; and extracting the target segmented words and/or associated segmented words whose total scores are in the top five as the keywords of the text to be processed. Further disclosed are an apparatus for extracting locally optimized keywords, a server and a storage medium. Because the keywords are derived from the target segmented words that carry the central idea of the text, errors are reduced and the accuracy of text keywords is improved.

Description

Method, device, equipment and storage medium for locally optimizing keywords
This application claims priority to Chinese patent application No. 201910884825.7, filed with the Chinese Patent Office on September 19, 2019 and entitled "Method, device, equipment and storage medium for locally optimizing keywords", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of big data technology, and in particular to a method, device, server, and computer-readable storage medium for extracting locally optimized keywords.
Background
In natural language processing research, keywords represent the central idea of a text and play an important role in tasks such as text retrieval and text classification, so keyword extraction technology has received attention from many scholars. Traditional keyword methods based on statistical features focus too heavily on the attributes of segmented words, such as part of speech, word frequency, and position, and ignore the overall central idea of the article. At present, most keyword extraction algorithms add features such as the association relationships between segmented words to the traditional statistical feature algorithms in order to obtain the final keywords. Many scholars in China and abroad use tf-idf weighted word frequency to filter out segmented words that appear in large numbers in the corpus, but this approach depends heavily on the size of the corpus and may cause the importance of a segmented word to deviate from its normal value. The inventor realized that although keyword extraction methods based on complex networks take the degree of association between segmented words into account, they focus too heavily on the "small world" property and ignore the "big world" influence and the central idea at the level of text content, which results in low keyword extraction accuracy.
Summary of the invention
The main purpose of this application is to provide a method for extracting locally optimized keywords, aiming to solve the technical problem that prior-art keyword methods based only on statistical features focus too heavily on the attributes of segmented words, such as part of speech, word frequency, and position, and ignore the overall central idea of the article, which leads to inaccurate keywords.
To achieve the above objective, the present application provides a method for extracting locally optimized keywords, which includes:
receiving the text to be processed, and recognizing the characters in the title, first paragraph and last paragraph of the text to be processed;
segmenting the characters in the title, first paragraph and last paragraph based on a preset Chinese word segmentation system, obtaining the segmented word set of the title, first paragraph and last paragraph, and updating the part of speech of the target segmented words in the segmented word set to the keyword part of speech;
recording the weight parameters corresponding to each target segmented word in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
traversing the text to be processed, obtaining the associated segmented words of the target segmented words and the parts of speech of the associated segmented words, and recording the weight parameters of the associated segmented words in the hash table;
extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target segmented words and for the part of speech of each associated segmented word, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
In addition, to achieve the above objective, the present application also provides a device for extracting locally optimized keywords, which includes:
a recognition unit, used to receive the text to be processed and recognize the characters in the title, first paragraph and last paragraph of the text to be processed;
an update unit, used to segment the characters in the title, first paragraph and last paragraph based on a preset Chinese word segmentation system, obtain the segmented word set of the title, first paragraph and last paragraph, and update the part of speech of the target segmented words in the segmented word set to the keyword part of speech;
a first recording unit, used to record the weight parameters corresponding to each target segmented word in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
a second recording unit, used to traverse the text to be processed, obtain the associated segmented words of the target segmented words and the parts of speech of the associated segmented words, and record the weight parameters of the associated segmented words in the hash table;
an extraction unit, used to extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target segmented words and for the part of speech of each associated segmented word, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
In addition, to achieve the above objective, the present application also provides a server, which includes a memory, a processor, and a program for extracting locally optimized keywords that is stored in the memory and can run on the processor; when the program is executed by the processor, the steps of the method for extracting locally optimized keywords described above are implemented.
In addition, to achieve the above objective, the present application also provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer executes the above method for extracting locally optimized keywords.
The method, device, server, and computer-readable storage medium for extracting locally optimized keywords proposed in the embodiments of the present application receive the text to be processed and recognize the characters in its title, first paragraph and last paragraph; segment the characters in the title, first paragraph and last paragraph based on a preset Chinese word segmentation system, obtain the segmented word set of the title, first paragraph and last paragraph, and update the part of speech of the target segmented words in the set to the keyword part of speech; record the weight parameters corresponding to each target segmented word in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency; traverse the text to be processed, obtain the associated segmented words of the target segmented words and their parts of speech, and record the weight parameters of the associated segmented words in the hash table; and extract, according to the weight parameters recorded in the hash table, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed. The keywords are thus the target segmented words or associated segmented words with the highest total scores, computed from the part-of-speech scores and word frequencies of the target segmented words that carry the central idea and of their associated segmented words, which reduces errors and improves the accuracy of the text keywords.
Description of the drawings
FIG. 1 is a schematic structural diagram of a server in the hardware operating environment involved in the embodiments of this application;
FIG. 2 is a schematic flowchart of a first embodiment of the method for extracting locally optimized keywords according to this application;
FIG. 3 is a schematic diagram of the detailed flow of step S10 in FIG. 2;
FIG. 4 is a schematic diagram of the detailed flow of step S20 in FIG. 2;
FIG. 5 is a schematic diagram of the detailed flow of step S30 in FIG. 2;
FIG. 6 is a schematic flowchart of a second embodiment of the method for extracting locally optimized keywords according to this application;
FIG. 7 is a schematic diagram of the detailed flow of step S50 in FIG. 2.
Detailed description
It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it.
The main solution of the embodiments of this application is: receive the text to be processed, and recognize the characters in the title, first paragraph and last paragraph of the text to be processed; segment the characters in the title, first paragraph and last paragraph based on a preset Chinese word segmentation system, obtain the segmented word set of the title, first paragraph and last paragraph, and update the part of speech of the target segmented words in the set to the keyword part of speech; record the weight parameters corresponding to each target segmented word in a preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency; traverse the text to be processed, obtain the associated segmented words of the target segmented words and their parts of speech, and record the weight parameters of the associated segmented words in the hash table; and extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target segmented words and for the part of speech of each associated segmented word, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
Prior-art keyword methods based on statistical features focus too heavily on the attributes of segmented words, such as part of speech, word frequency, and position, and ignore the overall central idea of the article, which leads to the technical problem of inaccurate keywords.
This application provides a solution: from the part-of-speech scores and word frequencies of the target segmented words that carry the central idea, and of their associated segmented words, the target segmented words or associated segmented words with the highest total scores are obtained as keywords, which reduces errors and improves the accuracy of the text keywords.
FIG. 1 is a schematic structural diagram of a server in the hardware operating environment involved in the embodiments of this application.
The terminal in the embodiments of this application is a server.
As shown in FIG. 1, the terminal may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art can understand that the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
As shown in FIG. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a program for extracting locally optimized keywords.
In the terminal shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and exchange data with it; the user interface 1003 is mainly used to connect to a client (user side) and exchange data with it; and the processor 1001 can be used to call the program for extracting locally optimized keywords stored in the memory 1005 and perform the following operations:
receive the text to be processed, and recognize the characters in the title, first paragraph and last paragraph of the text to be processed;
segment the characters in the title, first paragraph and last paragraph based on the preset Chinese word segmentation system, obtain the segmented word set of the title, first paragraph and last paragraph, and update the part of speech of the target segmented words in the set to the keyword part of speech;
record the weight parameters corresponding to each target segmented word in the preset hash table by means of the part-of-speech score comparison table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
traverse the text to be processed, obtain the associated segmented words of the target segmented words and the parts of speech of the associated segmented words, and record the weight parameters of the associated segmented words in the hash table;
extract, according to the weight parameters recorded in the hash table for the keyword part of speech of the target segmented words and for the part of speech of each associated segmented word, the five target segmented words and/or associated segmented words with the highest total scores as the keywords of the text to be processed.
Further, the processor 1001 may call the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
receive the text to be processed, and obtain the positions of the space characters in the text to be processed and the number N of space characters, where N is greater than 3;
take the characters between the first space character position and the second space character position as the title of the text to be processed, the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and the characters between the N-(N-1) space character position and the Nth space character position as the last paragraph of the text to be processed;
call the preset character recognition program to recognize the characters in the title, first paragraph and last paragraph.
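The splitting rule above can be sketched roughly as follows. Several points are assumptions rather than statements from the source: the "space character" is modelled as a configurable delimiter (a newline in the example), the function name `split_sections` is hypothetical, and the last paragraph is taken to lie between the second-to-last and the last delimiter, which is one plausible reading of the "N-(N-1)" wording.

```python
def split_sections(text, delimiter="\n"):
    """Return (title, first_paragraph, last_paragraph) from delimiter positions.

    Assumption: sections are the runs of characters between consecutive
    delimiter positions, and the text contains more than 3 delimiters.
    """
    positions = [i for i, ch in enumerate(text) if ch == delimiter]
    n = len(positions)
    if n <= 3:
        raise ValueError("expected more than 3 delimiter characters")

    def between(a, b):
        # Characters strictly between the a-th and b-th delimiter (0-based).
        return text[positions[a] + 1 : positions[b]]

    title = between(0, 1)
    first_paragraph = between(1, 2)
    # Assumed reading of the source: last section sits between the
    # second-to-last and the last delimiter.
    last_paragraph = between(n - 2, n - 1)
    return title, first_paragraph, last_paragraph


sample = "\nTitle\nFirst para\nMiddle\nLast para\n"
print(split_sections(sample))  # ('Title', 'First para', 'Last para')
```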
Further, the processor 1001 may call the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
when the characters in the title, first paragraph and last paragraph are recognized, start the preset Chinese word segmentation system to classify those characters by part of speech as nouns, verbs, adjectives, prepositions, punctuation, quantifiers, and new words;
obtain, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters whose parts of speech are noun, verb, adjective, preposition, punctuation, quantifier, and new word, and determine the characters with a part-of-speech score greater than 0 as target segmented words;
collect the target segmented words into a segmented word set, and mark the part of speech of the target segmented words in the set as the keyword part of speech.
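A minimal sketch of this selection step is shown below. The score values in `POS_SCORES`, the tag names, and the function name `select_target_words` are hypothetical; the source only says that a part-of-speech score comparison table exists and that words scoring above 0 become target segmented words marked with the keyword part of speech.

```python
# Hypothetical part-of-speech score comparison table; the real values
# are not given in the source.
POS_SCORES = {
    "noun": 3,
    "verb": 2,
    "adjective": 1,
    "preposition": 0,
    "punctuation": 0,
    "quantifier": 0,
    "new_word": 3,
}

KEYWORD_POS = "keyword"


def select_target_words(tagged_words):
    """Keep words whose part-of-speech score is greater than 0.

    tagged_words -- iterable of (word, part_of_speech) pairs, assumed to
                    come from the title, first paragraph and last paragraph
    Returns a dict mapping each target word to the keyword part of speech.
    """
    targets = {}
    for word, pos in tagged_words:
        # Unknown tags are treated as scoring 0 (an assumption).
        if POS_SCORES.get(pos, 0) > 0:
            targets[word] = KEYWORD_POS
    return targets


tagged = [("model", "noun"), ("of", "preposition"),
          ("extract", "verb"), (",", "punctuation")]
print(select_target_words(tagged))
```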
Further, the processor 1001 may call the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
retrieve the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtain the score value corresponding to the keyword part of speech in the part-of-speech score comparison table;
use each target segmented word as a search condition to look up its word frequency in the title, first paragraph and last paragraph, and record the score value and word frequency of each target segmented word in the hash table.
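As a rough sketch of this recording step, a Python dictionary can play the role of the preset hash table. The input format (a mapping from target word to its score value, plus the token list of the title, first paragraph and last paragraph) and the name `record_weights` are assumptions made for the example.

```python
from collections import Counter


def record_weights(target_scores, section_tokens):
    """Build the hash table: word -> (score_value, word_frequency).

    target_scores  -- dict: target word -> score value looked up in the
                      part-of-speech score comparison table (assumed input)
    section_tokens -- segmented words of the title, first paragraph and
                      last paragraph, concatenated
    """
    frequency = Counter(section_tokens)
    # A Counter returns 0 for words that never occur, so the lookup is safe.
    return {
        word: (score, frequency[word])
        for word, score in target_scores.items()
    }


tokens = ["model", "extract", "model", "the"]
print(record_weights({"model": 3, "extract": 2}, tokens))
# {'model': (3, 2), 'extract': (2, 1)}
```

Counting all tokens once with `Counter` avoids rescanning the sections for every target word, which is what a literal "search per word" would do.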
Further, the processor 1001 may call the program for extracting locally optimized keywords stored in the memory 1005 and also perform the following operations:
traverse the text to be processed with the preset character recognition program and recognize the characters in the text to be processed, the preset Chinese word segmentation system dividing those characters into multiple segmented words;
extract the first segmented word in the text to be processed, and judge whether the first segmented word is a target segmented word in the segmented word set;
when the first segmented word is a target segmented word in the segmented word set, determine that the second segmented word before it and the third segmented word after it are associated segmented words of the target segmented word, and obtain the part of speech and word frequency of the associated segmented words;
obtain the part-of-speech scores corresponding to the associated segmented words by looking them up in the part-of-speech score comparison table in the Chinese word segmentation system, and record the part-of-speech scores and word frequencies of the associated segmented words in the hash table.
Further, the processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005 and also perform the following operations:
When the first segmented word is not a target word in the segmented-word set, determine whether the first segmented word is an associated word of a target word;
When the first segmented word is determined to be an associated word of a target word, record the part of speech and word frequency of the first segmented word in the hash table.
Further, the processor 1001 may call the locally optimized keyword extraction program stored in the memory 1005 and also perform the following operations:
Obtain a preset calculation rule and calculate the total score of each target word and associated word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
Sort the total scores in the hash table in descending or ascending order, extract the five target words and/or associated words with the highest total scores, and take them as the keywords of the text to be processed.
Referring to FIG. 2, which shows a first embodiment of the method for extracting locally optimized keywords of this application, the method includes:
Step S10: receive the text to be processed, and recognize the characters in the title, first paragraph, and last paragraph of the text to be processed;
When the server receives the text to be processed from a terminal, it determines the positions of the title, first paragraph, and last paragraph of that text. Specifically, the title is generally centered on the first line of the text to be processed, although it may also sit on the line above a paragraph, and title characters are generally in bold. The first paragraph generally starts on the second line of the text and is generally preceded by a first space character (an indent of two characters); the span from the first space character on the second line to the second space is taken as the first paragraph. The last paragraph lies between the last character and the second space on the second line. The server obtains the positions of the spaces that precede characters in the text to be processed and from them determines the positions of the first and last paragraphs. It then calls character recognition software to scan the text and obtain the characters in the title, first paragraph, and last paragraph.
Step S20: based on the preset Chinese word segmentation system, segment the characters in the title, first paragraph, and last paragraph, obtain the segmented-word set of the title, first paragraph, and last paragraph, and update the part of speech of the target words in the segmented-word set to the keyword part of speech;
Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words. It is the foundation of text mining: once an input passage of Chinese has been segmented successfully, the meaning of its sentences can be identified automatically. All words are stored in the Chinese word segmentation system; the text to be processed is scanned to find all possible words, and the system then decides which words to output. For example, for the text to be processed "我是学生" ("I am a student"), the segmented words are 我/是/学生 (I / am / student). The server calls the preset Chinese word segmentation system, uses it to segment the characters in the title, first paragraph, and last paragraph of the text to be processed, reads the resulting segmented words, and collects them into the segmented-word set of the title, first paragraph, and last paragraph. The segmented words in the set are taken as target words, and their part of speech is marked as the keyword part of speech.
Step S30: using the part-of-speech score table in the Chinese word segmentation system, record the weight parameters of each target word in a preset hash table, where the weight parameters are the part-of-speech score and the word frequency;
When the server obtains the segmented-word set, it retrieves the part-of-speech score table in the Chinese word segmentation system, obtains the part of speech of each target word in the set, and looks up the score of each target word in the table. That score is taken as the weight parameter of the target word and recorded in the hash table.
Step S40: traverse the text to be processed, obtain the associated words of the target words and the parts of speech of the associated words, and record the weight parameters of the associated words in the hash table;
The server then traverses the text to be processed. Specifically, the server calls character recognition software to traverse the text and recognize all of its characters, and segments the recognized characters with the preset Chinese word segmentation system. Each segmented word obtained from the text is matched against the target words in the segmented-word set. When the segmented word is a target word, the server records its word frequency, takes the segmented words immediately before and after it as associated words, records the word frequency of those associated words, and performs step S30. When the segmented word is not a target word, the server moves on to the next segmented word, until every segmented word in the text to be processed has been matched;
Step S50: according to the weight parameters in the hash table for the keyword part of speech of the target words and for the parts of speech of the associated words, extract the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
After matching all segmented words in the text, the server sorts the weight parameters of the keywords and associated words recorded in the hash table in descending order, extracts the words that correspond to the top five weight parameters, and determines them to be the target keywords of the text to be processed.
In this embodiment, the title, first paragraph, and last paragraph are treated as carrying the central idea of the text. They are analyzed and segmented to obtain the word frequency and part of speech of multiple target words, and the part of speech and word frequency of the associated words of those target words are then obtained from the text to be processed. Total scores are computed from the part-of-speech scores and word frequencies of both the target words and the associated words, and the target words or associated words with the highest total scores are taken as keywords, which reduces error and improves the accuracy of the extracted keywords.
Further, referring to FIG. 3, which shows a second embodiment of the method for extracting locally optimized keywords of this application, based on the embodiment shown in FIG. 2, step S10 includes:
Step S11: receive the text to be processed, and obtain the positions of the space characters in the text to be processed and the number N of space characters, where N is greater than 3;
Step S12: take the characters between the first space character position and the second space character position as the title of the text to be processed, take the characters between the second space character position and the third space character position as the first paragraph, and take the characters between the space character position N-(N-1) and the space character position N as the last paragraph;
Step S13: call the preset character recognition program to recognize the characters in the title, first paragraph, and last paragraph.
When the server receives the text to be processed from the terminal, it obtains the positions of the space characters in the text and the number N of space characters. Specifically, the server receives the text to be processed, scans it, finds the blank positions on each line, and records their positions and their number N. The span between the first blank position and the second blank position is taken as the title of the text to be processed. The title is generally on the first line of the text, and its first character is generally preceded by two blank characters on that line. The span between the second blank position and the third blank position is taken as the first paragraph. The span between blank position N and blank position N-(N-1) is taken as the last paragraph. For example, when the final character of the last paragraph is not a blank character but a special symbol such as "。", "!", or "?", that symbol is treated as a blank character. The server calls preset character recognition software to recognize the title, first paragraph, and last paragraph of the text, and obtains all characters in them.
In this embodiment, the number and positions of the space characters in the text to be processed are obtained and used to divide the text into the title, first paragraph, and last paragraph, whose characters are then obtained by the character recognition program. Using the space characters makes this division fast.
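The split described in steps S11 to S13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: it assumes plain text in which every non-empty line is one block (the title or a paragraph) and the leading indent of each block plays the role of a "space character position". The function name `split_title_first_last` is hypothetical.

```python
def split_title_first_last(text: str) -> tuple[str, str, str]:
    """Return (title, first paragraph, last paragraph) of a plain-text document."""
    # Assumption: each non-empty line is one block, and its leading
    # whitespace is the marker that separates it from the previous block.
    blocks = [line.strip() for line in text.splitlines() if line.strip()]
    if len(blocks) < 3:
        raise ValueError("expected a title and at least two paragraphs")
    # First block is the title, second is the first paragraph,
    # and the final block is the last paragraph.
    return blocks[0], blocks[1], blocks[-1]
```

Middle paragraphs are deliberately ignored here, because only the title, first paragraph, and last paragraph feed the later steps.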
Referring to FIG. 4, which shows a third embodiment of the method for extracting locally optimized keywords of this application, based on the embodiment shown in FIG. 2, step S20 includes:
Step S21: when the characters in the title, first paragraph, and last paragraph are recognized, start the preset Chinese word segmentation system to classify those characters by part of speech: noun, verb, adjective, preposition, punctuation, quantifier, or new word;
Step S22: look up, in the part-of-speech score table of the Chinese word segmentation system, the part-of-speech score of each character string tagged as a noun, verb, adjective, preposition, punctuation mark, quantifier, or new word, and determine the strings whose part-of-speech score is greater than 0 to be target words;
Step S23: collect the target words into a segmented-word set, and mark the part of speech of each target word in the set as the keyword part of speech.
When the server has recognized all characters in the title, first paragraph, and last paragraph of the text to be processed, it starts the preset Chinese word segmentation system, which segments the recognized characters. Specifically, the Chinese word segmentation system stores nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, and new words, and matches the obtained characters against these stored entries. For example, it first takes one character and matches it against the stored entries; when the match fails, it takes two characters and matches again, and so on until a match succeeds. The server obtains the nouns, verbs, adjectives, prepositions, punctuation marks, quantifiers, and new words that the system segments from the title, first paragraph, and last paragraph, looks up their part-of-speech scores in the part-of-speech score table of the Chinese word segmentation system, and determines those whose score is greater than 0 to be target words. These words are collected into the segmented-word set with duplicates removed (if the same noun occurs twice, only one copy is kept), and the part of speech of the target words in the set is updated: the original parts of speech of the target words, such as noun, verb, adjective, preposition, punctuation, quantifier, and new word, are marked as the keyword part of speech.
In this embodiment, the preset Chinese word segmentation system segments the title, first paragraph, and last paragraph into separate words, the part-of-speech score of each word is obtained from the part-of-speech score table, and the words whose score is greater than 0 are determined to be target words with the keyword part of speech, so that the target words in the title, first paragraph, and last paragraph are extracted quickly and accurately.
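The character-by-character matching described above (try one character, then two, until an entry matches) can be sketched as follows. This is a minimal illustration rather than the segmentation system of the application: the `LEXICON` set, the `segment` function, and the `max_len` limit are hypothetical names introduced for the example.

```python
# Hypothetical toy lexicon; a real system would hold a full dictionary.
LEXICON = {"我", "是", "学生"}

def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Split text by growing a window one character at a time until it matches."""
    words: list[str] = []
    i = 0
    while i < len(text):
        for size in range(1, max_len + 1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                break  # shortest entry starting at position i
        else:
            candidate = text[i]  # no entry found: emit the single character
        words.append(candidate)
        i += len(candidate)
    return words

print(segment("我是学生", LEXICON))  # ['我', '是', '学生']
```

Because the window grows from one character upward, the sketch always picks the shortest matching entry; production segmenters usually prefer longest-match or statistical methods.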
Referring to FIG. 5, which shows a fourth embodiment of the method for extracting locally optimized keywords of this application, based on the embodiment shown in FIG. 2, step S30 includes:
Step S31: retrieve the part-of-speech score table in the preset Chinese word segmentation system, and obtain the score that corresponds to the keyword part of speech in that table;
Step S32: using each target word in turn as a search condition, look up the word frequency of each target word in the title, first paragraph, and last paragraph, and record the score and word frequency of each target word in a hash table.
The server retrieves the part-of-speech score table in the preset Chinese word segmentation system. The table records the scores for the parts of speech noun, verb, adjective, preposition, punctuation, quantifier, keyword, and new word, as shown below:
Part of speech      Score
Noun (n)            3.0
Verb (v)            2.0
Adjective (a)       1.0
Preposition (p)     0.0
Punctuation (w)     0.0
Quantifier (m)      0.0
Keyword (kw)        4.0
New word (nw)       3.0
The score that corresponds to the keyword part of speech is obtained from the part-of-speech score table. The word frequency of each target word in the segmented-word set is then found by searching the title, first paragraph, and last paragraph, and the word frequency of each target word is recorded in the hash table together with the keyword score.
In this embodiment, the part-of-speech score of each target word is obtained from the part-of-speech score table, the word frequency of each target word in the title, first paragraph, and last paragraph is obtained by searching, and both are recorded in the hash table, so that the word frequency and part of speech of each target word in the title, first paragraph, and last paragraph are obtained quickly.
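Steps S31 and S32 can be sketched with an ordinary dictionary acting as the hash table. This is a minimal sketch under assumptions: the input is taken to be a list of (word, part-of-speech tag) pairs already produced by a segmenter, the scores are copied from the table above, and the names `POS_SCORES` and `build_target_table` are hypothetical.

```python
from collections import Counter

# Scores copied from the part-of-speech score table above.
POS_SCORES = {"n": 3.0, "v": 2.0, "a": 1.0, "p": 0.0,
              "w": 0.0, "m": 0.0, "kw": 4.0, "nw": 3.0}

def build_target_table(tagged_words: list[tuple[str, str]]) -> dict[str, dict]:
    """tagged_words: (word, tag) pairs from the title, first and last paragraphs."""
    # Only words whose part-of-speech score is greater than 0 become target words.
    freq = Counter(word for word, tag in tagged_words
                   if POS_SCORES.get(tag, 0.0) > 0)
    # Each target word is re-tagged with the keyword part of speech ("kw"),
    # so its recorded score is the keyword score, next to its word frequency.
    return {word: {"score": POS_SCORES["kw"], "freq": count}
            for word, count in freq.items()}
```

Using `Counter` removes duplicates and counts occurrences in one pass, which matches the "keep only one copy" rule for the segmented-word set.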
Referring to FIG. 6, which shows a fifth embodiment of the method for extracting locally optimized keywords of this application, based on the embodiment shown in FIG. 2, step S40 includes:
Step S41: traverse the text to be processed with a preset character recognition program to recognize the characters in the text, and have the preset Chinese word segmentation system split those characters into multiple segmented words;
Step S42: extract a first segmented word from the text to be processed, and determine whether the first segmented word is a target word in the segmented-word set;
Step S43: when the first segmented word is a target word in the segmented-word set, determine that the second segmented word immediately before it and the third segmented word immediately after it are associated words of the target word, and obtain the part of speech and word frequency of the associated words;
Step S44: by consulting the part-of-speech score table in the Chinese word segmentation system, obtain the part-of-speech score of each associated word, and record the part-of-speech score and word frequency of the associated words in the hash table.
The preset character recognition software is started to traverse the text to be processed and recognize its characters, and the preset Chinese word segmentation system splits those characters into multiple segmented words. A first segmented word is extracted from the text, and it is determined whether it is a target word in the segmented-word set. When it is, the second and third segmented words immediately before and after it are read. Specifically, the server obtains the positions of the segmented words produced by the Chinese word segmentation system and extracts the first segmented word; when the first segmented word is a target word, the server reads the part of speech and word frequency of the second and third segmented words, looks up the parts of speech of these associated words in the part-of-speech score table of the Chinese word segmentation system to obtain their part-of-speech scores, and records the part-of-speech scores and word frequencies of the associated words in the hash table. When the second segmented word before the first segmented word, or the third segmented word after it, is a blank character or a special symbol, that segmented word is not read and the next segmented word is obtained instead.
When the server determines that the first segmented word is not a target word in the segmented-word set, it determines whether the first segmented word is an associated word of a target word. Specifically, once the characters of the first segmented word are recognized, they are compared with the characters of the target words. When they differ from the characters of the target words, they are compared with the characters of the associated words of the target words to determine whether the first segmented word is an associated word. When the characters of the first segmented word match those of an associated word, the part of speech and word frequency of the first segmented word are recorded in the hash table, with the word frequency incremented by one.
In this embodiment, the preset character recognition software traverses the text to be processed and recognizes its characters, the preset Chinese word segmentation system splits those characters into multiple segmented words, and each first segmented word is checked against the target words in the segmented-word set. When the first segmented word is a target word, the second and third segmented words before and after it are read, so that the associated words of the target words in the text to be processed are obtained quickly.
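The neighbour lookup of steps S41 to S44 can be sketched as below. It is a minimal sketch under assumptions: the full text is assumed to be already segmented into (word, tag) pairs, punctuation is assumed to carry the tag "w", and neighbours that are themselves target words are skipped, which is a simplification the text does not state. The function name `collect_associated` is hypothetical.

```python
from collections import Counter

def collect_associated(tagged_words: list[tuple[str, str]],
                       targets: set[str],
                       pos_scores: dict[str, float]) -> dict[str, dict]:
    """Return {word: {"score", "freq"}} for the neighbours of target words."""
    counts: Counter = Counter()
    tags: dict[str, str] = {}
    for i, (word, _tag) in enumerate(tagged_words):
        if word not in targets:
            continue
        for j in (i - 1, i + 1):  # word just before and just after the target
            if not 0 <= j < len(tagged_words):
                continue
            neighbour, tag = tagged_words[j]
            # Skip blanks, punctuation, and (by assumption) other target words.
            if tag == "w" or not neighbour.strip() or neighbour in targets:
                continue
            counts[neighbour] += 1
            tags[neighbour] = tag
    return {w: {"score": pos_scores.get(tags[w], 0.0), "freq": n}
            for w, n in counts.items()}
```

For the tagged sequence 快速/a 提取/v 关键词/n 。/w with 关键词 as the only target word, the sketch records 提取 once with its verb score and skips the trailing punctuation.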
Referring to FIG. 7, which shows a seventh embodiment of the method for extracting locally optimized keywords of this application, based on the embodiment shown in FIG. 2, the method further includes, after step S50:
Step S51: obtain a preset calculation rule and calculate the total score of each target word and associated word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
Step S52: sort the total scores in the hash table in descending or ascending order, extract the five target words and/or associated words with the highest total scores, and take them as the keywords of the text to be processed.
The server obtains the preset calculation rule and uses it to calculate the total score of each target word and associated word in the hash table. Specifically, for any target word, the server obtains its word frequency, that is, the number of times the target word occurs in the text to be processed, together with its part-of-speech score, and multiplies the two to obtain the total score of that target word. After the total scores of all target words and associated words in the hash table have been calculated, they are sorted in descending or ascending order, and the five target words or associated words with the highest total scores are extracted as the keywords of the text to be processed.
In this embodiment, the server calculates the total score of each target word and associated word in the hash table according to the preset calculation rule, sorts the total scores, and extracts the five target words or associated words with the highest total scores as the keywords of the text to be processed, which reduces error and improves the accuracy of the extracted keywords.
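The scoring rule of steps S51 and S52 (total score = word frequency × part-of-speech score, then keep the top five) can be sketched as follows. The hash-table layout with "score" and "freq" fields and the function name `top_keywords` are assumptions made for the example.

```python
import heapq

def top_keywords(table: dict[str, dict], k: int = 5) -> list[str]:
    """Return the k words with the highest total score, best first."""
    # Total score of each word: word frequency times part-of-speech score.
    totals = {word: entry["freq"] * entry["score"]
              for word, entry in table.items()}
    # nlargest avoids sorting the whole table when only k entries are needed.
    return heapq.nlargest(k, totals, key=totals.get)
```

For a table holding both target words and associated words, calling `top_keywords(table)` yields at most five keywords; a full sort followed by slicing would give the same result.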
In addition, an embodiment of this application further provides an apparatus for extracting locally optimized keywords, the apparatus including:
A recognition unit, configured to receive the text to be processed and recognize the characters in the title, first paragraph, and last paragraph of the text to be processed;
An update unit, configured to segment the characters in the title, first paragraph, and last paragraph based on the preset Chinese word segmentation system, obtain the segmented-word set of the title, first paragraph, and last paragraph, and update the part of speech of the target words in the segmented-word set to the keyword part of speech;
A first recording unit, configured to record the weight parameters of each target word in a preset hash table by using the part-of-speech score table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
A second recording unit, configured to traverse the text to be processed, obtain the associated words of the target words and the parts of speech of the associated words, and record the weight parameters of the associated words in the hash table;
An extraction unit, configured to extract, according to the weight parameters in the hash table for the keyword part of speech of the target words and for the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
Further, the identification unit is specifically configured to: receive the text to be processed, and obtain the positions of the space characters in the text to be processed and the number N of space characters, where N is greater than 3;
take the characters between the first space character position and the second space character position as the title of the text to be processed, take the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and take the characters between the N-(N-1) space character position and the Nth space character position as the last paragraph of the text to be processed;
invoke a preset character recognition program to identify the characters in the title, first paragraph, and last paragraph.
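As a non-limiting illustration, the space-based splitting described above might be sketched as follows. The function name `split_sections` and its `delimiter` parameter are hypothetical and do not appear in the application. The sketch also assumes that the last paragraph lies between the second-to-last and the last space character, which is one reading of the boundary stated above.

```python
def split_sections(text: str, delimiter: str = " "):
    """Return (title, first_paragraph, last_paragraph) using delimiter positions.

    Assumption: sections are separated by single space characters, and the
    last paragraph sits between the (N-1)th and the Nth space character.
    """
    # Positions of every delimiter character in the text.
    positions = [i for i, ch in enumerate(text) if ch == delimiter]
    n = len(positions)
    if n <= 3:
        # The description requires more than 3 space characters.
        raise ValueError("expected more than 3 delimiter characters")

    title = text[positions[0] + 1:positions[1]]
    first_paragraph = text[positions[1] + 1:positions[2]]
    last_paragraph = text[positions[n - 2] + 1:positions[n - 1]]
    return title, first_paragraph, last_paragraph
```

For example, a text laid out as space, title, space, first paragraph, space, middle paragraph, space, last paragraph, space would yield the title, the first paragraph, and the last paragraph, skipping the middle one.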
Further, the update unit is specifically configured to: when the characters in the title, first paragraph, and last paragraph have been identified, start the preset Chinese word segmentation system to classify the characters in the title, first paragraph, and last paragraph by part of speech as noun, verb, adjective, preposition, punctuation, measure word, or new word;
obtain, from the part-of-speech score lookup table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, measure word, or new word, and determine the characters whose part-of-speech score is greater than 0 to be target words;
collect the target words into a segmented-word set, and mark the part of speech of each target word in the segmented-word set as the keyword part of speech.
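As a non-limiting illustration, the selection of target words by part-of-speech score might be sketched as follows. The score values in `POS_SCORES` are hypothetical placeholders, since the application does not disclose the contents of the lookup table. The input is assumed to be already segmented and tagged by some Chinese word segmentation system.

```python
# Hypothetical part-of-speech score lookup table; the real values are not
# given in the application.
POS_SCORES = {
    "noun": 3,
    "verb": 2,
    "adjective": 1,
    "preposition": 0,
    "punctuation": 0,
    "measure_word": 0,
    "new_word": 3,
}


def select_target_words(tagged_words):
    """Keep the words whose part-of-speech score is greater than 0.

    tagged_words: iterable of (word, part_of_speech) pairs.
    Returns a dict mapping each target word to its part of speech.
    """
    targets = {}
    for word, pos in tagged_words:
        if POS_SCORES.get(pos, 0) > 0:
            # Keep the first part of speech seen for a repeated word.
            targets.setdefault(word, pos)
    return targets
```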
Further, the first recording unit is specifically configured to: retrieve the part-of-speech score lookup table in the preset Chinese word segmentation program, and obtain the score corresponding to the keyword part of speech in the part-of-speech score lookup table;
use each target word in turn as a search term, look up the word frequency of each target word in the title, first paragraph, and last paragraph, and record the score and word frequency of each target word in the hash table.
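As a non-limiting illustration, recording each target word's score and word frequency in a hash table might be sketched as follows, using a Python `dict` as the hash table. The function name `record_targets`, the entry keys `pos_score` and `freq`, and the score values passed in are all hypothetical.

```python
from collections import Counter


def record_targets(targets, section_words, pos_scores):
    """Build a hash table of weight parameters for the target words.

    targets: dict mapping target word -> part of speech.
    section_words: list of segmented words from the title, first paragraph,
        and last paragraph.
    pos_scores: dict mapping part of speech -> score (assumed lookup table).
    """
    # Word frequency within the title, first paragraph, and last paragraph.
    counts = Counter(section_words)
    return {
        word: {"pos_score": pos_scores[pos], "freq": counts[word]}
        for word, pos in targets.items()
    }
```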
Further, the second recording unit includes: an identification subunit, configured to traverse the text to be processed using preset character recognition software and identify the characters in the text to be processed, the preset Chinese word segmentation system segmenting the characters in the text to be processed into a plurality of segmented words;
a first judging subunit, configured to extract a first segmented word from the text to be processed and judge whether the first segmented word is a target word in the segmented-word set;
a first determining subunit, configured to, when the first segmented word is a target word in the segmented-word set, determine that the second segmented word preceding the first segmented word and the third segmented word following it are associated words of the target word, and obtain the part of speech and word frequency of the associated words;
an obtaining subunit, configured to obtain the part-of-speech score corresponding to each associated word by consulting the part-of-speech score lookup table in the Chinese word segmentation system, and record the part-of-speech score and word frequency of the associated word in the hash table.
Further, the apparatus for extracting locally optimized keywords further includes:
a second judging subunit, configured to, when the first segmented word is not a target word in the segmented-word set, judge whether the first segmented word is an associated word of a target word;
a second determining subunit, configured to, when the first segmented word is determined to be an associated word of a target word, record the part of speech and word frequency of the first segmented word in the hash table.
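As a non-limiting illustration, the traversal that finds the associated words (the segmented words immediately before and after each target word) might be sketched as follows. The application does not state how the word frequency of an associated word is counted; this sketch assumes it is the number of positions at which the word is adjacent to a target word. The function name and the entry keys are hypothetical.

```python
def record_associated(tagged_words, targets, pos_scores, table):
    """Add associated words (neighbors of target words) to the hash table.

    tagged_words: list of (word, part_of_speech) pairs for the whole text.
    targets: collection of target words.
    pos_scores: dict mapping part of speech -> score (assumed lookup table).
    table: dict used as the hash table; it is updated and returned.
    """
    n = len(tagged_words)
    neighbor_positions = set()
    for i, (word, _) in enumerate(tagged_words):
        if word not in targets:
            continue
        # The words directly before and after a target word are associated.
        for j in (i - 1, i + 1):
            if 0 <= j < n and tagged_words[j][0] not in targets:
                neighbor_positions.add(j)

    # Each neighboring position is counted once, even if it sits between
    # two target words.
    for j in sorted(neighbor_positions):
        word, pos = tagged_words[j]
        entry = table.setdefault(
            word, {"pos_score": pos_scores.get(pos, 0), "freq": 0}
        )
        entry["freq"] += 1
    return table
```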
Further, the extraction unit is specifically configured to:
obtain a preset calculation rule, and calculate the total score of each target word and associated word in the hash table, where the total score is the word frequency multiplied by the part-of-speech score;
sort the total scores in the hash table in descending or ascending order, extract the five target words and/or associated words with the highest total scores, and take the extracted target words and/or associated words as the keywords of the text to be processed.
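As a non-limiting illustration, the scoring and ranking step might be sketched as follows. The total score is the word frequency multiplied by the part-of-speech score, as stated above; the function name `top_keywords`, the entry keys, and the tie-breaking behavior (insertion order, via Python's stable sort) are assumptions of this sketch.

```python
def top_keywords(table, k=5):
    """Return the k words with the highest total score.

    table: dict mapping word -> {"pos_score": number, "freq": int}.
    Total score = word frequency * part-of-speech score.
    """
    totals = {
        word: entry["freq"] * entry["pos_score"]
        for word, entry in table.items()
    }
    # Sort in descending order of total score; ties keep insertion order.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]
```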
The functions of the units in the above apparatus for extracting locally optimized keywords correspond to the steps in the above embodiments of the method for extracting locally optimized keywords; their functions and implementation are not repeated here.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
receiving text to be processed, and identifying the characters in the title, first paragraph, and last paragraph of the text to be processed;
segmenting the characters in the title, first paragraph, and last paragraph based on a preset Chinese word segmentation system, obtaining a segmented-word set for the title, first paragraph, and last paragraph, and updating the part of speech of each target word in the segmented-word set to the keyword part of speech;
recording the weight parameters corresponding to each target word in a preset hash table by means of the part-of-speech score lookup table in the Chinese word segmentation system, where the weight parameters are the part-of-speech score and the word frequency;
traversing the text to be processed, obtaining the associated words of the target words and the parts of speech of the associated words, and recording the weight parameters of the associated words in the hash table;
extracting, according to the weight parameters in the hash table for the keyword part of speech of the target words and the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A method for extracting locally optimized keywords, the method comprising:
    receiving text to be processed, and identifying the characters in the title, first paragraph, and last paragraph of the text to be processed;
    segmenting, based on a preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtaining a segmented-word set for the title, first paragraph, and last paragraph, and updating the part of speech of each target word in the segmented-word set to a keyword part of speech;
    recording, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, wherein the weight parameters are a part-of-speech score and a word frequency;
    traversing the text to be processed, obtaining the associated words of the target words and the parts of speech of the associated words, and recording the weight parameters of the associated words in the hash table;
    extracting, according to the weight parameters in the hash table for the keyword part of speech of the target words and the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
  2. The method for extracting locally optimized keywords according to claim 1, wherein the receiving text to be processed and identifying the characters in the title, first paragraph, and last paragraph of the text to be processed comprises:
    receiving the text to be processed, and obtaining the positions of the space characters in the text to be processed and the number N of space characters, wherein N is greater than 3;
    taking the characters between the first space character position and the second space character position as the title of the text to be processed, taking the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and taking the characters between the N-(N-1) space character position and the Nth space character position as the last paragraph of the text to be processed;
    invoking a preset character recognition program to identify the characters in the title, first paragraph, and last paragraph.
  3. The method for extracting locally optimized keywords according to claim 2, wherein the segmenting, based on a preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtaining a segmented-word set for the title, first paragraph, and last paragraph, and updating the part of speech of each target word in the segmented-word set to a keyword part of speech comprises:
    when the characters in the title, first paragraph, and last paragraph have been identified, starting the preset Chinese word segmentation system to classify the characters in the title, first paragraph, and last paragraph by part of speech as noun, verb, adjective, preposition, punctuation, measure word, or new word;
    obtaining, from the part-of-speech score lookup table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, measure word, or new word, and determining the characters whose part-of-speech score is greater than 0 to be target words;
    collecting the target words into a segmented-word set, and marking the part of speech of each target word in the segmented-word set as the keyword part of speech.
  4. The method for extracting locally optimized keywords according to claim 3, wherein the recording, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, wherein the weight parameters are a part-of-speech score and a word frequency, comprises:
    retrieving the part-of-speech score lookup table in the preset Chinese word segmentation system, and obtaining the score corresponding to the keyword part of speech in the part-of-speech score lookup table;
    using each target word in turn as a search term, looking up the word frequency of each target word in the title, the first paragraph, and the last paragraph, and recording the score and word frequency of each target word in the hash table.
  5. The method for extracting locally optimized keywords according to claim 4, wherein the traversing the text to be processed, obtaining the associated words of the target words and the parts of speech of the associated words, and recording the weight parameters of the associated words in the hash table comprises:
    traversing the text to be processed using the preset character recognition program and identifying the characters in the text to be processed, the preset Chinese word segmentation system segmenting the characters in the text to be processed into a plurality of segmented words;
    extracting a first segmented word from the text to be processed, and judging whether the first segmented word is a target word in the segmented-word set;
    when the first segmented word is a target word in the segmented-word set, determining that the second segmented word preceding the first segmented word and the third segmented word following it are associated words of the target word, and obtaining the part of speech and word frequency of the associated words;
    obtaining the part-of-speech score corresponding to each associated word by consulting the part-of-speech score lookup table in the Chinese word segmentation system, and recording the part-of-speech score and word frequency of the associated word in the hash table.
  6. The method for extracting locally optimized keywords according to claim 4, wherein after the extracting a first segmented word from the text to be processed and judging whether the first segmented word is a target word in the segmented-word set, the method further comprises:
    when the first segmented word is not a target word in the segmented-word set, judging whether the first segmented word is an associated word of a target word;
    when the first segmented word is determined to be an associated word of a target word, recording the part of speech and word frequency of the first segmented word in the hash table.
  7. The method for extracting locally optimized keywords according to any one of claims 1 to 6, wherein the extracting, according to the weight parameters in the hash table for the keyword part of speech of the target words and the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed comprises:
    obtaining a preset calculation rule, and calculating the total score of each target word and associated word in the hash table, wherein the total score is the word frequency multiplied by the part-of-speech score;
    sorting the total scores in the hash table in descending or ascending order, extracting the five target words and/or associated words with the highest total scores, and taking the extracted target words and/or associated words as the keywords of the text to be processed.
  8. An apparatus for extracting locally optimized keywords, the apparatus comprising:
    an identification unit, configured to receive text to be processed and identify the characters in the title, first paragraph, and last paragraph of the text to be processed;
    an update unit, configured to segment, based on a preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtain a segmented-word set for the title, first paragraph, and last paragraph, and update the part of speech of each target word in the segmented-word set to a keyword part of speech;
    a first recording unit, configured to record, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, wherein the weight parameters are a part-of-speech score and a word frequency;
    a second recording unit, configured to traverse the text to be processed, obtain the associated words of the target words and the parts of speech of the associated words, and record the weight parameters of the associated words in the hash table;
    an extraction unit, configured to extract, according to the weight parameters in the hash table for the keyword part of speech of the target words and the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
  9. The apparatus for extracting locally optimized keywords according to claim 8, wherein the identification unit is specifically configured to:
    receive the text to be processed, and obtain the positions of the space characters in the text to be processed and the number N of space characters, wherein N is greater than 3;
    take the characters between the first space character position and the second space character position as the title of the text to be processed, take the characters between the second space character position and the third space character position as the first paragraph of the text to be processed, and take the characters between the N-(N-1) space character position and the Nth space character position as the last paragraph of the text to be processed;
    invoke a preset character recognition program to identify the characters in the title, first paragraph, and last paragraph.
  10. The apparatus for extracting locally optimized keywords according to claim 9, wherein the update unit is specifically configured to:
    when the characters in the title, first paragraph, and last paragraph have been identified, start the preset Chinese word segmentation system to classify the characters in the title, first paragraph, and last paragraph by part of speech as noun, verb, adjective, preposition, punctuation, measure word, or new word;
    obtain, from the part-of-speech score lookup table in the Chinese word segmentation system, the part-of-speech scores of the characters whose part of speech is noun, verb, adjective, preposition, punctuation, measure word, or new word, and determine the characters whose part-of-speech score is greater than 0 to be target words;
    collect the target words into a segmented-word set, and mark the part of speech of each target word in the segmented-word set as the keyword part of speech.
  11. The apparatus for extracting locally optimized keywords according to claim 10, wherein the first recording unit is specifically configured to:
    retrieve the part-of-speech score lookup table in the preset Chinese word segmentation program, and obtain the score corresponding to the keyword part of speech in the part-of-speech score lookup table;
    use each target word in turn as a search term, look up the word frequency of each target word in the title, the first paragraph, and the last paragraph, and record the score and word frequency of each target word in the hash table.
  12. The apparatus for extracting locally optimized keywords according to claim 11, wherein the second recording unit comprises:
    an identification subunit, configured to traverse the text to be processed using the preset character recognition software and identify the characters in the text to be processed, the preset Chinese word segmentation system segmenting the characters in the text to be processed into a plurality of segmented words;
    a first judging subunit, configured to extract a first segmented word from the text to be processed and judge whether the first segmented word is a target word in the segmented-word set;
    a first determining subunit, configured to, when the first segmented word is a target word in the segmented-word set, determine that the second segmented word preceding the first segmented word and the third segmented word following it are associated words of the target word, and obtain the part of speech and word frequency of the associated words;
    an obtaining subunit, configured to obtain the part-of-speech score corresponding to each associated word by consulting the part-of-speech score lookup table in the Chinese word segmentation system, and record the part-of-speech score and word frequency of the associated word in the hash table.
  13. The apparatus for extracting locally optimized keywords according to claim 11, wherein the apparatus further comprises:
    a second judging subunit, configured to, when the first segmented word is not a target word in the segmented-word set, judge whether the first segmented word is an associated word of a target word;
    a second determining subunit, configured to, when the first segmented word is determined to be an associated word of a target word, record the part of speech and word frequency of the first segmented word in the hash table.
  14. The apparatus for extracting locally optimized keywords according to any one of claims 8 to 13, wherein the extraction unit is specifically configured to:
    obtain a preset calculation rule, and calculate the total score of each target word and associated word in the hash table, wherein the total score is the word frequency multiplied by the part-of-speech score;
    sort the total scores in the hash table in descending or ascending order, extract the five target words and/or associated words with the highest total scores, and take the extracted target words and/or associated words as the keywords of the text to be processed.
  15. A device for extracting locally optimized keywords, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    receiving text to be processed, and identifying the characters in the title, first paragraph, and last paragraph of the text to be processed;
    segmenting, based on a preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtaining a segmented-word set for the title, first paragraph, and last paragraph, and updating the part of speech of each target word in the segmented-word set to a keyword part of speech;
    recording, by means of a part-of-speech score lookup table in the Chinese word segmentation system, the weight parameters corresponding to each target word in a preset hash table, wherein the weight parameters are a part-of-speech score and a word frequency;
    traversing the text to be processed, obtaining the associated words of the target words and the parts of speech of the associated words, and recording the weight parameters of the associated words in the hash table;
    extracting, according to the weight parameters in the hash table for the keyword part of speech of the target words and the parts of speech of the associated words, the five target words and/or associated words with the highest total scores as the keywords of the text to be processed.
  16. The device for extracting locally optimized keywords according to claim 15, wherein, when executing the computer program to receive the text to be processed and recognize the characters in the title, first paragraph, and last paragraph of the text to be processed, the processor performs the following steps:
    receiving the text to be processed, and obtaining the positions of the space characters in the text to be processed and the number N of space characters, wherein N is greater than 3;
    taking the characters between the first space-character position and the second space-character position as the title of the text to be processed, taking the characters between the second space-character position and the third space-character position as the first paragraph of the text to be processed, and taking the characters between the N-(N-1) space-character position and the Nth space-character position as the last paragraph of the text to be processed;
    invoking a preset character recognition program to recognize the characters in the title, first paragraph, and last paragraph.
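A minimal sketch of the space-based splitting in claim 16, under two stated assumptions: the separator is the plain space character U+0020, and the "N-(N-1)" position is read as the (N-1)th space, since only that reading yields a last paragraph distinct from the title. The function name `split_sections` is hypothetical.

```python
def split_sections(text):
    """Split text into (title, first_paragraph, last_paragraph) by spaces."""
    # Positions of every space character in the text.
    positions = [i for i, ch in enumerate(text) if ch == " "]
    n = len(positions)
    if n <= 3:
        raise ValueError("expected more than 3 space characters")

    # Title: between the 1st and 2nd space.
    title = text[positions[0] + 1:positions[1]]
    # First paragraph: between the 2nd and 3rd space.
    first = text[positions[1] + 1:positions[2]]
    # Last paragraph: between the (N-1)th and Nth space (assumed reading).
    last = text[positions[n - 2] + 1:positions[n - 1]]
    return title, first, last


sample = " Title FirstParagraph Middle LastParagraph "
print(split_sections(sample))
```

The sketch raises an error when N is not greater than 3, mirroring the condition stated in the claim.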
  17. The device for extracting locally optimized keywords according to claim 16, wherein, when executing the computer program to segment, based on the preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtain the word segment set of the title, first paragraph, and last paragraph, and update the part of speech of the target word segments in the word segment set to the keyword part of speech, the processor performs the following steps:
    upon recognizing the characters in the title, first paragraph, and last paragraph, starting the preset Chinese word segmentation system to classify those characters by part of speech as nouns, verbs, adjectives, prepositions, punctuation, quantifiers, or new words;
    obtaining, from the part-of-speech score comparison table in the Chinese word segmentation system, the part-of-speech scores of the characters classified as nouns, verbs, adjectives, prepositions, punctuation, quantifiers, or new words, and determining the characters whose part-of-speech score is greater than 0 to be target word segments;
    collecting the target word segments into a word segment set, and marking the part of speech of the target word segments in the word segment set as the keyword part of speech.
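The selection of target word segments in claim 17 might look like the sketch below. It assumes the segmentation system has already produced (word, part of speech) pairs; the score values in `POS_SCORES`, the label `"KEYWORD"`, and the function name are all hypothetical, since the claims do not give the contents of the score comparison table.

```python
# Hypothetical part-of-speech score comparison table.
POS_SCORES = {
    "noun": 3,
    "verb": 2,
    "adjective": 1,
    "new_word": 3,
    "preposition": 0,
    "punctuation": 0,
    "quantifier": 0,
}

KEYWORD_POS = "KEYWORD"  # hypothetical label for the keyword part of speech


def select_target_words(tagged_tokens):
    """Keep tokens whose part-of-speech score is > 0 and relabel them."""
    targets = {}
    for word, pos in tagged_tokens:
        if POS_SCORES.get(pos, 0) > 0:
            # Mark the target word segment with the keyword part of speech.
            targets[word] = KEYWORD_POS
    return targets


tagged = [
    ("patent", "noun"),
    ("of", "preposition"),
    ("extract", "verb"),
    (",", "punctuation"),
]
print(select_target_words(tagged))
```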
  18. The device for extracting locally optimized keywords according to claim 17, wherein, when executing the computer program to record, by means of the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in the preset hash table, the weight parameters being a part-of-speech score and a word frequency, the processor performs the following steps:
    retrieving the part-of-speech score comparison table in the preset Chinese word segmentation system, and obtaining the score value corresponding to the keyword part of speech in the part-of-speech score comparison table;
    using each target word segment in turn as a search condition, looking up the word frequency of each target word segment in the title, the first paragraph, and the last paragraph, and recording the score value and word frequency of each target word segment in the hash table.
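As an illustration of claim 18, the hash table can be modeled as a Python dictionary that maps each target word segment to its score and frequency. The constant `KEYWORD_POS_SCORE`, its value, and the function name `record_weights` are assumptions made for this sketch; the input is assumed to be the already segmented tokens of the title, first paragraph, and last paragraph taken together.

```python
from collections import Counter

# Hypothetical score of the keyword part of speech in the comparison table.
KEYWORD_POS_SCORE = 5


def record_weights(target_words, section_tokens):
    """Build the hash table: word -> (score value, word frequency).

    `section_tokens` holds the tokens of the title, first paragraph,
    and last paragraph combined.
    """
    counts = Counter(section_tokens)
    return {word: (KEYWORD_POS_SCORE, counts[word]) for word in target_words}


tokens = ["keyword", "text", "keyword", "of", "text", "keyword"]
table = record_weights(["text", "keyword"], tokens)
print(table)
```

Counting once with `Counter` avoids rescanning the three sections for every target word segment, while giving the same frequencies as a per-word search.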
  19. The device for extracting locally optimized keywords according to claim 18, wherein, when executing the computer program to traverse the text to be processed, obtain the associated word segments of the target word segments and the parts of speech of the associated word segments, and record the weight parameters of the associated word segments in the hash table, the processor performs the following steps:
    traversing the text to be processed with the preset character recognition program to recognize the characters in the text to be processed, the preset Chinese word segmentation system segmenting those characters into a plurality of word segments;
    extracting a first word segment from the text to be processed, and determining whether the first word segment is a target word segment in the word segment set;
    when the first word segment is a target word segment in the word segment set, determining that the second word segment preceding the first word segment and the third word segment following it are associated word segments of the target word segment, and obtaining the part of speech and word frequency of the associated word segments;
    obtaining the part-of-speech scores corresponding to the associated word segments by looking them up in the part-of-speech score comparison table in the Chinese word segmentation system, and recording the part-of-speech scores and word frequencies of the associated word segments in the hash table.
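The neighbor lookup in claim 19 can be sketched as below, assuming the full text is already available as (word, part of speech) pairs. The score table values and the function name `record_associated` are hypothetical, and two choices are assumptions of this sketch rather than statements from the claim: the word frequency of an associated word segment is counted over the whole text, and entries already in the hash table are left unchanged.

```python
from collections import Counter

# Hypothetical part-of-speech score comparison table.
POS_SCORES = {"noun": 3, "verb": 2, "adjective": 1}


def record_associated(tagged_tokens, target_words, table):
    """Add the neighbors of each target word segment to the hash table."""
    # Assumption: frequency is counted over the whole text.
    freq = Counter(word for word, _ in tagged_tokens)
    for i, (word, _) in enumerate(tagged_tokens):
        if word not in target_words:
            continue
        # The segment before and the segment after are associated segments.
        for j in (i - 1, i + 1):
            if 0 <= j < len(tagged_tokens):
                neighbor, pos = tagged_tokens[j]
                # setdefault keeps entries that were recorded earlier.
                table.setdefault(
                    neighbor, (POS_SCORES.get(pos, 0), freq[neighbor])
                )
    return table


tagged = [
    ("fast", "adjective"),
    ("keyword", "noun"),
    ("extraction", "noun"),
    ("works", "verb"),
    ("keyword", "noun"),
    ("ranking", "noun"),
]
result = record_associated(tagged, {"keyword"}, {"keyword": (5, 2)})
print(result)
```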
  20. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    receiving text to be processed, and recognizing the characters in the title, first paragraph, and last paragraph of the text to be processed;
    segmenting, based on a preset Chinese word segmentation system, the characters in the title, first paragraph, and last paragraph, obtaining the word segment set of the title, first paragraph, and last paragraph, and updating the part of speech of the target word segments in the word segment set to the keyword part of speech;
    recording, by means of the part-of-speech score comparison table in the Chinese word segmentation system, the weight parameters corresponding to each target word segment in a preset hash table, wherein the weight parameters are a part-of-speech score and a word frequency;
    traversing the text to be processed, obtaining the associated word segments of the target word segments and the parts of speech of the associated word segments, and recording the weight parameters of the associated word segments in the hash table;
    extracting, according to the weight parameters recorded in the hash table for the keyword part of speech of the target word segments and for the parts of speech of the associated word segments, the five target word segments and/or associated word segments with the highest total scores as the keywords of the text to be processed.
PCT/CN2019/118273 2019-09-19 2019-11-14 Method and apparatus for extracting locally optimized keywords, device and storage medium WO2021051599A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910884825.7 2019-09-19
CN201910884825.7A CN110765767B (en) 2019-09-19 2019-09-19 Extraction method, device, server and storage medium of local optimization keywords

Publications (1)

Publication Number Publication Date
WO2021051599A1 (en) 2021-03-25

Family

ID=69329805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118273 WO2021051599A1 (en) 2019-09-19 2019-11-14 Method and apparatus for extracting locally optimized keywords, device and storage medium

Country Status (2)

Country Link
CN (1) CN110765767B (en)
WO (1) WO2021051599A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282092A (en) * 2021-12-07 2022-04-05 咪咕音乐有限公司 Information processing method, device, equipment and computer readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378141A (en) * 2021-08-12 2021-09-10 明品云(北京)数据科技有限公司 Text data transmission method, system, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013020439A (en) * 2011-07-11 2013-01-31 Nec Corp Synonym extraction system, method and program
US20140101243A1 (en) * 2012-10-05 2014-04-10 Facebook, Inc. Method and apparatus for identifying common interest between social network users
CN110069599A (en) * 2019-03-13 2019-07-30 平安城市建设科技(深圳)有限公司 Search method, device, equipment and readable storage medium storing program for executing based on approximate word

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239455B (en) * 2016-03-28 2021-06-11 阿里巴巴集团控股有限公司 Core word recognition method and device
CN108304378B (en) * 2018-01-12 2019-09-24 深圳壹账通智能科技有限公司 Text similarity computing method, apparatus, computer equipment and storage medium
CN109086355B (en) * 2018-07-18 2022-05-17 北京航天云路有限公司 Hot-spot association relation analysis method and system based on news subject term
CN109635273B (en) * 2018-10-25 2023-04-25 平安科技(深圳)有限公司 Text keyword extraction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110765767B (en) 2024-01-19
CN110765767A (en) 2020-02-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946138

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946138

Country of ref document: EP

Kind code of ref document: A1