CN113051890A - Method for processing domain feature keywords and related device - Google Patents

Publication number
CN113051890A
Authority
CN
China
Prior art keywords
text
corpus
domain feature
word
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911377806.1A
Other languages
Chinese (zh)
Inventor
童陈敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201911377806.1A
Publication of CN113051890A

Abstract

The invention provides a method and a related device for processing domain feature keywords. The method first obtains a present-class text corpus and a contrast-class text corpus; the present-class text corpus is the text corpus to be processed and contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords. The present-class text corpus is then processed into a long word set, and the text features of each long word in the set are obtained. Finally, the domain feature keywords in the long word set are determined by using the text features. Because the text features combine several factors that affect the accuracy of domain keywords, and the contrast-class text corpus is used to screen the domain keywords in the present-class text corpus, the accuracy of extracting domain feature keywords is greatly improved and the workload of subsequent work is reduced.

Description

Method for processing domain feature keywords and related device
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a related device for processing domain feature keywords.
Background
Keyword extraction is a common technique in fields such as information retrieval, data induction analysis, and data auditing.
Keywords that express the characteristics of a domain are usually words specific to that domain, and they should differ between domains. Common keyword extraction methods cannot extract domain feature keywords accurately, which increases the workload of subsequent work that relies on those keywords, such as information retrieval, data induction analysis, and data auditing.
The prior art therefore lacks a technical solution that can accurately extract domain feature keywords and thereby reduce the workload of subsequent work.
Disclosure of Invention
In view of the above problems, the present invention provides a method and a related apparatus for processing domain feature keywords, which overcome the above problems or at least partially solve the above problems, so as to accurately extract domain feature keywords to reduce the workload of subsequent work.
In order to achieve the above purpose, the technical solutions disclosed in the embodiments of the present invention are as follows:
a method for processing domain feature keywords comprises the following steps:
obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is a text corpus to be processed that contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
processing the present-class text corpus into a long word set;
obtaining text features of each long word in the long word set; the text features represent the number of occurrences, the frequency and/or the length of the long word in the present-class text corpus and the contrast-class text corpus;
and determining the domain feature keywords in the long word set by using the text features.
Preferably, processing the present-class text corpus into a long word set includes:
performing word segmentation on the present-class text corpus to obtain a present-class text corpus keyword set;
and splicing the words in the present-class text corpus keyword set according to a splicing rule to obtain the long word set.
Preferably, the text features include:
a present-class document count, representing the number of documents of the present-class text corpus in which each long word occurs, and a contrast-class document count, representing the number of documents of the contrast-class text corpus in which each long word occurs;
a present-class corpus word frequency, representing the maximum number of times each long word occurs in any single document of the present-class text corpus, and a contrast-class corpus word frequency, representing the maximum number of times each long word occurs in any single document of the contrast-class text corpus;
and/or,
a word length, representing the length of each long word.
Preferably, the determining the domain feature keywords in the long word set by using the text features includes:
obtaining a first difference between the present-class document count and the contrast-class document count;
obtaining a second difference between the present-class corpus word frequency and the contrast-class corpus word frequency;
taking the first difference, the second difference and the word length as input parameters, and calling a preset scoring formula to obtain a comprehensive lead score;
and determining the domain feature keywords according to the comprehensive lead score.
Preferably, the determining the domain feature keyword according to the comprehensive lead score includes:
calling a preset correction formula to correct the comprehensive lead score to obtain a final score;
and determining the domain feature keywords according to the final scores.
Preferably, the preset scoring formula specifically includes:
docBias=max{baseDocNum-otherDocNum,0};
wordBias=max{baseWrodNum-otherWordNum,0};
[Formula image BDA0002341450950000021, which defines biasScore in terms of docBias, wordBias and length, is not reproduced in this text.]
wherein docBias represents the first difference, wordBias represents the second difference, baseDocNum represents the present-class document count, otherDocNum represents the contrast-class document count, baseWrodNum represents the present-class corpus word frequency, otherWordNum represents the contrast-class corpus word frequency, biasScore represents the comprehensive lead score, and length represents the word length.
Preferably, the preset correction formula specifically includes:
[Formula image BDA0002341450950000031, which defines score in terms of biasScore and length, is not reproduced in this text.]
wherein score represents the final score, biasScore represents the comprehensive lead score, and length represents the word length.
Another aspect of the present invention provides a device for processing domain feature keywords, including:
the first obtaining module is used for obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is a text corpus to be processed that contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
the processing module is used for processing the present-class text corpus into a long word set;
the second obtaining module is used for obtaining the text features of each long word in the long word set; the text features represent the number of occurrences, the frequency and/or the length of the long word in the present-class text corpus and the contrast-class text corpus;
and the determining module is used for determining the domain feature keywords in the long word set by using the text features.
Another aspect of the invention provides an apparatus comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory communicate with each other through the bus; the processor is used for calling the program instructions in the memory to execute the above method for processing domain feature keywords.
The invention further provides a storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the method for processing the domain feature keywords is implemented.
By means of the technical scheme, the method and the related device for processing the domain feature keywords provided by the invention are characterized in that firstly, the method obtains the text corpus of the type and the text corpus of the contrast type; the text corpus of the type is a text corpus to be processed containing domain feature keywords, and the contrast type text corpus is a text corpus not containing the domain feature keywords; then processing the text corpus of the type into a long word set; then obtaining the text characteristics of each long word in the long word set; the text features represent the occurrence frequency, frequency and/or length of long words in the corpus and the contrast corpus; and finally, determining the domain feature keywords in the long word set by using the text features. In the embodiment of the invention, the text characteristics of the text corpora can be utilized to synthesize various factors influencing the accuracy of the domain keywords, and the domain keywords in the text corpora can be screened by contrasting the text corpora, so that the accuracy of extracting the domain characteristic keywords is greatly improved, and the workload of subsequent work is reduced.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart illustrating a method for processing a domain feature keyword according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a domain feature keyword processing apparatus according to an embodiment of the present invention;
fig. 3 shows a schematic structural diagram of an apparatus provided by the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The method and the device are mainly applied to processing domain feature keywords. Existing keyword extraction methods do not extract domain keywords accurately. For example, in the energy field, "China Oil and Gas Group Co., Ltd." is one of the domain keywords, but an existing extraction method may instead extract words such as "China" and "natural gas", which cannot represent the domain. This leads to problems such as inaccurate results in the subsequent analysis of text data.
Therefore, the invention provides a method and a related device for processing domain feature keywords, which accurately extract the domain feature keywords and thereby avoid problems in subsequent processing.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for processing a domain feature keyword according to an embodiment of the present invention.
The invention provides a method for processing domain feature keywords, which comprises the following steps:
S101, obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is a text corpus to be processed that contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
In the embodiment of the present invention, step S101 first obtains the present-class text corpus and the contrast-class text corpus.
The text corpus may be obtained by parsing a document corpus prepared in advance, or may be a text corpus entered by a user.
The embodiment of the present invention provides a contrast-class text corpus that does not include any domain keywords. The contrast-class text corpus may be compiled in advance, for example by a statistical algorithm, or may be entered based on experience. Its role is to exclude, as far as possible, non-domain feature keywords from the result.
In the embodiment of the present invention, the document corpus prepared in advance may consist of text documents in multiple formats, for example PDF, DOC, PPT, XLS, and TXT. These text documents form the present-class document corpus, that is, the corpus set from which domain feature keywords need to be extracted.
In the embodiment of the invention, the present-class document corpus can be parsed to obtain the present-class text corpus, which may comprise multiple documents.
It is understood that the contrast-class text corpus may likewise be obtained by parsing a contrast-class document corpus, which may also consist of text documents in the multiple formats described above and may also include multiple documents.
In the embodiment of the invention, document parsing can be carried out on text documents in different formats, that is, on the present-class document corpus and/or the contrast-class document corpus, so that multi-format documents can be parsed.
The specific parsing process may include:
The text format of the text document is determined, for example PDF, DOC, PPT, XLS, or TXT; the document may also be in a picture format that contains text, for example JPG or BMP.
The determining of the text format of the text document may include reading the text document in a binary form to obtain binary characteristic information of the text document, searching the binary characteristic information in a preset binary characteristic information library to obtain a document format corresponding to the binary characteristic information, and using the document format as the text format of the text document.
Thus, the embodiment of the invention can directly identify the document format of the text document in any format.
Then, an analysis engine corresponding to the document format is called to obtain an analysis result of the text document.
It is to be understood that the embodiment of the present invention does not specifically limit the parsing method, as long as the content of the text document can be parsed.
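As an illustration only, the format determination and engine dispatch described above can be sketched in Python as follows. The signature table, the function names `detect_format` and `parse_document`, and the fallback of decoding unrecognized data as UTF-8 text are assumptions made for this sketch and are not part of the disclosed embodiment. The byte signatures listed are the commonly documented leading bytes of these file formats.

```python
from typing import Callable, Dict, Optional

# Hypothetical "binary characteristic information library":
# leading bytes of a file mapped to a format label.
SIGNATURES = [
    (b"%PDF", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2"),  # legacy DOC/PPT/XLS container
    (b"PK\x03\x04", "ZIP"),                          # e.g. DOCX/PPTX/XLSX container
    (b"\xff\xd8\xff", "JPG"),
    (b"BM", "BMP"),
]


def detect_format(data: bytes) -> Optional[str]:
    """Return the label whose signature matches the leading bytes, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def parse_document(data: bytes, engines: Dict[str, Callable[[bytes], str]]) -> str:
    """Dispatch to the parsing engine registered for the detected format."""
    fmt = detect_format(data)
    if fmt is None:
        # Assumption: data without a known signature is treated as plain text.
        return data.decode("utf-8", errors="replace")
    if fmt not in engines:
        raise ValueError(f"no parsing engine registered for format {fmt}")
    return engines[fmt](data)
```

A two-byte signature such as that of BMP can also match ordinary text that happens to begin with the same bytes, so a practical signature library would likely check longer or additional fields.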
S102, processing the present-class text corpus into a long word set;
In the embodiment of the invention, the present-class text corpus is processed into a long word set.
Processing the present-class text corpus into a long word set comprises:
performing word segmentation on the present-class text corpus to obtain a present-class text corpus keyword set;
and splicing the words in the present-class text corpus keyword set according to a splicing rule to obtain the long word set.
In the embodiment of the invention, a word segmentation tool is used to segment the present-class text corpus so as to obtain the original keyword set of the present-class text corpus.
In the embodiment of the invention, the words in the present-class text corpus keyword set are spliced according to a splicing rule. For example, part-of-speech analysis is performed on the original keyword set to identify parts of speech such as verbs, nouns and conjunctions, and adjacent nouns and verbal nouns are spliced to form the long word set.
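A minimal sketch of such a splicing rule is given below, assuming that word segmentation and part-of-speech tagging have already produced a list of (word, tag) pairs. The tag labels "n" (noun) and "vn" (verbal noun), the function name `splice_long_words`, and the `min_parts` parameter (which keeps only runs of at least two adjacent words) are assumptions of this sketch, not details given in the embodiment.

```python
from typing import List, Tuple

# Assumed part-of-speech labels for nouns and verbal nouns.
SPLICEABLE_TAGS = {"n", "vn"}


def splice_long_words(
    tagged: List[Tuple[str, str]], sep: str = "", min_parts: int = 2
) -> List[str]:
    """Join each run of adjacent nouns / verbal nouns into one long word."""
    long_words: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # Emit the current run if it is long enough, then start a new run.
        if len(run) >= min_parts:
            long_words.append(sep.join(run))
        run.clear()

    for word, pos in tagged:
        if pos in SPLICEABLE_TAGS:
            run.append(word)
        else:
            flush()
    flush()
    return long_words
```

For Chinese text the default empty separator concatenates the words directly; a space separator is more natural for space-delimited languages.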
In the embodiment of the invention, the part-of-speech analysis and splicing can be packaged into a splicing program; the splicing program is called directly with the present-class text corpus keyword set as the input parameter, and directly outputs a result that includes the long word set.
S103, obtaining text features of each long word in the long word set; the text features represent the number of occurrences, the frequency and/or the length of the long word in the present-class text corpus and the contrast-class text corpus;
In the embodiment of the invention, the text features of each long word in the long word set are obtained.
The text features represent the number of occurrences, the frequency and/or the length of the long word in the present-class text corpus and the contrast-class text corpus.
The text features include:
a present-class document count, representing the number of documents of the present-class text corpus in which each long word occurs, and a contrast-class document count, representing the number of documents of the contrast-class text corpus in which each long word occurs;
a present-class corpus word frequency, representing the maximum number of times each long word occurs in any single document of the present-class text corpus, and a contrast-class corpus word frequency, representing the maximum number of times each long word occurs in any single document of the contrast-class text corpus;
and/or,
a word length, representing the length of each long word.
In the embodiment of the invention, the number of documents of the present-class text corpus in which each long word occurs is counted as the present-class document count of that word.
For example, suppose the present-class text corpus contains N documents and the long word set includes "China Oil and Gas Company". If the word appears in each of the N documents, the present-class document count of this long word is N.
In this dimension, a word that occurs in every document has a high probability of being a domain feature keyword.
In another dimension, the number of occurrences of the word in each document is counted, and the largest value is taken as the corpus word frequency.
Continuing the above example, if "China Oil and Gas Company" appears 8 times in one document and 10 times in another, the corpus word frequency of the word is 10.
Since a domain keyword is usually not very short, the word length is also obtained as another dimension to consider.
For example, "China Oil and Gas Company" contains 9 characters in the original Chinese, so 9 may be used as the word length of the word.
Text features in three different dimensions are thus obtained in the embodiment of the invention.
It can be understood that text features in more dimensions can be obtained according to actual needs, which is not described herein again.
It will be appreciated that the text features of the long words in the long word set with respect to the contrast-class text corpus can be obtained by the same process, which is not described in detail here.
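The three feature dimensions described above can be sketched as follows. This is a simplified illustration under stated assumptions: each document is represented as a plain string, occurrences are counted with non-overlapping substring matching, and the names `TextFeatures`, `doc_count`, `max_frequency` and `extract_features` are hypothetical. The field names mirror the symbols used in the scoring formula, with conventional spelling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFeatures:
    base_doc_num: int    # present-class document count
    other_doc_num: int   # contrast-class document count
    base_word_num: int   # present-class corpus word frequency
    other_word_num: int  # contrast-class corpus word frequency
    length: int          # word length in characters


def doc_count(word: str, docs: List[str]) -> int:
    """Number of documents in which the word occurs at least once."""
    return sum(1 for doc in docs if word in doc)


def max_frequency(word: str, docs: List[str]) -> int:
    """Largest per-document occurrence count of the word (0 if no documents)."""
    return max((doc.count(word) for doc in docs), default=0)


def extract_features(word: str, base_docs: List[str], other_docs: List[str]) -> TextFeatures:
    """Collect the document count, corpus word frequency and length of one long word."""
    return TextFeatures(
        base_doc_num=doc_count(word, base_docs),
        other_doc_num=doc_count(word, other_docs),
        base_word_num=max_frequency(word, base_docs),
        other_word_num=max_frequency(word, other_docs),
        length=len(word),
    )
```

With a word that appears 8 times in one present-class document and 10 times in another, `base_word_num` is 10, matching the example in the text.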
S104, determining the domain feature keywords in the long word set by using the text features.
In the embodiment of the invention, the domain feature keywords are determined by using the text features of the above dimensions and are screened by using the contrast-class text corpus, so as to obtain more accurate domain feature keywords.
In the embodiment of the invention, long words that rank in the top N by present-class document count and whose contrast-class document count is zero can be used as the domain feature keywords.
Long words that rank in the top N by present-class corpus word frequency and whose contrast-class corpus word frequency is zero can also be used as the domain feature keywords.
A long word with a word length of N can also be used directly as a domain feature keyword.
Of course, a result determined from a single dimension may not be accurate. To further improve accuracy, the embodiment of the present invention may also combine these dimensions.
The top N long words by present-class document count, sorted in descending order, can be obtained; the first M of these N long words, ranked in descending order of the present-class document count, are then determined and used as the first candidate domain feature keywords.
Likewise, the top N long words by present-class corpus word frequency, sorted in descending order, can be obtained; the first M of these N long words, ranked in ascending order of the contrast-class corpus word frequency, are then determined and used as the second candidate domain feature keywords.
In the embodiment of the invention, candidates whose word length is smaller than a preset length are discarded from the first and second candidate domain feature keywords, and the X long words that remain are used as the domain feature keywords.
It can be understood that the method for determining the domain feature keyword by using one or more dimensions is not particularly limited as long as the domain feature keyword is determined by using the text feature.
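One of the single-dimension rules above, combined with the word-length filter, might be sketched as follows. The function name `select_by_doc_count`, the input layout (a mapping from long word to a pair of present-class and contrast-class document counts), and the order of operations (take the top N first, then drop words with a non-zero contrast-class count or insufficient length) are assumptions of this sketch; the embodiment does not fix these details.

```python
from typing import Dict, List, Tuple


def select_by_doc_count(
    doc_counts: Dict[str, Tuple[int, int]], n: int, min_length: int = 0
) -> List[str]:
    """Pick long words that lead in the present class and are absent from the contrast class.

    doc_counts maps each long word to
    (present-class document count, contrast-class document count).
    """
    # Rank by present-class document count, largest first, and keep the top n.
    ranked = sorted(doc_counts, key=lambda w: doc_counts[w][0], reverse=True)[:n]
    # Keep words never seen in the contrast class and at least min_length long.
    return [w for w in ranked if doc_counts[w][1] == 0 and len(w) >= min_length]
```

The same shape applies to the corpus word frequency dimension by passing frequency pairs instead of document count pairs.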
It can be seen that, in the embodiment of the present invention, a plurality of factors affecting the accuracy of the domain keyword may be synthesized by using the text features of the text corpus, for example, the number of times, frequency and/or length of the long word appearing in the text corpus and the comparison text corpus, and the comparison text corpus may screen the domain keyword in the text corpus, so that the accuracy of extracting the domain feature keyword is greatly improved, the workload of subsequent work such as the retrieval workload of information retrieval, the analysis workload of data induction analysis, the auditing workload of data auditing work, and the like is reduced, and the work efficiency of the subsequent work and the accuracy of the processing result are improved.
In the embodiments of the present invention, the accuracy of the results obtained by the methods of the previous embodiments is still not optimal. Based on the above, the invention preferably further provides a determination mode for determining the domain feature keywords in the long word set by using the text features.
Specifically, the determining the domain feature keywords in the long term set by using the text features includes:
obtaining a first difference between the present-class document count and the contrast-class document count;
obtaining a second difference between the present-class corpus word frequency and the contrast-class corpus word frequency;
taking the first difference, the second difference and the word length as input parameters, and calling a preset scoring formula to obtain a comprehensive lead score;
and determining the domain feature keywords according to the comprehensive lead score.
Determining the domain feature keywords according to the comprehensive lead score comprises:
calling a preset correction formula to correct the comprehensive lead score to obtain a final score;
and determining the domain feature keywords according to the final scores.
The preset scoring formula specifically includes:
docBias=max{baseDocNum-otherDocNum,0};
wordBias=max{baseWrodNum-otherWordNum,0};
[Formula image BDA0002341450950000091, which defines biasScore in terms of docBias, wordBias and length, is not reproduced in this text.]
wherein docBias represents the first difference, wordBias represents the second difference, baseDocNum represents the present-class document count, otherDocNum represents the contrast-class document count, baseWrodNum represents the present-class corpus word frequency, otherWordNum represents the contrast-class corpus word frequency, biasScore represents the comprehensive lead score, and length represents the word length.
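The two difference terms given explicitly above can be written out directly. The sketch below covers only docBias and wordBias; it does not attempt the biasScore expression, which is not available in this text. The function and parameter names are hypothetical and use conventional spelling for the word-frequency parameter.

```python
def doc_bias(base_doc_num: int, other_doc_num: int) -> int:
    """docBias = max{baseDocNum - otherDocNum, 0}."""
    return max(base_doc_num - other_doc_num, 0)


def word_bias(base_word_num: int, other_word_num: int) -> int:
    """wordBias = max{baseWrodNum - otherWordNum, 0}."""
    return max(base_word_num - other_word_num, 0)
```

Clamping at zero means that a long word which is more widespread in the contrast class than in the present class contributes no lead at all, rather than a negative one.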
The preset correction formula specifically includes:
[Formula image BDA0002341450950000101, which defines score in terms of biasScore and length, is not reproduced in this text.]
wherein score represents the final score, biasScore represents the comprehensive lead score, and length represents the word length.
In the embodiment of the invention, in order to make the obtained result more accurate, a process of determining the domain keyword according to the final score is introduced.
In the embodiment of the invention, a scoring formula and a correction formula are preset. The scoring formula is a related algorithm for calculating the text features as input parameters to obtain scores.
According to the above embodiments, the text features obtained in the embodiments of the present invention may be the present-class document count and the contrast-class document count, which relate to the number of documents; the present-class corpus word frequency and the contrast-class corpus word frequency, which relate to word frequency; and the word length. The final score is obtained by a comprehensive calculation over at least these dimensions.
In the embodiment of the invention, a first difference between the present-class document count and the contrast-class document count is obtained; a second difference between the present-class corpus word frequency and the contrast-class corpus word frequency is obtained; the first difference, the second difference and the word length are taken as input parameters, and a preset scoring formula is called to obtain a comprehensive lead score; a preset correction formula is called to correct the comprehensive lead score to obtain a final score; and the domain feature keywords are determined according to the final score.
The first difference is the difference between the present-class document count and the contrast-class document count; in the embodiment of the present invention, the first difference cannot be less than 0. The second difference is the difference between the present-class corpus word frequency and the contrast-class corpus word frequency; similarly, it cannot be less than 0.
The first difference and the second difference cannot be less than 0 because a negative raw difference means that the long word cannot be a domain feature keyword; the scoring formula therefore takes the maximum of the raw difference and 0.
For example, suppose "China" has a present-class document count of N and a contrast-class document count of 2N. The number of contrast-class documents in which it appears clearly exceeds the number in the text corpus from which domain feature keywords are to be extracted, so "China" cannot represent the domain features and cannot be used as a domain feature keyword.
In the embodiment of the invention, the first difference has the highest priority, and the second difference and the word length assist in determining the final score. It can be understood that, on the premise that the first difference carries the highest weight: the higher the value of the first difference, the higher the score; the longer the word length, the higher the score; and the influence of both on the comprehensive score is controlled within the range (0, 1).
In addition, in the embodiment of the invention, long words with a low comprehensive lead score and short words can be penalized.
If the comprehensive lead score (biasScore) is not greater than 5, or the word length is not greater than 5, the final score takes the minimum of the comprehensive lead score and the word length.
In addition, when two words of equal length both have a score equal to the word length, a fine adjustment factor is subtracted to ensure that the score increases monotonically with biasScore.
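The penalty rule stated above can be sketched as follows. Only the penalized branch is taken from the text. The correction applied when neither threshold is triggered is not available here, so it is passed in as a caller-supplied function `unpenalized`, which is purely a placeholder of this sketch. The fine adjustment factor is likewise omitted because its value is not given, and the function name `final_score` is hypothetical.

```python
from typing import Callable


def final_score(
    bias_score: float,
    length: int,
    unpenalized: Callable[[float, int], float],
) -> float:
    """Apply the low-lead / short-word penalty described in the text."""
    if bias_score <= 5 or length <= 5:
        # Penalized case: the score is capped by the smaller of the two values.
        return min(bias_score, length)
    # Placeholder for the correction formula, which is not reproduced in the text.
    return unpenalized(bias_score, length)
```

For instance, a 9-character word with a comprehensive lead score of 3 receives a final score of 3, and a 4-character word with a comprehensive lead score of 12 receives a final score of 4.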
Therefore, the embodiment of the invention provides a method for determining the domain feature keywords in the long word set more accurately by using the text features. On the basis of ensuring that the score monotonically rises along with the comprehensive lead score, various factors influencing the field characteristic keywords are synthesized, the field characteristic keywords are more accurately obtained, the accuracy is improved, the workload of subsequent work is reduced, and the work efficiency is improved.
Corresponding to the processing method of the domain feature keyword, the embodiment of the invention also provides a processing device of the domain feature keyword.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a device for processing a domain feature keyword according to an embodiment of the present invention.
The invention provides a processing device of domain feature keywords, which comprises:
the first obtaining module 1 is used for obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is a text corpus to be processed that contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
the processing module 2 is used for processing the present-class text corpus into a long word set;
the second obtaining module 3 is used for obtaining the text features of each long word in the long word set; the text features represent the number of occurrences, the frequency and/or the length of the long word in the present-class text corpus and the contrast-class text corpus;
and the determining module 4 is used for determining the domain feature keywords in the long word set by using the text features.
It can be understood that, for the function implementation of each unit in the device for processing a domain feature keyword disclosed in the embodiment of the present invention, reference may be made to each step in the method for processing a domain feature keyword in the foregoing embodiment, which is not described herein again.
According to the field characteristic keyword processing device, various factors influencing the accuracy of the field keywords can be integrated by utilizing the text characteristics of the text corpus, and the field keywords in the text corpus can be screened by contrasting the text corpus, so that the accuracy of extracting the field characteristic keywords is greatly improved, the workload of subsequent work is reduced, and the work efficiency is improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an apparatus provided by the present invention.
The invention provides a device, which comprises at least one processor 701, at least one memory 702 connected with the processor, and a bus 703; the processor 701 and the memory 702 complete communication with each other through the bus 703; the processor 701 is configured to call the program instructions in the memory 702 to execute a method for processing the domain feature keyword as described above.
The invention also provides a storage medium, wherein the storage medium stores computer-executable instructions, and the computer-executable instructions are loaded and executed by a processor to realize the processing method of the domain feature keywords introduced in the embodiment.
It can be understood that, for the specific implementation of the method for processing domain feature keywords executed by the device and the storage medium, reference may be made to the steps of the method in the foregoing embodiments, which are not described herein again.
It can be seen that, with the device provided by the invention, the text features of the present-class text corpus integrate multiple factors that affect the accuracy of the domain keywords, and the domain keywords in the present-class text corpus are screened by comparison against the contrast-class text corpus, so that the accuracy of extracting the domain feature keywords is greatly improved, the workload of subsequent work is reduced, and work efficiency is improved.
The device for processing the domain feature keywords comprises a processor and a memory, wherein the obtaining unit, the determining unit, the associating unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting the kernel parameters, the accuracy of extracting the domain feature keywords is improved, the workload of subsequent work is reduced, and work efficiency is improved.
The embodiment of the invention provides a storage medium, wherein a program is stored on the storage medium, and the program realizes the processing method of the domain feature keywords when being executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the processing method of the domain feature keywords is executed when the program runs.
The embodiment of the invention provides a device, which comprises at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call the program instructions in the memory to execute the above method for processing domain feature keywords. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is the text corpus to be processed, which contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
processing the present-class text corpus into a long word set;
obtaining the text features of each long word in the long word set; the text features characterize the number of occurrences, the frequency and/or the length of the long words in the present-class text corpus and the contrast-class text corpus;
and determining the domain feature keywords in the long word set by using the text features.
Preferably, the processing the present-class text corpus into a long word set includes:
performing word segmentation on the present-class text corpus to obtain a text corpus keyword set;
and splicing the words in the text corpus keyword set according to a splicing rule to obtain the long word set.
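The splicing step can be sketched as follows. The splicing rule is not specified in this passage, so the sketch assumes one plausible rule: joining runs of 2 to `max_parts` adjacent segmented words. The function name `build_long_words` and the parameters `max_parts` and `joiner` are hypothetical, and the token list is assumed to come from an upstream word segmenter.

```python
from typing import List, Set


def build_long_words(tokens: List[str], max_parts: int = 3, joiner: str = "") -> Set[str]:
    """Splice runs of 2..max_parts adjacent tokens into candidate long words.

    Assumed splicing rule (not from the source): adjacency only.
    """
    long_words: Set[str] = set()
    for start in range(len(tokens)):
        for size in range(2, max_parts + 1):
            end = start + size
            if end > len(tokens):
                break  # no longer run fits from this start position
            long_words.add(joiner.join(tokens[start:end]))
    return long_words


# Tokens as they might come out of word segmentation (illustrative input).
tokens = ["domain", "feature", "keyword"]
print(sorted(build_long_words(tokens, max_parts=3, joiner=" ")))
# ['domain feature', 'domain feature keyword', 'feature keyword']
```

For Chinese text the default `joiner=""` concatenates the segmented words directly; a space is used above only because the example tokens are English.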
Preferably, the text features include:
a present-class document count, characterizing the number of documents of the present-class text corpus in which each long word occurs, and a contrast-class document count, characterizing the number of documents of the contrast-class text corpus in which each long word occurs;
a present-class corpus word frequency, characterizing the maximum number of occurrences of each long word in any single document of the present-class text corpus, and a contrast-class corpus word frequency, characterizing the maximum number of occurrences of each long word in any single document of the contrast-class text corpus;
and/or,
a word length, characterizing the length of each long word.
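A minimal sketch of how these features might be computed is given here, assuming each corpus is a list of document strings, that a long word "occurs" in a document when it appears as a substring, and that word length is measured in characters. These are assumptions for illustration; the names `TextFeatures`, `doc_count`, `max_doc_freq` and `extract_features` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextFeatures:
    base_doc_num: int    # present-class documents containing the long word
    other_doc_num: int   # contrast-class documents containing the long word
    base_word_num: int   # max occurrences in any single present-class document
    other_word_num: int  # max occurrences in any single contrast-class document
    length: int          # word length (assumed: number of characters)


def doc_count(word: str, docs: List[str]) -> int:
    """Number of documents in which the word appears at least once."""
    return sum(1 for doc in docs if word in doc)


def max_doc_freq(word: str, docs: List[str]) -> int:
    """Largest non-overlapping occurrence count of the word in any one document."""
    return max((doc.count(word) for doc in docs), default=0)


def extract_features(word: str, base_docs: List[str], other_docs: List[str]) -> TextFeatures:
    return TextFeatures(
        base_doc_num=doc_count(word, base_docs),
        other_doc_num=doc_count(word, other_docs),
        base_word_num=max_doc_freq(word, base_docs),
        other_word_num=max_doc_freq(word, other_docs),
        length=len(word),
    )


base_docs = ["deep learning model and deep learning data", "deep learning"]
other_docs = ["learning to cook"]
features = extract_features("deep learning", base_docs, other_docs)
print(features)
```

In this example the long word occurs in both present-class documents, at most twice in a single one, and in no contrast-class document.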
Preferably, the determining the domain feature keywords in the long word set by using the text features includes:
obtaining a first difference between the present-class document count and the contrast-class document count;
obtaining a second difference between the present-class corpus word frequency and the contrast-class corpus word frequency;
taking the first difference, the second difference and the word length as input parameters, and calling a preset scoring formula to obtain a comprehensive lead score;
and determining the domain feature keywords according to the comprehensive lead score.
Preferably, the determining the domain feature keyword according to the comprehensive lead score includes:
calling a preset correction formula to correct the comprehensive lead score to obtain a final score;
and determining the domain feature keywords according to the final scores.
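How the final scores are turned into a keyword list is not detailed in this passage. The sketch below assumes one common choice, keeping the highest-scoring long words above a minimum score; the function name `select_keywords` and the parameters `top_k` and `min_score` are hypothetical, and the scores in the usage example are made-up illustrative values.

```python
from typing import Dict, List


def select_keywords(final_scores: Dict[str, float], top_k: int = 10,
                    min_score: float = 0.0) -> List[str]:
    """Return up to top_k long words whose final score exceeds min_score.

    Assumed selection rule (not from the source): rank by score, descending,
    breaking ties alphabetically so the result is deterministic.
    """
    ranked = sorted(final_scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, score in ranked if score > min_score][:top_k]


# Illustrative scores only.
scores = {"neural network": 4.2, "of the": 0.0, "gradient descent": 4.2, "loss": 1.5}
print(select_keywords(scores, top_k=2))
# ['gradient descent', 'neural network']
```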
Preferably, the preset scoring formula specifically includes:
docBias = max{baseDocNum - otherDocNum, 0};
wordBias = max{baseWrodNum - otherWordNum, 0};
[Formula image BDA0002341450950000141 not reproduced: expression for biasScore in terms of docBias, wordBias and length]
wherein docBias represents the first difference, wordBias represents the second difference, baseDocNum represents the present-class document count, otherDocNum represents the contrast-class document count, baseWrodNum represents the present-class corpus word frequency, otherWordNum represents the contrast-class corpus word frequency, biasScore represents the comprehensive lead score, and length represents the word length.
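The two clamped differences can be written directly in Python. This sketch covers only docBias and wordBias, which are given in text form; the biasScore expression is contained in a formula image that is not reproduced here, so it is deliberately left out rather than guessed. The function name `lead_differences` and the snake_case parameter names are hypothetical (`base_word_num` corresponds to `baseWrodNum` in the text).

```python
from typing import Tuple


def lead_differences(base_doc_num: int, other_doc_num: int,
                     base_word_num: int, other_word_num: int) -> Tuple[int, int]:
    """Return (docBias, wordBias), each clamped at zero.

    A long word that is no more prominent in the present-class corpus than in
    the contrast-class corpus therefore contributes a difference of 0.
    """
    doc_bias = max(base_doc_num - other_doc_num, 0)
    word_bias = max(base_word_num - other_word_num, 0)
    return doc_bias, word_bias


print(lead_differences(12, 3, 5, 7))  # (9, 0)
```

Clamping at zero means that a lead of the contrast-class corpus never produces a negative difference; only a lead of the present-class corpus is carried forward into the score.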
Preferably, the preset correction formula specifically includes:
[Formula image BDA0002341450950000142 not reproduced: expression for score in terms of biasScore and length]
wherein the score characterizes a final score, the biasScore characterizes a composite lead score, and the length characterizes a word length.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include, among computer-readable media, volatile memory, Random Access Memory (RAM) and/or non-volatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for processing domain feature keywords, characterized by comprising the following steps:
obtaining a present-class text corpus and a contrast-class text corpus; the present-class text corpus is the text corpus to be processed, which contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
processing the present-class text corpus into a long word set;
obtaining the text features of each long word in the long word set; the text features characterize the number of occurrences, the frequency and/or the length of the long words in the present-class text corpus and the contrast-class text corpus;
and determining the domain feature keywords in the long word set by using the text features.
2. The processing method according to claim 1, wherein said processing the present-class text corpus into a long word set comprises:
performing word segmentation on the present-class text corpus to obtain a text corpus keyword set;
and splicing the words in the text corpus keyword set according to a splicing rule to obtain the long word set.
3. The processing method according to claim 1, wherein the text features comprise:
a present-class document count, characterizing the number of documents of the present-class text corpus in which each long word occurs, and a contrast-class document count, characterizing the number of documents of the contrast-class text corpus in which each long word occurs;
a present-class corpus word frequency, characterizing the maximum number of occurrences of each long word in any single document of the present-class text corpus, and a contrast-class corpus word frequency, characterizing the maximum number of occurrences of each long word in any single document of the contrast-class text corpus;
and/or,
a word length, characterizing the length of each long word.
4. The processing method according to claim 3, wherein the determining the domain feature keywords in the long word set by using the text features comprises:
obtaining a first difference between the present-class document count and the contrast-class document count;
obtaining a second difference between the present-class corpus word frequency and the contrast-class corpus word frequency;
taking the first difference, the second difference and the word length as input parameters, and calling a preset scoring formula to obtain a comprehensive lead score;
and determining the domain feature keywords according to the comprehensive lead score.
5. The processing method according to claim 4, wherein said determining the domain feature keywords according to the comprehensive lead score comprises:
calling a preset correction formula to correct the comprehensive lead score to obtain a final score;
and determining the domain feature keywords according to the final scores.
6. The processing method according to claim 4, wherein the preset scoring formula specifically comprises:
docBias = max{baseDocNum - otherDocNum, 0};
wordBias = max{baseWrodNum - otherWordNum, 0};
[Formula image FDA0002341450940000021 not reproduced: expression for biasScore in terms of docBias, wordBias and length]
wherein docBias represents the first difference, wordBias represents the second difference, baseDocNum represents the present-class document count, otherDocNum represents the contrast-class document count, baseWrodNum represents the present-class corpus word frequency, otherWordNum represents the contrast-class corpus word frequency, biasScore represents the comprehensive lead score, and length represents the word length.
7. The processing method according to claim 5, wherein the preset correction formula specifically comprises:
[Formula image FDA0002341450940000022 not reproduced: expression for score in terms of biasScore and length]
wherein score represents the final score, biasScore represents the comprehensive lead score, and length represents the word length.
8. A domain feature keyword processing apparatus, comprising:
a first obtaining module, configured to obtain a present-class text corpus and a contrast-class text corpus; the present-class text corpus is the text corpus to be processed, which contains domain feature keywords, and the contrast-class text corpus is a text corpus that does not contain the domain feature keywords;
a processing module, configured to process the present-class text corpus into a long word set;
a second obtaining module, configured to obtain the text features of each long word in the long word set; the text features characterize the number of occurrences, the frequency and/or the length of the long words in the present-class text corpus and the contrast-class text corpus;
and a determining module, configured to determine the domain feature keywords in the long word set by using the text features.
9. A device, comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call the program instructions in the memory to execute the method for processing domain feature keywords according to any one of claims 1 to 7.
10. A storage medium having stored thereon computer-executable instructions, which when loaded and executed by a processor, implement a method for processing domain feature keywords as claimed in any one of claims 1 to 7.
CN201911377806.1A 2019-12-27 2019-12-27 Method for processing domain feature keywords and related device Pending CN113051890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911377806.1A CN113051890A (en) 2019-12-27 2019-12-27 Method for processing domain feature keywords and related device

Publications (1)

Publication Number Publication Date
CN113051890A true CN113051890A (en) 2021-06-29

Family

ID=76506548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911377806.1A Pending CN113051890A (en) 2019-12-27 2019-12-27 Method for processing domain feature keywords and related device

Country Status (1)

Country Link
CN (1) CN113051890A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260359A (en) * 2015-10-16 2016-01-20 晶赞广告(上海)有限公司 Semantic keyword extraction method and apparatus
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device
US20190163690A1 (en) * 2016-11-10 2019-05-30 Tencent Technology (Shenzhen) Company Limited Keyword extraction method, apparatus and server
CN110362827A (en) * 2019-07-11 2019-10-22 腾讯科技(深圳)有限公司 A kind of keyword extracting method, device and storage medium
CN110489757A (en) * 2019-08-26 2019-11-22 北京邮电大学 A kind of keyword extracting method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RICARDO CAMPOS et al.: "A Text Feature Based Automatic Keyword Extraction Method for Single Documents", pages 684, Retrieved from the Internet <URL:https://doi.org/10.1007/978-3-319-76941-7_63> *
尤苡名: "Research on a TextRank-Based Keyword Extraction Method for Product Reviews", Software Guide (软件导刊), vol. 19, no. 04, pages 229 - 233 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination