CN113157929A - New word mining method and device, server and computer readable storage medium - Google Patents


Publication number
CN113157929A
Authority
CN
China
Prior art keywords
keywords
value
determining
word
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011608362.0A
Other languages
Chinese (zh)
Inventor
聂镭
齐凯杰
聂颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Original Assignee
Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Longma Zhixin Zhuhai Hengqin Technology Co ltd filed Critical Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority to CN202011608362.0A priority Critical patent/CN113157929A/en
Publication of CN113157929A publication Critical patent/CN113157929A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The embodiments of the present application are applicable to the technical field of data processing and provide a new word mining method, a device, a server and a computer-readable storage medium. The method comprises the following steps: acquiring a text to be mined; determining keywords in the text to be mined; storing the keywords to a knowledge graph, wherein the knowledge graph comprises the keywords and the association relationships among the keywords; determining target phrases based on the association relationships among the keywords; and screening out new words from the target phrases. Because the keywords in the text to be mined are stored in the knowledge graph and the target phrases are determined from the association relationships among the keywords, new words can be screened out of the target phrases automatically, which avoids the manual mining of new words still required in the prior art.

Description

New word mining method and device, server and computer readable storage medium
Technical Field
The present application belongs to the technical field of data processing, and in particular, to a new word mining method, apparatus, server, and computer-readable storage medium.
Background
In the field of Chinese word segmentation, a Chinese sentence can be segmented according to a preset set of Chinese phrases, but when a new word appears that is not in the preset set, the segmentation result is poor. In addition, unlike English, Chinese has no capitalization to mark proper nouns, so it is difficult for a computer to recognize names of people and places, and even names of organizations, brand names, technical terms, abbreviations, new internet words and the like, whose naming follows no regular rules. Therefore, automatically discovering new words becomes a key link.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, a server, and a computer-readable storage medium for mining new words, so as to solve the problem in the prior art that new words still need to be mined manually.
A first aspect of an embodiment of the present application provides a new word mining method, including:
acquiring a text to be mined;
determining keywords in the text to be mined;
storing the keywords to a knowledge graph, wherein the knowledge graph comprises the keywords and the association relationships among the keywords;
determining a target phrase based on the association relationships among the keywords;
and screening out new words from the target phrases.
In a possible implementation manner of the first aspect, determining keywords in the text to be mined includes:
performing word segmentation processing on the text to be mined to obtain a keyword group;
and carrying out word segmentation processing on the keyword group to obtain keywords.
In a possible implementation manner of the first aspect, the association relationship between keywords is the frequency with which the keywords occur together;
determining a target phrase based on the association relationships among the keywords includes:
screening out the keywords whose association relationship has a frequency value greater than a frequency threshold;
and generating a target phrase from the screened-out keywords.
In a possible implementation manner of the first aspect, screening out new words from the target phrases includes:
determining the new words according to the comprehensive association value corresponding to each target phrase.
In a possible implementation manner of the first aspect, the comprehensive association value is an internal solidification value;
determining the new word according to the comprehensive association value corresponding to the target phrase includes:
calculating the internal solidification value of the target phrase;
and determining a target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.
In a possible implementation manner of the first aspect, the comprehensive association value is an external association value;
determining the new word according to the comprehensive association value corresponding to the target phrase includes:
calculating the external association value of the target phrase;
and determining a target phrase whose external association value is greater than the external association threshold as the new word.
In a possible implementation manner of the first aspect, the comprehensive association value comprises an internal solidification value and an external association value;
determining the new word according to the comprehensive association value corresponding to the target phrase includes:
determining a target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;
and determining a candidate new word whose external association value is greater than the external association threshold as the new word.
A second aspect of the embodiments of the present application provides a new word mining device, including:
the acquisition module is used for acquiring a text to be mined;
the determining module is used for determining keywords in the text to be mined;
the storage module is used for storing the keywords to a knowledge graph, wherein the knowledge graph comprises the keywords and the association relationships among the keywords;
the association module is used for determining a target phrase based on the association relationships among the keywords;
and the screening module is used for screening out new words from the target phrases.
In a possible implementation manner of the second aspect, the determining module includes:
the first processing unit is used for carrying out word segmentation processing on the text to be mined to obtain a keyword group;
and the second processing unit is used for carrying out word segmentation processing on the keyword group to obtain keywords.
In a possible implementation manner of the second aspect, the association relationship between keywords is the frequency with which the keywords occur together;
the association module comprises:
the screening unit is used for screening out the keywords whose association relationship has a frequency value greater than a frequency threshold;
and the generating unit is used for generating a target phrase from the screened-out keywords.
In a possible implementation manner of the second aspect, the screening module includes:
the new word determining unit is used for determining the new words according to the comprehensive association value corresponding to each target phrase.
In a possible implementation manner of the second aspect, the comprehensive association value is an internal solidification value;
the new word determining unit includes:
the first calculating subunit is used for calculating the internal solidification value of the target phrase;
and the first determining unit is used for determining a target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.
In a possible implementation manner of the second aspect, the comprehensive association value is an external association value;
the new word determining unit includes:
the second calculating subunit is used for calculating the external association value of the target phrase;
and the second determining unit is used for determining a target phrase whose external association value is greater than the external association threshold as the new word.
In a possible implementation manner of the second aspect, the comprehensive association value comprises an internal solidification value and an external association value;
the new word determining unit includes:
the third determining unit is used for determining a target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;
and the fourth determining unit is used for determining a candidate new word whose external association value is greater than the external association threshold as the new word.
A third aspect of an embodiment of the present application provides a server, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect as described above when executing the computer program.
A fourth aspect of an embodiment of the present application provides a computer-readable storage medium, including: the computer readable storage medium stores a computer program which, when executed by a processor, performs the steps of the method of the first aspect as described above.
Compared with the prior art, the embodiment of the application has the advantages that:
according to the present application, the keywords in the text to be mined are stored in the knowledge graph and the target phrases are determined based on the association relationships among the keywords, so that new words are screened out of the target phrases automatically, which avoids the manual mining of new words still required in the prior art.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a new word mining method provided in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a new word mining device provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of a server provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of the knowledge graph used in the new word mining method of Fig. 1 provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, which is a schematic flowchart of a new word mining method provided in an embodiment of the present application, the method is applied to a server, which includes, but is not limited to, a computing device such as a cloud server, and the method includes the following steps:
Step S101: acquire a text to be mined.
It can be understood that text content differs greatly between fields. Therefore, to obtain new words in a certain field, data in that field needs to be collected, for example by a web crawler or by compiling books.
Preferably, the acquired text to be mined may be preprocessed:
(1) removing designated useless symbols: for example, crawled text often contains many spaces or other useless symbols, and if these symbols are retained they split the text during word segmentation, which degrades the segmentation result;
(2) removing emoticons from the text, since emoticons can also affect the segmentation result;
(3) converting between traditional Chinese and simplified Chinese: a text that mixes traditional and simplified Chinese is inconvenient to process, so the text is converted to one form according to actual requirements.
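Preprocessing steps (1) and (2) above can be illustrated with the following minimal sketch. It is not taken from the application: the set of "useless" symbols, the rule used to recognise emoticons (Unicode category "So"), and the function names are all assumptions chosen for illustration. Step (3) is omitted because it requires a traditional/simplified mapping table or a third-party converter.

```python
import re
import unicodedata

# Assumption: whitespace and these ASCII symbols count as "useless symbols".
USELESS = re.compile(r"[\s#*@^~_|<>\[\]{}\\/]+")


def is_emoticon(ch: str) -> bool:
    # Assumption: emoticons are approximated by the Unicode category
    # "So" (Symbol, other) plus the joiner/variation characters used in
    # emoji sequences. This also drops symbols such as the copyright sign.
    return unicodedata.category(ch) == "So" or ch in "\u200d\ufe0f"


def preprocess(text: str) -> str:
    """Remove useless symbols and emoticons from the text to be mined."""
    text = USELESS.sub("", text)
    return "".join(ch for ch in text if not is_emoticon(ch))


if __name__ == "__main__":
    print(preprocess("我爱 北京😀，我爱\n冰糖葫芦。"))
```

Chinese punctuation is deliberately kept here, since the later sentence-cutting step relies on it.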
Step S102: determine keywords in the text to be mined.
Illustratively, determining keywords in the text to be mined comprises:
Firstly, word segmentation processing is performed on the text to be mined to obtain keyword groups.
In a specific application, according to the writing habits of Chinese text, punctuation marks are never placed in the middle of a phrase. For example, in "I love Beijing, I love candied haws.", no punctuation mark is added in the middle of "Beijing". Based on this characteristic, the text is cut at punctuation marks to obtain short sentences. For example, "I love Beijing, I love candied haws." becomes the two short sentences "I love Beijing" and "I love candied haws".
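The punctuation-based cutting described above might be sketched as follows. The list of punctuation marks and the function name are assumptions; the application does not enumerate which marks are used.

```python
import re

# Assumption: common Chinese and ASCII punctuation marks act as cut points.
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:]+")


def split_short_sentences(text: str) -> list[str]:
    """Cut the text at punctuation marks and drop empty pieces."""
    return [piece for piece in PUNCTUATION.split(text) if piece]


if __name__ == "__main__":
    print(split_short_sentences("我爱北京，我爱冰糖葫芦。"))
```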
Secondly, word segmentation processing is performed on the keyword groups to obtain the keywords.
It can be understood that the present application requires the keywords to be stored in a knowledge graph database, such as the neo4j database.
In a specific application, each short sentence is stored in the knowledge graph. The storage rule is that the adjacency relationships between characters are converted into association relationships in the graph, and the number of times two characters occur together is the weight of the relationship. See fig. 4 for details: the number in each circle represents the frequency of occurrence of the character, which is generally stored in the database as an attribute of the entity and is written directly in the circle here for clarity; the number on the connecting line between two entities (characters) represents the frequency with which the two characters occur adjacently.
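The storage rule above can be sketched without a graph database. The following example keeps the node attributes and relationship weights in two in-memory counters instead of neo4j; this substitution, the directed (left-to-right) edges, and the function name are assumptions made to keep the sketch self-contained.

```python
from collections import Counter


def build_graph(short_sentences):
    """Return (node_freq, edge_freq) for the given short sentences.

    node_freq[c]      -> how often character c occurs (the number in the circle)
    edge_freq[(a, b)] -> how often a is immediately followed by b (the edge weight)
    """
    node_freq = Counter()
    edge_freq = Counter()
    for sentence in short_sentences:
        node_freq.update(sentence)                       # count each character
        edge_freq.update(zip(sentence, sentence[1:]))    # count adjacent pairs
    return node_freq, edge_freq


if __name__ == "__main__":
    nodes, edges = build_graph(["我爱北京", "我爱冰糖葫芦"])
    print(nodes["我"], edges[("我", "爱")])
```

Because each short sentence is processed separately, no edge is created across a punctuation mark.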
Step S103: store the keywords to the knowledge graph.
The knowledge graph comprises the keywords and the association relationships among the keywords.
Step S104: determine a target phrase based on the association relationships among the keywords.
The association relationship between keywords is the frequency with which the keywords occur together.
Exemplarily, determining the target phrase based on the association relationships among the keywords comprises:
Firstly, screening out the keywords whose association relationship has a frequency value greater than a frequency threshold.
Secondly, generating a target phrase from the screened-out keywords.
It can be understood that two keywords are more likely to form a new word when the frequency with which they occur together exceeds a certain threshold, whereas two keywords that never or rarely occur together basically will not form a new word, unless they form a fixed collocation, which is not discussed here for the time being.
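The frequency screening of step S104 might look like the sketch below. It assumes the relationship weights are available as a mapping from adjacent character pairs to counts; the function name and the example threshold are hypothetical.

```python
def target_phrases(edge_freq, freq_threshold):
    """Join each character pair whose co-occurrence count exceeds the threshold."""
    return [left + right
            for (left, right), count in edge_freq.items()
            if count > freq_threshold]


if __name__ == "__main__":
    weights = {("葡", "萄"): 4, ("吃", "葡"): 2, ("萄", "不"): 1}
    print(target_phrases(weights, 1))
```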
Step S105: screen out new words from the target phrases.
Illustratively, screening out new words from the target phrases includes:
determining the new words according to the comprehensive association value corresponding to each target phrase.
Optionally, in the first embodiment, the comprehensive association value is an internal solidification value;
determining a new word according to the comprehensive association value corresponding to the target phrase comprises:
Firstly, calculating the internal solidification value of the target phrase.
Secondly, determining a target phrase whose internal solidification value is greater than the internal solidification threshold as a new word.
In a specific application, the internal solidification degree of a phrase is defined in the present application as the frequency of the phrase divided by the product of the frequencies of the individual characters in the phrase, i.e. $p_{\text{solid}}(xy) = \dfrac{p(xy)}{p(x)\,p(y)}$, where $p(xy)$ is the frequency with which x and y occur together and $p(x)$ and $p(y)$ are the frequencies of x and y; all three can be queried in the knowledge graph.
It can be understood that, depending on the setting of the internal solidification threshold, the target phrases meeting that threshold can be obtained.
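As a minimal sketch of the internal solidification calculation, the following example evaluates p(xy) / (p(x) · p(y)) for a two-character phrase. The application does not say whether the frequencies are raw counts or relative frequencies; this sketch assumes relative frequencies (counts divided by the respective totals), and the function name and sample counts are hypothetical.

```python
def internal_solidification(phrase, node_freq, edge_freq):
    """Compute p(xy) / (p(x) * p(y)) for a two-character phrase.

    Assumption: p(.) is a relative frequency, i.e. a count divided by the
    total number of characters (for x, y) or of adjacent pairs (for xy).
    """
    x, y = phrase
    total_chars = sum(node_freq.values())
    total_pairs = sum(edge_freq.values())
    p_xy = edge_freq[(x, y)] / total_pairs
    p_x = node_freq[x] / total_chars
    p_y = node_freq[y] / total_chars
    return p_xy / (p_x * p_y)


if __name__ == "__main__":
    chars = {"葡": 4, "萄": 4, "不": 2}
    pairs = {("葡", "萄"): 4, ("萄", "不"): 1}
    print(internal_solidification("葡萄", chars, pairs))
    print(internal_solidification("萄不", chars, pairs))
```

With these sample counts the phrase made of characters that almost always occur together scores higher than the incidental pair, which is the behaviour the threshold relies on.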
Optionally, in the second embodiment, the comprehensive association value is an external association value;
determining a new word according to the comprehensive association value corresponding to the target phrase comprises:
Firstly, calculating the external association value of the target phrase.
Secondly, determining a target phrase whose external association value is greater than the external association threshold as a new word.
In a specific application, information entropy is used to measure how varied the left-adjacent and right-adjacent character sets of a phrase are. Take the tongue twister "eat grapes without spitting out the grape skins; don't eat grapes but instead spit out the grape skins" as an example: the word "grape" appears four times, its left-adjacent characters are {eat, spit, eat, spit}, and its right-adjacent characters are {not, skin, instead, skin}. According to the information entropy formula $H = -\sum_i p_i \log p_i$ (natural logarithm), the information entropy of the left-adjacent characters of "grape" is $-(1/2)\log(1/2) - (1/2)\log(1/2) \approx 0.693$, and that of its right-adjacent characters is $-(1/2)\log(1/2) - (1/4)\log(1/4) - (1/4)\log(1/4) \approx 1.04$. It can be seen that the right-adjacent characters of "grape" are more varied in this sentence. The present application takes the minimum of the information entropies of the left-adjacent and right-adjacent character sets of a phrase as the external association degree of the phrase.
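The entropy calculation in the grape example can be reproduced with the sketch below. The Chinese sentence used here is assumed to be the original behind the translated example (it yields exactly the neighbour sets quoted above); the function names are hypothetical, and the text is scanned directly rather than queried from the knowledge graph.

```python
import math
from collections import Counter


def entropy(neighbors):
    """Information entropy (natural log) of a list of neighbouring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def neighbor_entropies(phrase, short_sentences):
    """Return (left_entropy, right_entropy) of the phrase's adjacent characters."""
    left, right = [], []
    for sentence in short_sentences:
        start = sentence.find(phrase)
        while start != -1:
            end = start + len(phrase)
            if start > 0:
                left.append(sentence[start - 1])
            if end < len(sentence):
                right.append(sentence[end])
            start = sentence.find(phrase, start + 1)
    return entropy(left), entropy(right)


def external_association(phrase, short_sentences):
    """External association degree: the smaller of the two entropies."""
    return min(neighbor_entropies(phrase, short_sentences))


if __name__ == "__main__":
    sentences = ["吃葡萄不吐葡萄皮", "不吃葡萄倒吐葡萄皮"]
    left_h, right_h = neighbor_entropies("葡萄", sentences)
    print(round(left_h, 3), round(right_h, 2))
    print(external_association("葡萄", sentences))
```

Taking the minimum means a phrase is penalised if either side is rigid, even when the other side is highly varied.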
It can be understood that considering only the association inside a phrase is not sufficient; the outside of the phrase must also be considered. For example, compare the phrases "ancestor" and "quilt". The former is used in common but relatively fixed expressions, such as "ancestor", "semi-ancestor" and "descendant"; basically no other characters are added in front of it, so its usage is relatively limited and it is not a word on its own. The latter can be preceded by a wide variety of characters, so its usage is relatively flexible. Whether a phrase can appear flexibly in a variety of different contexts is therefore used herein as a criterion for whether it can form a new word: the external association degree of each target phrase is calculated, and the phrase is judged to be a new word when its external association degree is greater than the threshold.
Optionally, in the third embodiment, the comprehensive association value comprises an internal solidification value and an external association value;
determining a new word according to the comprehensive association value corresponding to the target phrase comprises:
determining a target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;
and determining a candidate new word whose external association value is greater than the external association threshold as the new word.
It can be understood that this embodiment combines the first embodiment and the second embodiment and screens the target phrases in two stages, so that the determined new words are more accurate.
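The two-stage screening of the third embodiment can be sketched as follows. The sketch assumes both values have already been computed per phrase and are passed in as dictionaries; the function name, the sample values, and the thresholds are hypothetical.

```python
def screen_new_words(phrases, solidification, association,
                     solid_threshold, assoc_threshold):
    """Keep phrases that pass the internal test, then the external test."""
    # Stage 1: internal solidification value above its threshold.
    candidates = [p for p in phrases if solidification[p] > solid_threshold]
    # Stage 2: external association value above its threshold.
    return [p for p in candidates if association[p] > assoc_threshold]


if __name__ == "__main__":
    phrases = ["葡萄", "吃葡", "萄皮"]
    solid = {"葡萄": 5.0, "吃葡": 2.0, "萄皮": 4.5}
    assoc = {"葡萄": 0.69, "吃葡": 0.9, "萄皮": 0.0}
    print(screen_new_words(phrases, solid, assoc, 3.0, 0.5))
```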
Preferably, the values of the internal solidification threshold and the external association threshold can be continuously tuned through manual intervention, so as to improve the accuracy of new word determination.
In the embodiment of the application, the keywords in the text to be mined are stored in the knowledge graph and the target phrases are determined based on the association relationships among the keywords, so that new words are screened out of the target phrases automatically, which avoids the manual mining of new words still required in the prior art.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The following describes a new word mining device provided in an embodiment of the present application. The new word mining device and the new word mining method of the embodiment correspond to each other.
Fig. 2 is a schematic structural diagram of a new word mining device provided in an embodiment of the present application, where the device may be specifically integrated in a server, and the device may include:
the acquiring module 21 is used for acquiring a text to be mined;
a determining module 22, configured to determine a keyword in the text to be mined;
a storage module 23, configured to store the keywords to a knowledge graph, where the knowledge graph includes the keywords and an association relationship between the keywords;
the association module 24 is configured to determine a target phrase based on an association relationship between the keywords;
and the screening module 25 is configured to screen out new words in the target phrase.
In one possible implementation, the determining module includes:
the first processing unit is used for carrying out word segmentation processing on the text to be mined to obtain a keyword group;
and the second processing unit is used for carrying out word segmentation processing on the keyword group to obtain keywords.
In a possible implementation manner, the association relationship between keywords is the frequency with which the keywords occur together;
the association module comprises:
the screening unit is used for screening out the keywords whose association relationship has a frequency value greater than a frequency threshold;
and the generating unit is used for generating a target phrase from the screened-out keywords.
In one possible implementation, the screening module includes:
the new word determining unit is used for determining the new words according to the comprehensive association value corresponding to each target phrase.
In one possible implementation, the comprehensive association value is an internal solidification value;
the new word determining unit includes:
the first calculating subunit is used for calculating the internal solidification value of the target phrase;
and the first determining unit is used for determining a target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.
In one possible implementation, the comprehensive association value is an external association value;
the new word determining unit includes:
the second calculating subunit is used for calculating the external association value of the target phrase;
and the second determining unit is used for determining a target phrase whose external association value is greater than the external association threshold as the new word.
In one possible implementation, the comprehensive association value comprises an internal solidification value and an external association value;
the new word determining unit includes:
the third determining unit is used for determining a target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;
and the fourth determining unit is used for determining a candidate new word whose external association value is greater than the external association threshold as the new word.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
Fig. 3 is a schematic diagram of a server 3 provided in an embodiment of the present application. As shown in fig. 3, the server 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in said memory 31 and executable on said processor 30. The steps in the various method embodiments described above are implemented when the computer program 32 is executed by the processor 30. Alternatively, the processor 30 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 32.
Illustratively, the computer program 32 may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 30 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the server 3.
The server 3 may be a computing device such as a cloud server. The server 3 may include, but is not limited to, a processor 30 and a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the server 3 and does not limit it; the server 3 may include more or fewer components than those shown, may combine some components, or may use different components. For example, the server 3 may also include input/output devices, network access devices, buses, etc.
The processor 30 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor or the like.
The memory 31 may be an internal storage unit of the server 3, such as a hard disk or a memory of the server 3. The memory 31 may also be an external storage device of the server 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card) and the like provided on the server 3. Further, the memory 31 may include both an internal storage unit and an external storage device of the server 3. The memory 31 is used for storing the computer program and other programs and data required by the server 3. The memory 31 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, each embodiment is described with its own emphasis. For parts that are not described or illustrated in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed server and method may be implemented in other ways. For example, the server embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above may be realized by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A new word mining method, characterized by comprising the following steps:
acquiring a text to be mined;
determining keywords in the text to be mined;
storing the keywords to a knowledge graph, wherein the knowledge graph comprises the keywords and the association relationships among the keywords;
determining a target phrase based on the association relationship between the keywords;
and screening out new words in the target phrase.
2. The new word mining method according to claim 1, wherein determining keywords in the text to be mined comprises:
performing word segmentation processing on the text to be mined to obtain a keyword group;
and carrying out word segmentation processing on the keyword group to obtain keywords.
3. The new word mining method according to claim 1, wherein the association relationship between the keywords is a frequency value with which the keywords occur simultaneously;
determining a target phrase based on the association relationship between the keywords comprises:
screening out the keywords whose association relationship has a frequency value greater than a frequency threshold;
and generating the target phrase from the keywords whose association relationship has a frequency value greater than the frequency threshold.
4. The new word mining method according to any one of claims 1 to 3, wherein screening out new words in the target phrase comprises:
determining the new word according to the comprehensive association value corresponding to the target phrase.
5. The new word mining method according to claim 4, wherein the comprehensive association value is an internal solidification value;
determining the new word according to the comprehensive association value corresponding to the target phrase comprises:
calculating the internal solidification value of the target phrase;
and determining the target phrase with the internal solidification value larger than an internal solidification threshold as the new word.
6. The new word mining method according to claim 4, wherein the comprehensive association value is an external association value;
determining the new word according to the comprehensive association value corresponding to the target phrase comprises:
calculating the external association value of the target phrase;
and determining the target phrase with the external association value larger than an external association threshold as the new word.
7. The new word mining method according to claim 4, wherein the comprehensive association value comprises an internal solidification value and an external association value;
determining the new word according to the comprehensive association value corresponding to the target phrase comprises:
determining a target phrase with an internal solidification value larger than an internal solidification threshold as a candidate new word;
and determining the candidate new word with the external association value larger than an external association threshold as the new word.
8. A new word mining device, characterized in that the device comprises:
an acquisition module, used for acquiring a text to be mined;
a determining module, used for determining keywords in the text to be mined;
a storage module, used for storing the keywords to a knowledge graph, wherein the knowledge graph comprises the keywords and the association relationships among the keywords;
an association module, used for determining a target phrase based on the association relationship between the keywords;
and a screening module, used for screening out the new words in the target phrase.
9. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
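
For illustration only, the flow of claims 1, 3 and 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the claims do not say how the values are computed, so the sketch assumes that "occurring simultaneously" means adjacency in the segmented text, uses pointwise mutual information as a stand-in for the internal solidification value, and uses the smaller of the left and right neighbour entropies as a stand-in for the external association value. The knowledge graph is reduced to two counters, and all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def build_graph(sentences):
    """Toy keyword graph: node counts and adjacent-pair (edge) counts."""
    word_counts, edge_counts = Counter(), Counter()
    for words in sentences:
        word_counts.update(words)
        edge_counts.update(zip(words, words[1:]))
    return word_counts, edge_counts


def internal_solidification(pair, word_counts, edge_counts):
    """Assumed stand-in: pointwise mutual information of the keyword pair."""
    p_pair = edge_counts[pair] / sum(edge_counts.values())
    total_words = sum(word_counts.values())
    p_a = word_counts[pair[0]] / total_words
    p_b = word_counts[pair[1]] / total_words
    return math.log2(p_pair / (p_a * p_b))


def entropy(counter):
    """Shannon entropy (in bits) of a counter of neighbouring keywords."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def external_association(pair, sentences):
    """Assumed stand-in: min of left and right neighbour entropy of the pair."""
    left, right = Counter(), Counter()
    for words in sentences:
        for i in range(len(words) - 1):
            if (words[i], words[i + 1]) == pair:
                if i > 0:
                    left[words[i - 1]] += 1
                if i + 2 < len(words):
                    right[words[i + 2]] += 1
    return min(entropy(left), entropy(right))


def mine_new_words(sentences, freq_threshold, solid_threshold, assoc_threshold):
    """Frequency filter, then internal filter, then external filter."""
    word_counts, edge_counts = build_graph(sentences)
    # Claim 3: keep pairs whose co-occurrence frequency exceeds the threshold.
    targets = [p for p, n in edge_counts.items() if n > freq_threshold]
    # Claim 7, step 1: candidates by internal solidification value.
    candidates = [
        p for p in targets
        if internal_solidification(p, word_counts, edge_counts) > solid_threshold
    ]
    # Claim 7, step 2: new words by external association value.
    return [
        "".join(p) for p in candidates
        if external_association(p, sentences) > assoc_threshold
    ]


# Already-segmented sample text (hypothetical data).
sentences = [
    ["we", "like", "deep", "learning", "models"],
    ["they", "use", "deep", "learning", "daily"],
    ["deep", "learning", "is", "fun"],
    ["i", "study", "deep", "learning", "now"],
]
new_words = mine_new_words(sentences, freq_threshold=2,
                           solid_threshold=1.0, assoc_threshold=1.0)
```

With this sample data, only the pair "deep" + "learning" co-occurs more than twice, and it passes both the internal and the external threshold, so it is the only phrase returned. Raising any of the three thresholds far enough empties the result, which mirrors the successive screening described in the claims.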

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011608362.0A CN113157929A (en) 2020-12-30 2020-12-30 New word mining method and device, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113157929A (en) 2021-07-23

Family

ID=76878111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011608362.0A Pending CN113157929A (en) 2020-12-30 2020-12-30 New word mining method and device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113157929A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210723