CN113157929A - New word mining method and device, server and computer readable storage medium

Info

Publication number: CN113157929A
Application number: CN202011608362.0A
Authority: CN
Inventors: 聂镭; 齐凯杰; 聂颖
Original assignee: Longma Zhixin Zhuhai Hengqin Technology Co ltd
Current assignee: Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-07-23

Abstract

The embodiments of the present application are applicable to the technical field of data processing and provide a new word mining method, a device, a server, and a computer-readable storage medium. The method comprises the following steps: acquiring a text to be mined; determining keywords in the text to be mined; storing the keywords in a knowledge graph, wherein the knowledge graph comprises the keywords and the association relations among the keywords; determining a target phrase based on the association relations among the keywords; and screening out new words from the target phrase. Because the keywords in the text to be mined are stored in the knowledge graph and the target phrase is determined from the association relations among the keywords, new words can be screened out of the target phrase automatically, which avoids the manual new word mining still required in the prior art.

Description

New word mining method and device, server and computer readable storage medium

Technical Field

The present application belongs to the technical field of data processing, and in particular, to a new word mining method, apparatus, server, and computer-readable storage medium.

Background

In the field of Chinese word segmentation, a Chinese sentence can be segmented according to a preset Chinese vocabulary, but when a new word appears that is not in the preset vocabulary, the segmentation result is poor. In addition, unlike English, Chinese has no capitalization to mark proper nouns, so it is difficult for a computer to recognize personal names, place names, and even organization names, brand names, technical terms, abbreviations, internet neologisms, and other words whose naming follows no regular pattern. Automatically discovering new words has therefore become a key step.

Disclosure of Invention

In view of this, embodiments of the present application provide a method, an apparatus, a server, and a computer-readable storage medium for mining new words, so as to solve the problem in the prior art that new words still need to be mined manually.

A first aspect of an embodiment of the present application provides a new word mining method, including:

acquiring a text to be mined;

determining keywords in the text to be mined;

storing the keywords in a knowledge graph, wherein the knowledge graph comprises the keywords and the association relations among the keywords;

determining a target phrase based on the association relations among the keywords;

and screening out new words from the target phrase.

In a possible implementation manner of the first aspect, determining keywords in the text to be mined includes:

performing word segmentation processing on the text to be mined to obtain a keyword group;

and carrying out word segmentation processing on the keyword group to obtain keywords.

In a possible implementation manner of the first aspect, the association relation between keywords is the frequency with which the keywords occur together (their co-occurrence frequency);

determining a target phrase based on the association relations among the keywords includes:

screening out the keywords whose association relation has a frequency value greater than a frequency threshold;

and generating a target phrase from the keywords whose association relation has a frequency value greater than the frequency threshold.

In a possible implementation manner of the first aspect, screening out new words from the target phrase includes:

determining the new words according to the comprehensive association value corresponding to the target phrase.

In a possible implementation manner of the first aspect, the comprehensive association value is an internal solidification value;

determining the new words according to the comprehensive association value corresponding to the target phrase includes:

calculating an internal solidification value of the target phrase;

and determining the target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.

In a possible implementation manner of the first aspect, the comprehensive association value is an external association value;

calculating an external association value of the target phrase;

and determining the target phrase whose external association value is greater than the external association threshold as the new word.

In a possible implementation manner of the first aspect, the comprehensive association value comprises an internal solidification value and an external association value;

determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;

and determining the candidate new word whose external association value is greater than the external association threshold as the new word.

A second aspect of the embodiments of the present application provides a new word mining device, including:

the acquisition module is used for acquiring a text to be mined;

the determining module is used for determining keywords in the text to be mined;

the storage module is used for storing the keywords in a knowledge graph, wherein the knowledge graph comprises the keywords and the association relations among the keywords;

the association module is used for determining a target phrase based on the association relations among the keywords;

and the screening module is used for screening out new words from the target phrase.

In a possible implementation manner of the second aspect, the determining module includes:

the first processing unit is used for carrying out word segmentation processing on the text to be mined to obtain a keyword group;

and the second processing unit is used for carrying out word segmentation processing on the keyword group to obtain keywords.

In a possible implementation manner of the second aspect, the association relation between keywords is the frequency with which the keywords occur together;

the association module comprises:

the screening unit is used for screening out the keywords whose association relation has a frequency value greater than a frequency threshold;

and the generating unit is used for generating a target phrase from the keywords whose association relation has a frequency value greater than the frequency threshold.

In a possible implementation manner of the second aspect, the screening module includes:

the new word determining unit is used for determining the new words according to the comprehensive association value corresponding to the target phrase.

In a possible implementation manner of the second aspect, the comprehensive association value is an internal solidification value;

the new word determining unit includes:

the first calculating subunit is used for calculating the internal solidification value of the target phrase;

and the first determining unit is used for determining the target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.

In a possible implementation manner of the second aspect, the comprehensive association value is an external association value;

the new word determining unit includes:

the second calculating subunit is used for calculating the external association value of the target phrase;

and the second determining unit is used for determining the target phrase whose external association value is greater than the external association threshold as the new word.

In a possible implementation manner of the second aspect, the comprehensive association value comprises an internal solidification value and an external association value;

the new word determining unit includes:

the third determining unit is used for determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;

and the fourth determining unit is used for determining the candidate new word whose external association value is greater than the external association threshold as the new word.

A third aspect of an embodiment of the present application provides a server, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect as described above when executing the computer program.

A fourth aspect of an embodiment of the present application provides a computer-readable storage medium, including: the computer readable storage medium stores a computer program which, when executed by a processor, performs the steps of the method of the first aspect as described above.

Compared with the prior art, the embodiment of the application has the advantages that:

According to the embodiments of the present application, the keywords in the text to be mined are stored in the knowledge graph and the target phrase is determined based on the association relations among the keywords, so that new words are screened out of the target phrase automatically, which avoids the manual new word mining still required in the prior art.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flowchart of a new word mining method provided in an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a new word mining device provided in an embodiment of the present application;

Fig. 3 is a schematic diagram of a server provided in an embodiment of the present application;

Fig. 4 is a schematic diagram of the knowledge graph used in the new word mining method of Fig. 1 provided in an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.

Fig. 1 is a schematic flowchart of a new word mining method provided in an embodiment of the present application. The method is applied to a server, including but not limited to a computing device such as a cloud server, and includes the following steps:

Step S101: acquiring a text to be mined.

It can be understood that the text content of different fields differs greatly. To obtain the new words of a particular field, data from that field therefore needs to be collected, for example by web crawling or by compiling books.

Preferably, the acquired text to be mined may be preprocessed:

(1) removing designated useless symbols. For example, crawled text often contains many spaces or other useless symbols; if these symbols are kept, they split the text during word segmentation and degrade the segmentation result;

(2) removing emoticons from the text, since such symbols also affect the segmentation result;

(3) converting between traditional and simplified Chinese. If the text mixes traditional and simplified Chinese, processing is inconvenient, so the text is converted to one form according to actual requirements.
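The three preprocessing steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `preprocess`, the set of symbols treated as noise, the emoji ranges, and the three-entry traditional-to-simplified table are all assumptions made for the example and are not specified in the application.

```python
import re

# Hypothetical, deliberately tiny traditional-to-simplified table; a real
# system would rely on a complete conversion dictionary.
T2S_TABLE = str.maketrans({"門": "门", "愛": "爱", "東": "东"})

# Emoji and pictograph code point ranges (assumed, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0000FE0F]")

# Whitespace plus a few symbols assumed to be noise in crawled text.
NOISE_RE = re.compile(r"[\s#*_~^|<>]+")


def preprocess(text: str) -> str:
    """Remove emoticons and noise symbols, then convert to simplified Chinese."""
    text = EMOJI_RE.sub("", text)
    text = NOISE_RE.sub("", text)
    return text.translate(T2S_TABLE)


cleaned = preprocess("我 愛 北京😀 天安門 ")
```

Sentence punctuation is intentionally left in place here, because the later segmentation step uses it as a cutting point.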

Step S102: determining keywords in the text to be mined.

Illustratively, determining keywords in the text to be mined comprises:

Firstly, performing word segmentation processing on the text to be mined to obtain a keyword group.

In a specific application, the writing habits of Chinese text imply that punctuation marks are never placed in the middle of a phrase. For example, in "I love Beijing, I love candied haws." no punctuation mark is placed in the middle of "Beijing". The text is therefore cut at punctuation marks to obtain short sentences: "I love Beijing, I love candied haws." becomes the two short sentences "I love Beijing" and "I love candied haws".

And secondly, carrying out word segmentation processing on the keyword group to obtain the keywords.
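The punctuation-based cut described in the first step might look like the following sketch. The punctuation set, the function name `split_into_clauses`, and the Chinese rendering of the example sentence are assumptions for illustration.

```python
import re

# Punctuation marks treated as cutting points (Chinese and ASCII); the exact
# set is an assumption.
BOUNDARY_RE = re.compile(r"[，。！？；：、,.!?;:]+")


def split_into_clauses(text):
    """Cut text at punctuation marks and drop empty pieces."""
    return [part for part in BOUNDARY_RE.split(text) if part]


# Assumed Chinese form of "I love Beijing, I love candied haws."
clauses = split_into_clauses("我爱北京，我爱冰糖葫芦。")
```

Each resulting short sentence can then be broken into individual characters, which serve as the keywords stored in the graph.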

It can be understood that the present application stores the keywords in a knowledge graph database, such as the Neo4j database.

In a specific application, each short sentence is stored in the knowledge graph. The storage rule is that adjacency between characters is converted into association relations in the graph, and the number of times two characters occur together is the weight of the relation. See Fig. 4 for details: the number in each circle is the frequency with which that character occurs (normally stored in the database as an attribute of the entity, but written directly in the circle here for clarity), and the number on the connecting line between two entities (characters) is the frequency with which those two characters occur adjacent to each other.

Step S103: storing the keywords in the knowledge graph.

The knowledge graph comprises the keywords and the association relations among the keywords.
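To make the storage rule concrete, the sketch below uses a small in-memory structure as a stand-in for the graph database. The class `CharGraph` and its attribute names are hypothetical; a deployment along the lines of the application would keep the same counts as node attributes and relation weights in a graph database instead.

```python
from collections import Counter


class CharGraph:
    """In-memory stand-in for the knowledge graph (nodes are characters)."""

    def __init__(self):
        self.node_freq = Counter()  # character -> number of occurrences
        self.edge_freq = Counter()  # (left, right) -> adjacency count

    def add_clause(self, clause):
        """Store one short sentence: count characters and adjacent pairs."""
        self.node_freq.update(clause)
        self.edge_freq.update(zip(clause, clause[1:]))


graph = CharGraph()
for clause in ["我爱北京", "我爱冰糖葫芦"]:
    graph.add_clause(clause)
```

Pairs are only counted inside a short sentence, so the last character of one short sentence is never linked to the first character of the next.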

Step S104: determining a target phrase based on the association relations among the keywords.

The association relation between keywords is the frequency with which the keywords occur together.

Exemplarily, determining the target phrase based on the association relations among the keywords comprises:

Firstly, screening out the keywords whose association relation has a frequency value greater than a frequency threshold.

Secondly, generating a target phrase from the keywords whose association relation has a frequency value greater than the frequency threshold.

It can be understood that two keywords are more likely to form a new word when the frequency with which they occur together exceeds a certain threshold, whereas two keywords that never or rarely occur together will basically not form a new word, unless they are a fixed collocation, which is set aside in this discussion.
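The frequency screening of step S104 can be sketched as a filter over adjacency counts. The function name `candidate_phrases`, the dictionary layout, and the example counts are assumptions for illustration.

```python
def candidate_phrases(edge_freq, frequency_threshold):
    """Join adjacent keywords whose co-occurrence count exceeds the threshold."""
    return {
        left + right: count
        for (left, right), count in edge_freq.items()
        if count > frequency_threshold
    }


# Hypothetical adjacency counts, as they might be read from the graph.
example_edges = {("我", "爱"): 2, ("爱", "北"): 1, ("北", "京"): 1}
targets = candidate_phrases(example_edges, frequency_threshold=1)
```

With a threshold of 1, only the pair that occurs twice survives as a target phrase.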

Step S105: screening out new words from the target phrase.

Illustratively, screening out new words from the target phrase includes determining the new words according to the comprehensive association value corresponding to the target phrase.

Optionally, in the first embodiment, the comprehensive association value is an internal solidification value;

determining the new words according to the comprehensive association value corresponding to the target phrase comprises the following steps:

Firstly, calculating an internal solidification value of the target phrase.

Secondly, determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a new word.

In a specific application, the internal solidification degree of a phrase is defined in the present application as the ratio of the frequency of the phrase to the product of the frequencies of the keywords in the phrase, i.e. $S(xy) = \dfrac{p(xy)}{p(x)\,p(y)}$, where $S(xy)$ is the solidification degree of the phrase $xy$, $p(xy)$ is the frequency with which $xy$ occurs, and $p(x)$ and $p(y)$ are the frequencies of $x$ and $y$. All three frequencies can be queried from the knowledge graph.

It can be understood that, by setting the solidification threshold, the target phrases that satisfy the threshold can be obtained.
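A minimal sketch of the solidification calculation follows. The application does not say whether the frequencies are raw counts or relative frequencies; the sketch assumes relative frequencies obtained by dividing each count by a total character count, and the function name and example numbers are hypothetical.

```python
def internal_solidification(phrase_count, left_count, right_count, total):
    """Return p(xy) / (p(x) * p(y)) with p = count / total.

    Substituting the relative frequencies gives
    (phrase_count * total) / (left_count * right_count).
    """
    return (phrase_count * total) / (left_count * right_count)


# Hypothetical counts: the pair occurs twice, each character occurs twice,
# and the corpus contains ten characters in total.
score = internal_solidification(phrase_count=2, left_count=2, right_count=2, total=10)
is_candidate = score > 1.0  # 1.0 is an assumed threshold
```

A value well above 1 means the two keywords appear together far more often than they would if they were placed independently.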

Optionally, in the second embodiment, the comprehensive association value is an external association value;

Firstly, calculating an external association value of the target phrase.

Secondly, determining the target phrase whose external association value is greater than the external association threshold as a new word.

In a specific application, information entropy is used to measure how varied the sets of left-adjacent and right-adjacent characters of a phrase are. Take the sentence "eat grapes without spitting out the grape skins; don't eat grapes, yet spit out the grape skins" as an example. The word "grape" appears four times; its left-adjacent characters are {eat, spit, eat, spit} and its right-adjacent characters are {not, skin, yet, skin}. Using the information entropy formula with natural logarithms, the entropy of the left-adjacent characters of "grape" is $-\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} \approx 0.693$, and the entropy of its right-adjacent characters is $-\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{4}\log\tfrac{1}{4} \approx 1.04$. The right-adjacent characters of "grape" are therefore more varied in this sentence. The present application takes the smaller of the information entropies of the left-adjacent and right-adjacent character sets as the external association degree of the phrase.
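The entropy in question is the Shannon entropy $H = -\sum_i p_i \log p_i$ over the adjacent characters, where $p_i$ is the share of occurrences in which character $i$ is the neighbour. The sketch below reproduces the grape example; the function names are hypothetical, and the Chinese sentence is an assumed original of the translated example.

```python
import math
from collections import Counter


def neighbour_entropy(neighbours):
    """Shannon entropy (natural logarithm) of a list of adjacent characters."""
    counts = Counter(neighbours)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def adjacent_characters(text, phrase):
    """Collect the characters directly left and right of each occurrence."""
    left, right = [], []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        if start > 0:
            left.append(text[start - 1])
        if end < len(text):
            right.append(text[end])
        start = text.find(phrase, start + 1)
    return left, right


def external_association(text, phrase):
    """Smaller of the left and right neighbour entropies."""
    left, right = adjacent_characters(text, phrase)
    return min(neighbour_entropy(left), neighbour_entropy(right))


sentence = "吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮"
left, right = adjacent_characters(sentence, "葡萄")
```

Taking the minimum means a phrase scores well only if it is flexible on both sides, which is the property the next paragraph argues for.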

It can be understood that looking only at the inside of a phrase is not sufficient; its outside must be considered as well. For example, compare the strings rendered here as "ancestor" and "quilt". The usage of "ancestor" is common but relatively fixed: it appears in a few set expressions such as "ancestor", "semi-ancestor", and "descendant", and essentially no other characters are placed in front of it, so its usage is limited and "ancestor" is not a word on its own. By contrast, many different characters can be placed in front of "quilt", so its usage is flexible. Whether a phrase can appear flexibly in a variety of different contexts is therefore used here as a criterion for whether it can form a new word: the external association degree of each target phrase is calculated, and the phrase is judged to be a new word when its external association degree is greater than a threshold.

Optionally, in a third embodiment, the comprehensive association value comprises an internal solidification value and an external association value;

determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;

and determining the candidate new word whose external association value is greater than the external association threshold as the new word.

It can be understood that this embodiment combines the first embodiment and the second embodiment and screens the target phrases in two stages, so that the new words determined are more accurate.
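The two-stage screening of the third embodiment can be sketched as two successive filters. The function name, the input layout, the scores, and both thresholds are hypothetical values chosen only to show the control flow.

```python
def screen_new_words(scores, solidification_threshold, external_threshold):
    """scores maps phrase -> (internal solidification, external association)."""
    # Stage 1: keep phrases that are internally cohesive.
    candidates = {
        phrase: values
        for phrase, values in scores.items()
        if values[0] > solidification_threshold
    }
    # Stage 2: keep candidates that are also flexible in their context.
    return sorted(
        phrase
        for phrase, values in candidates.items()
        if values[1] > external_threshold
    )


# Hypothetical scores for three target phrases.
example_scores = {"葡萄": (5.0, 0.69), "萄皮": (4.0, 0.0), "吃葡": (0.5, 0.9)}
new_words = screen_new_words(example_scores, 1.0, 0.5)
```

In this example the second phrase passes the first stage but fails the second, and the third phrase is rejected in the first stage.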

Preferably, the values of the internal solidification threshold and the external association threshold can be continuously tuned through manual intervention, so as to improve the accuracy with which new words are identified.

In the embodiments of the present application, the keywords in the text to be mined are stored in the knowledge graph and the target phrase is determined based on the association relations among the keywords, so that new words are screened out of the target phrase automatically, which avoids the manual new word mining still required in the prior art.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

The following describes a new word mining device provided in an embodiment of the present application. The new word mining device and the new word mining method of the embodiment correspond to each other.

Fig. 2 is a schematic structural diagram of a new word mining device provided in an embodiment of the present application, where the device may be specifically integrated in a server, and the device may include:

the acquiring module 21 is used for acquiring a text to be mined;

a determining module 22, configured to determine a keyword in the text to be mined;

a storage module 23, configured to store the keywords to a knowledge graph, where the knowledge graph includes the keywords and an association relationship between the keywords;

the association module 24 is configured to determine a target phrase based on an association relationship between the keywords;

and the screening module 25 is configured to screen out new words in the target phrase.

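In one possible implementation, the determining module includes:

the first processing unit is used for performing word segmentation processing on the text to be mined to obtain a keyword group;

and the second processing unit is used for performing word segmentation processing on the keyword group to obtain keywords.

In one possible implementation, the association relation between keywords is the frequency with which the keywords occur together;

the association module comprises:

the screening unit is used for screening out the keywords whose association relation has a frequency value greater than a frequency threshold;

and the generating unit is used for generating a target phrase from the keywords whose association relation has a frequency value greater than the frequency threshold.

In one possible implementation, the screening module includes:

the new word determining unit is used for determining the new words according to the comprehensive association value corresponding to the target phrase.

In one possible implementation, the comprehensive association value is an internal solidification value;

the new word determining unit includes:

the first calculating subunit is used for calculating the internal solidification value of the target phrase;

and the first determining unit is used for determining the target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.

In one possible implementation, the comprehensive association value is an external association value;

the new word determining unit includes:

the second calculating subunit is used for calculating the external association value of the target phrase;

and the second determining unit is used for determining the target phrase whose external association value is greater than the external association threshold as the new word.

In one possible implementation, the comprehensive association value comprises an internal solidification value and an external association value;

the new word determining unit includes:

the third determining unit is used for determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;

and the fourth determining unit is used for determining the candidate new word whose external association value is greater than the external association threshold as the new word.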

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

Fig. 3 is a schematic diagram of a server 3 provided in an embodiment of the present application. As shown in fig. 3, the server 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in said memory 31 and executable on said processor 30. The steps in the various method embodiments described above are implemented when the computer program 32 is executed by the processor 30. Alternatively, the processor 30 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 32.

Illustratively, the computer program 32 may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 30 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the server 3.

The server 3 may be a computing device such as a cloud server. The server 3 may include, but is not limited to, a processor 30 and a memory 31. Those skilled in the art will appreciate that Fig. 3 is merely an example of the server 3 and does not limit it; the server 3 may include more or fewer components than those shown, may combine some components, or may use different components. For example, the server 3 may also include input/output devices, network access devices, buses, and the like.

The processor 30 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 31 may be an internal storage unit of the server 3, such as a hard disk or internal memory of the server 3. The memory 31 may also be an external storage device of the server 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like provided on the server 3. Further, the memory 31 may include both an internal storage unit and an external storage device of the server 3. The memory 31 is used for storing the computer program and other programs and data required by the server 3. The memory 31 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed server and method may be implemented in other ways. For example, the above-described server embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

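1. A new word mining method, characterized by comprising the following steps:

acquiring a text to be mined;

determining keywords in the text to be mined;

storing the keywords in a knowledge graph, wherein the knowledge graph comprises the keywords and the association relations among the keywords;

determining a target phrase based on the association relations among the keywords;

and screening out new words from the target phrase.

2. The new word mining method of claim 1, wherein determining keywords in the text to be mined comprises:

performing word segmentation processing on the text to be mined to obtain a keyword group;

and performing word segmentation processing on the keyword group to obtain the keywords.

3. The new word mining method of claim 1, wherein the association relation between keywords is the frequency with which the keywords occur together;

determining a target phrase based on the association relations among the keywords comprises:

screening out the keywords whose association relation has a frequency value greater than a frequency threshold;

and generating a target phrase from the keywords whose association relation has a frequency value greater than the frequency threshold.

4. The new word mining method of any one of claims 1-3, wherein screening out new words from the target phrase comprises:

determining the new words according to the comprehensive association value corresponding to the target phrase.

5. The new word mining method of claim 4, wherein the comprehensive association value is an internal solidification value;

determining the new words according to the comprehensive association value corresponding to the target phrase comprises:

calculating an internal solidification value of the target phrase;

and determining the target phrase whose internal solidification value is greater than the internal solidification threshold as the new word.

6. The new word mining method of claim 4, wherein the comprehensive association value is an external association value;

calculating an external association value of the target phrase;

and determining the target phrase whose external association value is greater than the external association threshold as the new word.

7. The new word mining method of claim 4, wherein the comprehensive association value comprises an internal solidification value and an external association value;

determining the target phrase whose internal solidification value is greater than the internal solidification threshold as a candidate new word;

and determining the candidate new word whose external association value is greater than the external association threshold as the new word.

8. A new word mining device, characterized in that the device comprises:

the acquisition module is used for acquiring a text to be mined;

the determining module is used for determining keywords in the text to be mined;

the storage module is used for storing the keywords in a knowledge graph, wherein the knowledge graph comprises the keywords and the association relations among the keywords;

the association module is used for determining a target phrase based on the association relations among the keywords;

and the screening module is used for screening out new words from the target phrase.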

9. A server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.